Software Architect at Akbank
Real User
Provides fast aggregations, AI libraries, and a lot of connectors
Pros and Cons
  • "AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
  • "Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."

What is our primary use case?

We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase. 

Spark is mainly used for aggregations and AI (for future usage). It gathers data from Couchbase and does the calculations. We are not actively using Spark's AI libraries at this time, but we are going to use them.

This project is for classifying transactions and finding suspicious activities, especially those that come from internet channels such as internet banking and mobile banking. It tries to identify suspicious activities and executes rules developed by our business team. An example of a rule is: if the transaction count or transaction amount is greater than 10 million Turkish Liras and the user's device is new, then raise an exception. The system sends an SMS to the user, and the user can choose whether to continue with the transaction.
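
As a rough illustration only, a rule like the one described might be expressed on a Spark DataFrame along the following lines; the column names, input path, and session setup are assumptions for the sketch, not details from the project:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("fraud-rules").getOrCreate()

    // Hypothetical transactions table with an amount column (Turkish Liras)
    // and a flag marking transactions from a device not seen before.
    val transactions = spark.read.parquet("/data/transactions")

    // Flag transactions over 10 million TL that come from a new device.
    val flagged = transactions.withColumn(
      "raiseException",
      col("amount") > 10000000L && col("isNewDevice"))

    flagged.filter(col("raiseException")).show()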

How has it helped my organization?

Aggregations have been very fast in our project since we started to use Spark. We can deliver results in around 300 milliseconds. Before using Spark, the time was around 700 milliseconds.

Before using Spark, we only used Couchbase. We needed fast results for this project because transactions come from various channels, and we need to decide on and resolve them as quickly as possible while users are performing the transaction. If our process takes longer, users might stop or cancel their transactions, which means losing money. Therefore, a fast response time is very important for us.

What is most valuable?

AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect to a lot of points for AI, and you have to get data from those systems. Spark's range of connectors is very wide. With a Spark cluster, you can get fast results, especially for AI.
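
To make the connector-plus-MLlib workflow concrete, here is a minimal sketch; the connector format string, column names, and label column are illustrative assumptions rather than details of the project described above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.classification.LogisticRegression

    val spark = SparkSession.builder.appName("connector-to-ml").getOrCreate()

    // Pull data in through a connector; the format string depends on the
    // connector you deploy ("couchbase.query" is only an example).
    val events = spark.read.format("couchbase.query").load()

    // Assemble assumed numeric columns into a feature vector and train a
    // simple model; isFraud is assumed to be a 0/1 label column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("amount", "txnCount"))
      .setOutputCol("features")

    val model = new LogisticRegression()
      .setLabelCol("isFraud")
      .fit(assembler.transform(events))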

What needs improvement?

Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.


For how long have I used the solution?

I am a Java developer. I have been interested in Spark for around five years. We have been actively using it in our organization for almost a year.

What do I think about the stability of the solution?

It is the most stable platform. Compared to Flink, Spark is good, especially in terms of clusters and architecture. My colleagues who set up these clusters say that Spark is the easiest.

What do I think about the scalability of the solution?

It is scalable, but we don't have the need to scale it. 

It is mainly used for reporting big data in our organization. All teams, especially the VR team, are using Spark for job execution and remote execution. I can say that 70% of users use Spark for reporting, calculations, and real-time operations. We are a very big company, and we have around a thousand people in IT.

We will continue its usage and develop more. We have kind of just started using it. We finished this project just three months ago. We are now trying to find out bottlenecks in our systems, and then we are ready to go.

How are customer service and support?

We have not used Apache support. We have only used Cloudera support for this project, and they helped us a lot during the development cycle of this project. 

How was the initial setup?

I don't have any idea about it. We are a big company, and we have another group for setting up Spark.

What other advice do I have?

I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult. 

If you are clear about your needs, it is easier to set it up. Once you know how Spark will be used in your project, you have to define firewall rules and cluster requirements. When you set up Spark, it should be ready for people to use, especially for remote job execution.

I would rate Apache Spark a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Director at Nihil Solutions
Real User
Stable and easy to set up with a very good memory processing engine
Pros and Cons
  • "The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
  • "The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."

What is our primary use case?

When we receive data from the messaging queue, we process everything using Apache Spark. Databricks does the processing and writes everything back to files in the data lake. The machine learning program then does some analysis using an ML prediction algorithm.
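
A minimal sketch of that flow, assuming the messaging queue is Kafka and the lake files are written as Parquet (the review does not name either); broker, topic, and paths are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("queue-to-lake").getOrCreate()

    // Read the incoming stream; requires the spark-sql-kafka-0-10 package
    // on the classpath. Broker address and topic are placeholders.
    val incoming = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload")

    // Land the processed records in the data lake for the downstream ML job.
    incoming.writeStream
      .format("parquet")
      .option("path", "/datalake/events")
      .option("checkpointLocation", "/datalake/_checkpoints/events")
      .start()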

What is most valuable?

The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly.

What needs improvement?

There are lots of items coming down the pipeline in the future. I don't know what features are missing. From my point of view, everything looks good.

The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate.

There should be more information shared with the user. The solution already has all the information tracked in the cluster. It just needs to be accessible or searchable.

For how long have I used the solution?

I started using the solution about four years ago. However, it's been on and off since then. I would estimate in total I have about a year and a half of experience using the solution.

What do I think about the stability of the solution?

The stability of the solution is very, very good. It doesn't crash or have glitches. It's quite reliable for us.

What do I think about the scalability of the solution?

The scalability of the solution is very good. If a company has to expand it, they can do so.

Right now, we have about six or seven users that are directly on the product. We're encouraging them to use more data. We do plan to increase usage in the future.

How are customer service and technical support?

I'm a developer, so I don't interact directly with technical support. I can't speak to the quality of their service as I've never directly dealt with them.

Which solution did I use previously and why did I switch?

We did previously use a lot of different mechanisms, however, we needed something that was good at processing data for analytical purposes, and this solution fit the bill. It's a very powerful tool. I haven't seen other tools that could do precisely what this one does.

How was the initial setup?

The initial setup isn't too complex. It's quite straightforward.

We use CI/CD DevOps for deployment. We only use Spark for processing and for the Databricks cluster to spin up and do the job. It's continuously running in the background.

There isn't really any maintenance required per se. We just click the button and it comes up automatically, with the whole cluster and the Spark and everything ready to go.

What's my experience with pricing, setup cost, and licensing?

I'm unsure as to how much the licensing is for the solution. It's not an aspect of the product I deal with directly.

What other advice do I have?

We're customers and also partners with Apache.

While we are on version 2.6, we are considering upgrading to version 3.0.

I'd rate the solution nine out of ten. It works very well for us and suits our purposes almost perfectly.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at FORMCEPT Technologies
Real User
Top 10
Enables us to process data from different data sources
Pros and Cons
  • "We use Spark to process data from different data sources."
  • "In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."

What is our primary use case?

Our primary use case is interactively processing large volumes of data.

What is most valuable?

We use Spark to process data from different data sources. 

What needs improvement?

In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that.

For how long have I used the solution?

I have been using Apache Spark for eight to nine years. 

What do I think about the stability of the solution?

It is a stable solution. The solution is ten out of ten on stability. 

What do I think about the scalability of the solution?

The solution is highly scalable. All of the technical guys use Spark. Our product is used by many people within our customers' company.

How was the initial setup?

The initial setup is straightforward. 

What's my experience with pricing, setup cost, and licensing?

The solution is moderately priced. 

What other advice do I have?

I rate the overall solution a ten out of ten. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
reviewer1283880 - PeerSpot reviewer
CEO International Business at a tech services company with 1,001-5,000 employees
MSP
Top 5
A powerful open-source framework for fast, flexible, and versatile big data processing, with a steep learning curve and resource demands
Pros and Cons
  • "The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations."
  • "It requires overcoming a significant learning curve due to its robust and feature-rich nature."

What is our primary use case?

In AI deployment, a key step is aggregating data from various sources, such as customer websites, debt records, and asset information. Apache Spark plays a vital role in this process, efficiently handling continuous streams of data. Its capability enables seamless gathering and feeding of diverse data into the system, facilitating effective processing and analysis for generating alerts and insights, particularly in scenarios like banking. 
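
As a hedged illustration of that kind of continuous aggregation, the sketch below uses Spark Structured Streaming with the built-in rate source standing in for the real customer feeds; the derived columns and the alert threshold are invented for the example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("stream-alerts").getOrCreate()

    // The built-in "rate" source stands in for the real feeds (websites, debt
    // records, asset data); it just generates (timestamp, value) rows.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .withColumn("customerId", (col("value") % 100).cast("string"))
      .withColumn("amount", col("value").cast("double"))

    // Windowed per-customer totals; rows crossing a threshold could drive alerts.
    val totals = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes"), col("customerId"))
      .agg(sum("amount").alias("totalAmount"))
      .filter(col("totalAmount") > 1000)

    totals.writeStream.outputMode("update").format("console").start()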

What is most valuable?

The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations.

What needs improvement?

It requires overcoming a significant learning curve due to its robust and feature-rich nature.

For how long have I used the solution?

We have been using it for two years now.

What do I think about the stability of the solution?

It provides excellent stability. We never faced any issues with it.

What do I think about the scalability of the solution?

It ensures outstanding scalability capabilities.

Which solution did I use previously and why did I switch?

Opting for Apache Spark, an open-source solution, provides a distinct advantage by offering control over the code. This means you can identify issues, make necessary fixes, and determine what aspects to accept as they are. In contrast, dealing with a vendor may limit control, requiring you to submit requests and advocate for changes based on your business volume with them. This dependency on volume can potentially compromise control. To safeguard both your customers and your business, the choice of an open-source solution like Apache Spark allows for more autonomy and control over the technology stack.

What about the implementation team?

The system's smooth operation relies on deploying a comprehensive container with Kubernetes clusters, configured with essential toolsets. Instrumentation data from the backend is fed back to a central framework equipped with specific tools for driving various processes. In a case involving a customer with Red Hat and Postini clusters, the OpenShift Container Platform, comprising Kubernetes clusters, is used. The tools manage onboarding, infrastructure provisioning, certificate management, authorization control, etc. The deployment spans multiple independent data centers, like telecom circles in India, requiring unique approaches for various tasks, including disaster recovery planning and central alerting, facilitated through SaaS. The deployment process typically takes approximately forty to forty-five days for six thousand servers.

What was our ROI?

It provides a dual advantage by saving both time and money while enhancing performance, particularly by leveraging my skill sets. 

What's my experience with pricing, setup cost, and licensing?

It is an open-source solution, so it is free of charge.

What other advice do I have?

I would give it a rating of seven out of ten, which, by my standards, is quite high.

Which deployment model are you using for this solution?

Private Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Salvatore Campana - PeerSpot reviewer
CEO & Founder at Xautomata
Real User
Top 5
Reduces startup time and gives excellent ROI
Pros and Cons
  • "Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term."
  • "The initial setup was not easy."

What is our primary use case?

I use Spark to run automation processes driven by data.

How has it helped my organization?

Apache Spark helped us with horizontal scalability and cost optimizations.

What is most valuable?

The most valuable feature is the grid computing.

What needs improvement?

An area for improvement is that when we start the solution and declare the maximum number of nodes, the process is shared, which is a problem in some cases. It would be useful to be able to change this parameter in real-time rather than having to stop the solution and restart with a higher number of nodes.
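
For context, the ceiling in question is typically declared up front through dynamic allocation settings; below is a minimal sketch of how those values are fixed at application startup (the numbers are placeholders, not a recommendation):

    import org.apache.spark.sql.SparkSession

    // The executor ceiling is declared once, when the application starts; changing
    // it later currently means restarting with a new value, which is the limitation
    // described above. (On some cluster managers, dynamic allocation also needs the
    // external shuffle service or shuffle tracking enabled.)
    val spark = SparkSession.builder
      .appName("fixed-ceiling-example")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20") // placeholder values
      .getOrCreate()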

For how long have I used the solution?

I've been using Spark for around four years.

How was the initial setup?

The initial setup was not easy, but we created a means of asking the user about their needs, making the setup much easier. We can now deploy the platform in thirty minutes using the public cloud or Kubernetes space.

What was our ROI?

Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term.

What's my experience with pricing, setup cost, and licensing?

Spark is an open-source solution, so there are no licensing costs. 

What other advice do I have?

I would rate Apache Spark eight out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
reviewer1185906 - PeerSpot reviewer
Manager - Data Science Competency at a tech services company with 201-500 employees
Consultant
Fast performance, cost-effective, and runs in a cloud-agnostic environment
Pros and Cons
  • "One of the key features is that Apache Spark is a distributed computing framework. You can help multiple slaves and distribute the workload between them."
  • "When you are working with large, complex tasks, the garbage collection process is slow and affects performance."

What is our primary use case?

My main task is working on predictive analytics, and Apache Spark is one of the tools that I utilize in this role. Primarily, we work with the predictive analysis of very large amounts of data.

Apache Spark is also helpful for data pre-processing, including data cleaning.

This solution is cloud-agnostic. You can use it with an EC2 instance and you can even install it on-premises. Some environments have it installed in VMs.

What is most valuable?

One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them.

Another feature is memory-based computing. This is unlike Hadoop, which relies on storage. As it uses in-memory data processing, Spark is very fast.
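
A small sketch of what that in-memory behavior looks like in practice; the data here is synthetic and only meant to show cache() keeping an intermediate result in executor memory across actions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("in-memory-example").getOrCreate()
    import spark.implicits._

    // The work is split across the worker nodes; cache() keeps the filtered
    // dataset in executor memory so the second action skips recomputation.
    val raw = spark.range(0, 100000000L).toDF("id") // synthetic stand-in for a large source
    val cleaned = raw.filter($"id" % 7 === 0).cache()

    println(cleaned.count())                         // first action materializes the cache
    println(cleaned.agg(Map("id" -> "max")).first()) // reuses the in-memory copy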

What needs improvement?

When you are working with large, complex tasks, the garbage collection process is slow and affects performance. This is an area where they need to improve because your job may fail if it is stuck for a long time while memory garbage collection is happening. This is the main problem that we have.
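
One common mitigation (not a guaranteed fix) is to tune the executor JVMs, for example switching to G1GC and giving the executors more headroom; the values below are placeholders to be adjusted per workload:

    import org.apache.spark.sql.SparkSession

    // Possible mitigation for long GC pauses on big, complex jobs: run the
    // executor JVMs on G1GC and size executor memory more generously.
    val spark = SparkSession.builder
      .appName("gc-tuning-example")
      .config("spark.executor.memory", "8g")
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")
      .getOrCreate()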

For how long have I used the solution?

I have been working with Apache Spark for the past four years.

What do I think about the stability of the solution?

This product is pretty stable. Companies like Facebook, Uber, and Netflix are all using Apache Spark. It's stable enough to be used all over the world.

What do I think about the scalability of the solution?

In our team that works on this, we have approximately 10 people.

How are customer service and support?

There is no official support for this solution. Because it's open-source and there is no cost involved, there is nobody to contact for support. Our own internal team of experts, who work on different problems, both supports the platform and contributes to it.

Which solution did I use previously and why did I switch?

I work on several open-source frameworks, including Python, Scikit-learn, TensorFlow, PyTorch, H2O.ai, and R. We don't endorse proprietary tools, so we aren't working with them.

How was the initial setup?

With respect to the initial setup, it's neither easy nor very difficult. Our team has experience so it is not difficult for them. However, for a person that is new to using it, the setup might be very difficult.

What about the implementation team?

We have a team of experts in my company, and they handle it very well.

What's my experience with pricing, setup cost, and licensing?

This is an open-source tool, so it can be used free of charge. There is no cost involved.

What other advice do I have?

We are not using the current version of this platform, Spark 3. However, we do know that it is used in the market and it has new features. We will eventually move to it.

My advice for anybody who wants to use Apache Spark is that they have two options. The first is to go with Databricks, the creators of Apache Spark, and use their proprietary version. If you choose this option, then you will have to pay for the product.

If instead, you use Apache Spark, then you can rely on your own expert in-house team for support, maintenance, and deployment. In this option, you don't have to pay anything to anybody outside of your company.

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Onur Tokat - PeerSpot reviewer
Big Data Engineer Consultant at Collective[i]
Consultant
Scala-based solution with good data evaluation functions and distribution
Pros and Cons
  • "Spark can handle small to huge data and is suitable for any size of company."
  • "Spark could be improved by adding support for other open-source storage layers than Delta Lake."

What is our primary use case?

I mainly use Spark to prepare data for processing because it has APIs for data evaluation. 

What is most valuable?

The most valuable feature is that Spark uses Scala, which has good data evaluation functions. Spark also supports good distribution on the clusters and provides optimization on the APIs.
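
A tiny example of the Scala-side evaluation referred to here, using a typed Dataset; the case class and data are invented for illustration, and Spark still optimizes and distributes the actual work across the cluster:

    import org.apache.spark.sql.SparkSession

    case class Reading(sensor: String, value: Double)

    val spark = SparkSession.builder.appName("scala-dataset-example").getOrCreate()
    import spark.implicits._

    // Typed Dataset: plain Scala case classes and lambdas express the evaluation,
    // while the engine plans and distributes the computation.
    val readings = Seq(Reading("a", 1.5), Reading("a", 2.5), Reading("b", 0.5)).toDS()

    val avgBySensor = readings
      .groupByKey(_.sensor)
      .mapGroups { (sensor, rows) =>
        val values = rows.map(_.value).toSeq
        (sensor, values.sum / values.size)
      }

    avgBySensor.show()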

What needs improvement?

Spark could be improved by adding support for other open-source storage layers than Delta Lake. The UI could also be enhanced to give more data on resource management.

For how long have I used the solution?

I've been using Spark for six years.

What do I think about the stability of the solution?

Generally, Spark works correctly without any errors. It may give out some errors if your data changes, but in that case, it's a problem with the configuration, not with Spark.

What do I think about the scalability of the solution?

The cloud version of Spark is very easy to scale.

How was the initial setup?

The initial setup is not complex, but it depends on which components are in the architecture. For example, if you use Hadoop, setup may not be easy. Deployment takes about a week, but the Spark cluster itself can be installed on the virtual infrastructure in a day.

What other advice do I have?

Spark can handle anything from small to huge data volumes and is suitable for companies of any size. I would rate Spark as eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
reviewer1535340 - PeerSpot reviewer
Senior Solutions Architect at a retailer with 10,001+ employees
Real User
A unified analytics engine with a valuable parallel processing feature
Pros and Cons
  • "I like that it can handle multiple tasks parallelly. I also like the automation feature. JavaScript also helps with the parallel streaming of the library."
  • "The logging for the observability platform could be better."

What is our primary use case?

We use Apache Spark to prepare data for transformation and encryption, depending on the columns. We use AES-256 encryption. We're building a proof of concept at the moment. We prepare patches on Spark for Kubernetes on-premise and Google Cloud Platform.
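
As a sketch of how column-level AES encryption can be wired into a Spark transformation; the key handling, cipher mode, and column names below are assumptions for illustration, not details of the actual project:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}
    import javax.crypto.Cipher
    import javax.crypto.spec.SecretKeySpec
    import java.util.Base64

    val spark = SparkSession.builder.appName("column-encryption-example").getOrCreate()
    import spark.implicits._

    // A 32-byte key gives AES-256; in a real pipeline the key comes from a KMS,
    // and an authenticated mode (e.g. GCM) with per-record IVs would be preferred.
    val key = "0123456789abcdef0123456789abcdef".getBytes("UTF-8")

    val encrypt = udf { plaintext: String =>
      val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding") // mode chosen only for brevity
      cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"))
      Base64.getEncoder.encodeToString(cipher.doFinal(plaintext.getBytes("UTF-8")))
    }

    // Encrypt only the sensitive column as part of the transformation step.
    val customers = Seq(("1", "4111111111111111"), ("2", "5500000000000004"))
      .toDF("id", "cardNumber")

    customers.withColumn("cardNumber", encrypt(col("cardNumber"))).show(false)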

What is most valuable?

I like that it can handle multiple tasks in parallel. I also like the automation feature. JavaScript also helps with the parallel streaming of the library.

What needs improvement?

The logging for the observability platform could be better.

For how long have I used the solution?

I have known about this technology for a long time, maybe about three years.

Which solution did I use previously and why did I switch?

Because my area is data analytics and analytics solutions, I use BigQuery, SQL, and ETL. I also use Dataproc and DataFlow.

What about the implementation team?

We use an integrator sometimes, but recently we put together a team to support the infrastructural requirements. This is because the proof of concept is self-administered.

What other advice do I have?

I would recommend Apache Spark to new users, but it depends on the use case. Sometimes, it's not the best solution.

On a scale from one to ten, I would give Apache Spark a ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.