Snr Security Engineer at a tech vendor with 201-500 employees
Provides security analytics and has good scalability
Pros and Cons
- "The scalability has been the most valuable aspect of the solution."
- "The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive."
What is our primary use case?
We primarily use the solution for security analytics.
What is most valuable?
The scalability has been the most valuable aspect of the solution.
What needs improvement?
The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive.
For how long have I used the solution?
I've been using the solution for three years.
Buyer's Guide
Apache Spark
August 2025

Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: August 2025.
865,164 professionals have used our research since 2012.
What do I think about the stability of the solution?
The 2.3 version is quite stable. All of our customers use it. There are more than 100,000 users, and it runs 24/7.
What do I think about the scalability of the solution?
The scalability is very good.
How are customer service and support?
You actually buy Cloudera along with it. You don't really get any support unless you pay for it.
Which solution did I use previously and why did I switch?
In previous companies, we used the MySQL platform and solutions like ArcSight and Splunk. We switched for scalability. MySQL wasn't going to scale, and we don't use Splunk at this company.
How was the initial setup?
The initial setup was complex. It is a complex tool, and a lot depends on how you will use it. There is a lot to set up, and you need to write a lot of scripts for it. There are nearly 60 to set up. In the cloud, setup takes about a day. On-premises, on hardware, it takes about a week.
What other advice do I have?
I would rate this solution eight out of 10.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Portfolio Manager, Enterprise Solutions Architect at Capgemini
Supports streaming and micro-batch
What is our primary use case?
Streaming telematics data.
How has it helped my organization?
It's a better MapReduce (MR), supports streaming and micro-batch, and supports Spark ML and Spark SQL.
What is most valuable?
It supports streaming and micro-batch.
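For readers unfamiliar with the term, micro-batch processing groups a continuous stream of records into small, time-bounded batches and processes each batch as a unit. The sketch below is plain Python and does not use Spark's API; the `micro_batches` helper, the sample telematics readings, and the 5-second interval are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def micro_batches(events, interval_s):
    """Group (timestamp, value) events into consecutive fixed-width time windows.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    batches = defaultdict(list)
    for ts, value in events:
        # Align each event to the start of the window it falls into.
        window_start = (ts // interval_s) * interval_s
        batches[window_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical telematics readings: (seconds since start, speed in km/h).
readings = [(0, 52), (2, 55), (5, 60), (7, 58), (11, 40)]

# Process each micro-batch as a unit, here by averaging the speeds in it.
for start, speeds in micro_batches(readings, 5):
    print(start, sum(speeds) / len(speeds))
```

A real streaming engine would also handle late data, state, and failure recovery, which this sketch leaves out.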
What needs improvement?
Better data lineage support.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Director - Data Management, Governance and Quality at Hilton Worldwide
Powerful language but complicated coding
What is our primary use case?
Ingesting billions of rows of data all day.
How has it helped my organization?
Spark on AWS is not very cost-effective: memory is expensive and you cannot customize the hardware. If you want more memory, you have to pay for more CPUs as well.
What is most valuable?
Powerful language.
What needs improvement?
Writing efficient programs requires complicated coding. It is like going back to the '80s.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Works at a computer software company with 51-200 employees
Features include machine learning, real time streaming, and data processing. It doesn't enable Spark job scheduling with monitoring capability.
Pros and Cons
- "Features include machine learning, real time streaming, and data processing."
- "The fault tolerant feature is provided."
- "It provides a scalable machine learning library."
- "It should support more programming languages."
- "Needs to provide an internal scheduler to schedule Spark jobs with monitoring capability."
What is our primary use case?
Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.
How has it helped my organization?
It provides a scalable machine learning library so that we can train and predict user behavior for promotion purposes.
What is most valuable?
Machine learning, real time streaming, and data processing are fantastic, as well as the resilient or fault tolerant feature.
What needs improvement?
I would suggest that it support more programming languages and provide an internal scheduler to schedule Spark jobs with monitoring capability.
For how long have I used the solution?
Trial/evaluations only.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Manager | Data Science Enthusiast | Management Consultant at a consultancy with 5,001-10,000 employees
We can now harness richer data sets and benefit from use cases
Pros and Cons
- "With Hadoop-related technologies, we can distribute the workload across multiple commodity machines."
- "Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing."
How has it helped my organization?
Organisations can now harness richer data sets and benefit from use cases, which add value to their business functions.
What is most valuable?
Distributed in-memory processing. Some of the algorithms are resource heavy, and executing them requires a lot of RAM and CPU. With Hadoop-related technologies, we can distribute the workload across multiple commodity machines.
What needs improvement?
Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing.
For how long have I used the solution?
Three to five years.
What do I think about the stability of the solution?
When users do not know how to use Spark and request a lot of resources, the underlying JVMs can crash, which is a big source of worry.
What do I think about the scalability of the solution?
No issues.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Big Data and Cloud Solution Consultant at a financial services firm with 10,001+ employees
Provides flexibility for application creation with less coding effort
Pros and Cons
- "DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort."
- "Dynamic DataFrame options are not yet available."
What is most valuable?
DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort.
How has it helped my organization?
We developed a tool for data ingestion from HDFS -> Raw -> L1 layer that runs data quality checks, loads data into Elasticsearch, and performs change data capture (CDC).
What needs improvement?
Dynamic DataFrame options are not yet available.
For how long have I used the solution?
One and a half years.
What do I think about the stability of the solution?
No stability issues.
What do I think about the scalability of the solution?
No scalability issues.
What other advice do I have?
Spark gives the flexibility for developing custom applications.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Sr. Software Engineer at a tech vendor with 1-10 employees
Helped us reduce 3TB Google Ngrams in hours instead of days
Pros and Cons
- "The most valuable features are fault tolerance and easy integration with other workloads such as machine learning and graph analytics."
- "More ML based algorithms should be added to it, to make it algorithmic-rich for developers."
What is most valuable?
The most valuable features are fault tolerance and easy integration with other workloads such as machine learning and graph analytics. The community is growing, so running ML in a distributed fashion works quite well.
How has it helped my organization?
Previously we were using Hadoop MapReduce to reduce the Google Ngrams (3TB), which took us approximately five days on our cluster. After using Spark, we were able to accomplish this task within hours.
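The reviewer does not describe the job itself. As a rough illustration of what reducing n-gram data can involve, the plain-Python sketch below sums counts per n-gram across records, which is the reduce-by-key pattern that both MapReduce and Spark distribute across a cluster. The record layout, the `reduce_ngram_counts` helper, and the sample values are hypothetical, and nothing here uses Spark's API.

```python
from collections import Counter


def reduce_ngram_counts(records):
    """Sum match counts per n-gram across (ngram, year, count) records.

    Hypothetical helper for illustration; the record layout is an assumption.
    """
    totals = Counter()
    for ngram, _year, count in records:
        # Key by n-gram and accumulate, discarding the per-year breakdown.
        totals[ngram] += count
    return dict(totals)


# Hypothetical sample records, not real Google Ngrams data.
records = [
    ("big data", 2008, 120),
    ("big data", 2009, 340),
    ("data lake", 2009, 15),
]

print(reduce_ngram_counts(records))
```

On a cluster, each worker would run this aggregation on its own partition of the data and the partial totals would then be merged by key.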
What needs improvement?
This product is already improving, as the community is developing it rapidly. More ML-based algorithms should be added to make it algorithm-rich for developers.
For how long have I used the solution?
Two and a half years.
What do I think about the stability of the solution?
No, I did not encounter any problems with the stability. It is also quite backwards compatible.
What do I think about the scalability of the solution?
No, I have not had any scalability issues so far. It is quite scalable. Using simple scripts, you can add as many workers as you want.
What other advice do I have?
This is a very good product for big data analytics, and it integrates well with other components such as machine learning and graph analytics.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Architect at a healthcare company with 51-200 employees
Having everything in the same framework has helped us out a lot
Pros and Cons
- "ETL and streaming capabilities."
- "Stability in terms of API (things were difficult when transitioning from RDD to DataFrames, then to Datasets)."
What is most valuable?
ETL and streaming capabilities.
How has it helped my organization?
It made big data processing more convenient. A uniform framework also makes usage more efficient, since the same framework can be used for both batch and stream processing.
What needs improvement?
Stability in terms of API (things were difficult when transitioning from RDD to DataFrames, then to Datasets).
For how long have I used the solution?
I have used Spark since its inception in March 2015, from Spark 1.1 onwards.
Currently, I use 2.2 extensively.
What do I think about the stability of the solution?
Yes, occasionally with different APIs.
What do I think about the scalability of the solution?
No scalability issues.
How are customer service and technical support?
Since we were using the open source version of Apache Spark, without Databricks support, we never used technical support from Databricks.
Which solution did I use previously and why did I switch?
Yes, we used Hive, Pig, and Storm. Having everything in the same framework has helped us out a lot.
Which other solutions did I evaluate?
Yes, we considered other products in the big data ecosystem.
What other advice do I have?
Go for it.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.