Data Engineer at BBD
A reliable and scalable open-source framework for big data processing that excels in speed, fault tolerance, and support for various data sources
Pros and Cons
- "It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."
- "One limitation is that not all machine learning libraries and models support it."
What is our primary use case?
We use it for data engineering and analytics to process and examine extensive datasets.
What is most valuable?
It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained.
What needs improvement?
One limitation is that not all machine learning libraries and models support it. While libraries like Scikit-learn may work with some Spark-compatible models, not all machine-learning tools are compatible with Spark. In such cases, you may need to extract data from Spark and train your models on smaller datasets instead of directly using Spark for training.
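As a rough illustration of the workaround described above (shrinking a large dataset to a sample that a single-machine library can train on), here is a minimal sketch in plain Python. It uses reservoir sampling over any iterable of rows; the function name `reservoir_sample` and the toy data are hypothetical and not part of Spark. In PySpark, the analogous step would typically be `DataFrame.sample(...)` followed by `toPandas()`.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of at most k rows from a stream.

    Only k rows are held in memory at any time, so the input can be
    far larger than what fits on one machine.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Toy stand-in for a large dataset: one million row ids.
small = reservoir_sample(range(1_000_000), k=1_000, seed=42)
print(len(small))
```

The resulting list is small enough to hand to a library that expects in-memory data.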
For how long have I used the solution?
I have been using it for four years.
Buyer's Guide
Apache Spark
August 2025

Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: August 2025.
864,574 professionals have used our research since 2012.
What do I think about the stability of the solution?
I have not encountered any significant stability issues and it has proven to be a robust and reliable platform without major crashes. However, there have been instances where I needed to address query optimization and similar tasks to ensure optimal performance. I would rate it nine out of ten.
How are customer service and support?
To rate my overall experience, I would give it an eight out of ten, leaving room for potential improvements in terms of technical support.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
We used Pandas data frames and SQL-type queries for smaller datasets, but we haven't worked with anything on the scale of Spark SQL.
How was the initial setup?
I haven't handled the deployment process, but setting it up on the cloud seems relatively straightforward.
What about the implementation team?
Setting it up on-premises might take longer, potentially a couple of days. However, when deploying it on the cloud, the process can be significantly quicker, possibly taking only a few hours.
What's my experience with pricing, setup cost, and licensing?
The cloud model can be expensive because it requires substantial resources, while an on-premises implementation has to cover hardware, memory, and licensing. Managing costs in a cloud environment can be challenging due to the cumulative expenses associated with running and maintaining Spark. Licensing costs may not be the primary concern, but operational costs in the cloud can add up. For on-premises deployments, maintenance costs include cluster management, job optimization, and upgrades. In the cloud, maintenance costs are relatively lower, especially with managed database clusters, but they still exist and primarily revolve around cluster upkeep.
Which other solutions did I evaluate?
We evaluated Microsoft Synapse, which offers similar analytics functionality but not quite at the same scale as Apache Spark. While some tasks can be accomplished with Synapse, there are certain capabilities, such as micro-batching and scalability, where Spark excels and remains unmatched.
What other advice do I have?
Additional skill requirements are crucial to use the solution and its related features effectively. Training costs and efforts may be necessary to ensure individuals are proficient in using these technologies. Overall, I would rate it nine out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Information Technology Business Analyst at an aerospace/defense firm with 10,001+ employees
A highly scalable and affordable tool that can be used to gather information from different systems
Pros and Cons
- "The product is useful for analytics."
- "The product could improve the user interface and make it easier for new users."
What is most valuable?
We use it as an ETL tool to gather information from different systems. The product is useful for analytics.
What needs improvement?
The product could improve the user interface and make it easier for new users. It has a steep learning curve.
For how long have I used the solution?
I have been using the product for approximately three to four years. Currently, I am using the latest version.
What do I think about the stability of the solution?
The tool is stable. I rate the stability a ten out of ten.
What do I think about the scalability of the solution?
The tool is very scalable. I rate the scalability a ten out of ten. Approximately 30 users are using Apache Spark in our organization.
How are customer service and support?
We are using the free version of the product. So, we are not using any support.
How would you rate customer service and support?
Positive
How was the initial setup?
The basic installation is easy. However, we are working in the security business and need a very secure installation. It has been quite difficult. I rate the basic installation a ten out of ten. I rate the ease of setup a two or three out of ten for a more secure installation with all the security features. The solution is deployed on-premises in our organization. The deployment process requires a couple of weeks.
What's my experience with pricing, setup cost, and licensing?
We are using the free version of the solution.
What other advice do I have?
I would recommend the product. I think it's a good solution for analytics. Overall, I rate the product an eight out of ten.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Analyst at Deloitte
Processes a larger volume of data efficiently and integrates with different platforms
Pros and Cons
- "The product’s most valuable features are lazy evaluation and workload distribution."
- "They could improve the issues related to programming language for the platform."
What is our primary use case?
We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.
What is most valuable?
The product’s most valuable features are lazy evaluation and workload distribution.
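To make the idea of lazy evaluation concrete, here is a minimal sketch in plain Python. It is an analogy built on generators, not Spark's API: the helper names `lazy_map` and `lazy_filter` and the `log` list are hypothetical. The point it illustrates is the same one Spark relies on: describing transformations does no work, and computation only happens when a result is actually requested.

```python
from typing import Callable, Iterable, Iterator, List

log: List[str] = []  # records when work is actually performed


def lazy_map(fn: Callable[[int], int], rows: Iterable[int]) -> Iterator[int]:
    """Describe a per-row transformation; runs only when consumed."""
    for row in rows:
        log.append(f"map {row}")
        yield fn(row)


def lazy_filter(pred: Callable[[int], bool], rows: Iterable[int]) -> Iterator[int]:
    """Describe a filter; runs only when consumed."""
    for row in rows:
        if pred(row):
            yield row


# Building the pipeline performs no computation yet.
pipeline = lazy_filter(lambda x: x > 15, lazy_map(lambda x: x * 10, range(5)))
work_before_action = len(log)

# Materializing the result plays the role of an "action" and triggers the work.
result = list(pipeline)

print(work_before_action, result, len(log))
```

Because the whole chain is known before anything runs, an engine that works this way can plan and optimize the pipeline as a unit before executing it.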
What needs improvement?
They could improve the issues related to programming language for the platform.
For how long have I used the solution?
We have been using Apache Spark for around two and a half years.
What do I think about the stability of the solution?
The platform’s stability depends on how effectively we write the code. We encountered a few issues related to programming languages.
What do I think about the scalability of the solution?
We have more than 100 Apache Spark users in our organization.
Which solution did I use previously and why did I switch?
Before choosing Apache Spark for processing big data, we evaluated another option, Hadoop. However, Spark emerged as the superior choice.
How was the initial setup?
The initial setup complexity depends on whether it's on the cloud or on-premise. For cloud deployments, especially using platforms like Databricks, the process is straightforward and can be configured with ease. However, if the deployment is on-premise, the setup tends to be more time-consuming, although not overly complex.
What's my experience with pricing, setup cost, and licensing?
They provide an open-source license for the on-premise version. However, we have to pay for the cloud version including data centers and virtual machines.
What other advice do I have?
Apache Spark is a good product for processing large volumes of data compared to other distributed systems. It provides efficient integration with Hadoop and other platforms.
I rate it a ten out of ten.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Vice President at Goldman Sachs at a computer software company with 10,001+ employees
Stable product with a valuable SQL tool
Pros and Cons
- "The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
- "At the initial stage, the product provides no container logs to check the activity."
What is our primary use case?
We use the product for extensive data analysis. It helps us analyze a huge amount of data and transfer it to data scientists in our organization.
What is most valuable?
The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it. It is a useful feature for us.
What needs improvement?
At the initial stage, the product provides no container logs to check the activity. It remains inactive for a long time without giving us any information. The containers could start more quickly, similar to Jupyter Notebook.
For how long have I used the solution?
We have been using Apache Spark for eight months to one year.
What do I think about the stability of the solution?
It is a stable product. I rate its stability an eight out of ten.
What do I think about the scalability of the solution?
We have 45 Apache Spark users. I rate its scalability a nine out of ten.
How was the initial setup?
The complexity of the initial setup depends on the kind of environment an organization is working with. It requires one executive for deployment. I rate the process an eight out of ten.
What's my experience with pricing, setup cost, and licensing?
The product is expensive, considering the setup. However, from a standalone perspective, it is inexpensive.
What other advice do I have?
I advise others to analyze their data and understand their business requirements before purchasing the product. I rate it an eight out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Lead Data Scientist at International School of Engineering
A flexible solution that can be used for storage and processing
Pros and Cons
- "The most valuable feature of Apache Spark is its flexibility."
- "Apache Spark's GUI and scalability could be improved."
What is our primary use case?
We use Apache Spark for storage and processing.
What is most valuable?
The most valuable feature of Apache Spark is its flexibility.
What needs improvement?
Apache Spark's GUI and scalability could be improved.
For how long have I used the solution?
I have been using Apache Spark for four to five years.
What do I think about the scalability of the solution?
Around 15 data scientists are using Apache Spark in our organization.
How was the initial setup?
Apache Spark's initial setup is slightly complex compared to other solutions. Data scientists could install our previous tools with minimal supervision, whereas Apache Spark requires some IT support. Apache Spark's installation is a time-consuming process because it requires ensuring that all the ports have been accessed properly following certain guidelines.
What about the implementation team?
While installing Apache Spark, I must look at the documentation and be very specific about the configuration settings. Only then will I be able to install it.
What's my experience with pricing, setup cost, and licensing?
Apache Spark is an expensive solution.
What other advice do I have?
I would recommend Apache Spark to other users.
Overall, I rate Apache Spark an eight out of ten.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Data Engineer at Berief Food GmbH
A useful and easy-to-deploy product that has an excellent data processing framework
Pros and Cons
- "The data processing framework is good."
- "The solution must improve its performance."
What is our primary use case?
Our customers configure their software applications, and I use Apache to check them. We use it for data processing.
What is most valuable?
The data processing framework is good. The product is very useful.
What needs improvement?
The solution must improve its performance.
For how long have I used the solution?
I have been using the solution for four to five years.
What do I think about the stability of the solution?
The tool is stable. I rate the stability more than nine out of ten.
What do I think about the scalability of the solution?
We have a small business. Around four people in my organization use the solution.
How was the initial setup?
The deployment was easy.
What about the implementation team?
The solution was deployed with the help of third-party consultants.
What other advice do I have?
Overall, I rate the product more than eight out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Partner / Head of Data & Analytics at Intelligence Software Consulting
Great for machine learning applications; good documentation available
Pros and Cons
- "Provides a lot of good documentation compared to other solutions."
- "The migration of data between different versions could be improved."
What is our primary use case?
We use Spark for machine learning applications, clustering, and segmentation of customers.
What is most valuable?
Apache provides a lot of good documentation compared to other solutions.
What needs improvement?
The migration of data between different versions could be improved.
For how long have I used the solution?
I've been using this solution for four years.
What do I think about the stability of the solution?
The solution is stable.
What do I think about the scalability of the solution?
The solution is scalable.
How are customer service and support?
If you pay for customer support, you get a quick and efficient response; otherwise, the community support offers good help.
How was the initial setup?
The initial setup has been simplified over the past few years and is now relatively straightforward.
What's my experience with pricing, setup cost, and licensing?
Licensing costs depend on where you source the solution.
What other advice do I have?
This is a good solution for big data use cases and I rate it eight out of 10.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Chief Data-strategist and Director at Theworkshop.es
Scalable, open-source, and great for transforming data
Pros and Cons
- "The solution has been very stable."
- "It's not easy to install."
What is our primary use case?
You can do a lot of things in terms of the transformation of data. You can store and transform and stream data. It's very useful and has many use cases.
What is most valuable?
Overall, it's a very nice tool.
It is great for transforming data and doing micro-streaming or micro-batching.
The product offers an open-source version.
The solution has been very stable.
The scalability is good.
Apache Spark is a huge tool. It has many use cases and is very flexible. You can use it with so many other platforms.
Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java.
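For readers unfamiliar with the micro-batching mentioned above, here is a minimal sketch of the idea in plain Python. It is not Spark's streaming API: the function name `micro_batches` is hypothetical, and batches are cut by size here for simplicity, whereas a streaming engine would typically cut them on a time trigger. The sketch shows a continuous stream being processed as a sequence of small, bounded batches.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of events into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Toy stream of seven events, processed three at a time.
totals = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
print(totals)
```

Each batch can be handled with ordinary batch logic, which is why the same transformation code can often serve both batch and streaming jobs.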
What needs improvement?
If you are developing projects and need to put them into a production scenario, you will need a cluster of servers, as it requires distributed computing.
It's not easy to install. You are typically dealing with a big data system.
It's not a simple, straightforward architecture.
For how long have I used the solution?
I've been using the solution for three years.
What do I think about the stability of the solution?
The stability is very good. There are no bugs or glitches and it doesn't crash or freeze. It's a reliable solution.
What do I think about the scalability of the solution?
We have found the scalability to be good. If your company needs to expand it, it can do so.
We have five people working on the solution currently.
How are customer service and technical support?
There isn't really technical support for open source. You need to do your own studying. There are lots of places to find information. You can find details online, or in books, et cetera. There are even courses you can take that can help you understand Spark.
Which solution did I use previously and why did I switch?
I also use Databricks, which I use in the cloud.
How was the initial setup?
When handling big data systems, the installation is a bit difficult. When you need to deploy the systems, it's better to use services like Databricks.
I am not a professional admin. I am a developer, and I design architecture.
You can use it on a standalone system; however, that's not the best way. It would be okay for little branch codes, not for production.
What's my experience with pricing, setup cost, and licensing?
We use the open-source version. It is free to use. However, you do need to have servers. We have three or four. They can be on-premises or in the cloud.
What other advice do I have?
I have the solution installed on my computer and on our servers. You can use it on-premises or as a SaaS.
I'd rate the solution at a nine out of ten. I've been very pleased with its capabilities.
I would recommend the solution for the people who need to deploy projects with streaming. If you have many different sources or different types of data, and you need to put everything in the same place - like a data lake - Spark, at this moment, has the right tools. It's an important solution for data science, for data detectors. You can put all of the information in one place with Spark.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
