Apache Spark Reviews

Name: Apache Spark
Brand: Apache
Rating: 4.2 (69 reviews)

90% willing to recommend


What is Apache Spark?

Apache Spark is a leading open-source processing tool known for scalability and speed in managing large datasets. It supports both real-time and batch processing and is widely used for building data pipelines, machine learning applications, and analytics.


Apache Spark is the #1 ranked solution in top Hadoop solutions, #2 ranked solution in top Java Frameworks, and #6 ranked solution in top Compute Service solutions. PeerSpot users give Apache Spark an average rating of 8.4 out of 10. Apache Spark is most commonly compared to Spot by Flexera: Apache Spark vs Spot by Flexera. Apache Spark is popular among the large enterprise segment, which accounts for 50% of users researching this solution on PeerSpot. The top industry researching this solution is financial services, accounting for 20% of all views.


Featured Apache Spark reviews

Devindra Weerasooriya

Data Architect at Devtech

The in-memory computation feature is certainly helpful for my processing tasks. It is helpful because, when structures can be held in memory rather than stored during the computation, I choose the in-memory option. There are limitations to holding data in memory that need to be addressed, but I still prefer in-memory computation. The solution is beneficial in that it provides a stable, long-held understanding of the framework that does not vary from day to day, which is very helpful in my prototyping activity as an architect trying to assess Apache Spark, Great Expectations, and Vault-based solutions versus those proposed by clients like TIBCO or Informatica.


Michael Lierheimer

Consultant, Chief Engineer, Teamleiter at infoteam Software AG

I find that I lack the technical depth to make any recommendations for future updates of Apache Spark. I used it for two years for our prototype work and for testing, but because I had no final project released and running at a customer site, I cannot say what I would expect if I wanted to use it in a real project. Regarding the current licensing cost, I would say it is in the medium range. However, because I do not have a licensed project for our customer, I do not know if it would be too high for our customers if they have to buy the license for themselves. For me, compared to other things, the licensing was acceptable.


Omar Khaled

Data Engineer at a tech company with 10,001+ employees

I can improve the organization's functions by taking less time to make decisions. To make the right decision, you need the right data, and a solution can provide this by hiring talent and employees who can consolidate data from different sources and organize it. Not all solutions can make this data fast enough to be used, except for solutions such as Apache Spark Structured Streaming. To make the right decision, you should have both accurate and fast data. Apache Spark itself is similar to the Python programming language. Python is a language with many libraries for mathematics and machine learning. Apache Spark is the solution, and within it, you have PySpark, which is the API for Apache Spark to write and run Python code. Within it, there are many APIs, including SQL APIs, allowing you to write SQL code within a Python function in Apache Spark. You can also use Apache Spark Structured Streaming and machine learning APIs.


Apache Spark mindshare


As of July 2026, the mindshare of Apache Spark in the Hadoop category stands at 14.1%, down from 18.4% compared to the previous year, according to calculations based on PeerSpot user engagement data.

Hadoop Mindshare Distribution
Product	Mindshare (%)
Apache Spark	14.1%
Cloudera Distribution for Hadoop	14.4%
HPE Data Fabric	10.2%
Other	61.3%


PeerResearch reports based on Apache Spark reviews

Type	Title	Date
Category	Hadoop	Jul 17, 2026
Product	Reviews, tips, and advice from real users	Jul 17, 2026
Comparison	Apache Spark vs Cloudera Distribution for Hadoop	Jul 17, 2026
Comparison	Apache Spark vs Amazon EMR	Jul 17, 2026
Comparison	Apache Spark vs HPE Data Fabric	Jul 17, 2026

Valuable Features

Apache Spark excels in speed, scalability, and flexibility. Users find its in-memory data processing efficient, allowing large-scale data handling. Features like Spark SQL, machine learning libraries, and real-time streaming are highly valuable. Its distributed computing framework facilitates processing across multiple nodes. Integration with various platforms and support for multiple languages enhance its usability. Organizations appreciate its fault tolerance, ease of use, and capability to process both batch and streaming data effectively.

"As it uses in-memory data processing, Spark is very fast."
"It is useful for handling large amounts of data, and it is very useful for scientific purposes."
"We are using Apache Spark, for large volume interactive data analysis."

Room for Improvement

Apache Spark requires better real-time querying, improved user interface, and enhanced documentation. Its complexity and steep learning curve present challenges. Integration with more languages and machine learning tools is needed. Users face issues with garbage collection affecting performance and memory usage. Monitoring and debugging capabilities should be more user-friendly. Stability and scalability concerns exist, and better connectors for databases are necessary. Stream processing improvements and enhanced API stability are also suggested.

"Apache Spark could improve the connectors that it supports."
"It is useful for scientific purposes, but for commercial use of big data, it gives some trouble."
"Apache Spark is very difficult to use. It would require a data engineer."

ROI

Apache Spark delivers significant cost savings and performance enhancements. Users experience reduced operational expenses, with one reporting a 50% decrease. The tool's capability to leverage expertise and efficiencies results in high returns within a medium timeframe. Because the tool is open source, ROI varies by deployment, yet substantial reductions in both time and money are reported. Operational advantages arise from lower costs due to expertise availability, though additional memory and infrastructure may increase performance costs. Overall, Spark considerably impacts cost-efficiency.

Pricing

Apache Spark is an open-source tool, primarily free of charge, but operational costs can vary based on deployment. While utilizing open-source versions incurs no licensing fees, cloud and infrastructure expenses can be significant. Certain services, such as Cloudera or Databricks, may add costs for enhanced support or bundled packages. Setup time generally spans four to five weeks. Licensing and costs depend on project specifics and can be influenced by existing infrastructure and platform choices.

"I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten."
"Considering the product version used in my company, I feel that the tool is not costly since the product is available for free."
"The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks."

Popular Use Cases

Organizations primarily utilize Apache Spark for processing large datasets, executing data analytics, and conducting predictive analytics. Common tasks include real-time data streaming, data integration, data transformation, machine learning, and building ETL pipelines. Apache Spark's in-memory computing capabilities make it efficient for big data processing, enabling tasks like clustering, segmentation, and batch processing. It supports multiple environments, from cloud to on-premise, and integrates seamlessly with tools like Databricks and machine learning programs.

Service and Support

As open-source software, Apache Spark lacks official technical support and relies on community forums and documentation for guidance. Some users find this sufficient, pointing out the vibrant community and resources available online or via vendors like Cloudera. Others mention limitations in response times or quality due to the nature of open-source support. While free versions depend on the community, paid services from Databricks or Cloudera provide more structured assistance, which users find beneficial.

Deployment

Apache Spark's initial setup varies in complexity based on environment and expertise. Many find deploying in cloud environments like Databricks straightforward, often taking just minutes. In contrast, on-premise setups can be more challenging, requiring extensive configuration and time. Experience with distributed systems influences ease, with knowledgeable teams finding the process simpler. Some users note integration difficulties with additional services, while security configurations significantly increase complexity. Documentation and specialized consulting can facilitate smoother installations.

Scalability

Apache Spark is highly scalable, supporting both large and small teams across varied industries. Users appreciate its capacity for expansion, often employing additional monitoring and technical expertise for optimization. They highlight its versatility across different user types and its reliable performance with large data sets. While some find node addition challenging, others praise the straightforward scaling in cloud environments. Proper infrastructure management is key, ensuring strong performance and efficient resource use.

Stability

Apache Spark is widely regarded as stable according to user feedback. Companies find it reliable for large-scale operations, citing few bugs or crashes. It handles tasks effectively, though some experience challenges with initial setups or streaming data. Memory issues and optimization needs occur but are manageable with proper configuration. Users rate its stability highly, appreciating its robust performance in handling big data workloads, especially with newer versions, which have addressed previous difficulties.

These insights are based on the in-depth reviews provided by peers to help you make a better buying decision.


Review data by company size

By reviewers
Company Size	Count
Small Business	25
Midsize Enterprise	14
Large Enterprise	25


By visitors reading reviews
Company Size	Count
Small Business	173
Midsize Enterprise	49
Large Enterprise	218


Top industries

By visitors reading reviews

Financial Services Firm: 20%

Other industries (percentages not shown): Manufacturing Company, Construction Company, Comms Service Provider, Outsourcing Company, Marketing Services Firm, Computer Software Company, Healthcare Company, Government, Educational Organization, University, Insurance Company, Retailer, Transportation Company, Performing Arts, Media Company, Real Estate/Law Firm, Consumer Goods Company, Legal Firm, Non Profit, Recreational Facilities/Services Company, Pharma/Biotech Company, Renewables & Environment Company, Hospitality Company, Religious Institution, Energy/Utilities Company


Learn more about Apache Spark

Apache Spark's strengths lie in its ability to process large data volumes efficiently through real-time and batch capabilities. With in-memory computation, it ensures fast data processing and significant performance gains. Its wide range of APIs, including those for machine learning, SQL, and analytics, makes it versatile in handling complex data operations. While popular for ease of use and fault tolerance, Spark's management, debugging, and user-friendliness could benefit from improvements. Better GUIs, integration with BI tools, and enhanced monitoring are desired, alongside shuffling optimization and compatibility with more programming languages.

What are Apache Spark's key features?

Scalability: Efficiently manages large datasets across nodes.
Performance: In-memory computation for faster data processing.
Real-time Processing: Supports real-time analytics and data streaming.
APIs: Offers extensive APIs for machine learning, SQL, and analytics.

What benefits or ROI should users look for in reviews?

Ease of Use: Simplifies complex data tasks through intuitive operations.
Fault Tolerance: Ensures data reliability and continuous operations.
Integration Flexibility: Easily integrates with big data platforms and tools.

Organizations use Apache Spark predominantly for in-memory data processing, enabling seamless integration with big data frameworks. It's applied in security analytics, predictive modeling, and helps facilitate secure data transmissions in AI deployments. Industries leverage Spark's speed for sentiment analysis, data integration, and efficient ETL transformations.

Apache Spark customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

Product Categories

Hadoop

Compute Service

Java Frameworks

Popular Comparisons

Spot by Flexera vs Apache Spark

Spring Boot vs Apache Spark

AWS Lambda vs Apache Spark

Cloudera Distribution for Hadoop vs Apache Spark

IBM Netezza Performance Server vs Apache Spark

Amazon EC2 vs Apache Spark

Amazon EMR vs Apache Spark

AWS Fargate vs Apache Spark

IBM Spectrum Computing vs Apache Spark

Apache NiFi vs Apache Spark

HPE Data Fabric vs Apache Spark

Jakarta EE vs Apache Spark

AWS Batch vs Apache Spark

Amazon EC2 Auto Scaling vs Apache Spark

Helidon vs Apache Spark


Apache Spark Reviews Summary
Author info	Rating	Review Summary
Data Architect at Devtech	4.5	I’ve used Apache Spark for four years, mainly for data integration and access. Its in-memory processing and open-source flexibility suit my needs, despite some stability issues. I prefer it over commercial tools like Informatica due to cost and adaptability.
Consultant, Chief Engineer, Teamleiter at infoteam Software AG	4.0	I used Apache Spark for two years in an on-prem prototype; setup was straightforward and support was good. I liked its fast database access, transformation, and reliable data exchange/integration. Licensing seemed midrange, but the customer ultimately chose another technology.
Data Engineer at a tech company with 10,001+ employees	5.0	I use Apache Spark for real-time data processing and transformation across multiple sources like CRM and Siebel. It's reliable, fast, and improves our decision-making, though I see future needs for better integration with emerging cloud solutions.
Data Scientist at a financial services firm with 10,001+ employees	4.5	I primarily use Apache Spark for data processing tasks involving large datasets, appreciating its ease of use and portability. While it's efficient for both small and large datasets, the lack of support for geospatial data is a limitation.
Head Of Data at Ekar	4.0	Apache Spark significantly reduced operational costs by 50% and although it supports parallel processing, it needs improvements in scalability and user-friendliness. Working with datasets isn't as straightforward as with Pandas, though it's flexible and functional.
Manager Data Analytics at an outsourcing company with 5,001-10,000 employees	3.5	We use Apache Spark to handle real-time data streaming and machine learning, significantly improving efficiency and reducing costs. It offers flexibility in scaling and integrates well with other tools, though its learning curve could be challenging for non-technical users.
Senior Software Architect at USEReady	4.0	I use Apache Spark for big data engineering, valuing its batch and streaming capabilities. While stable and scalable, its ecosystem is complex for beginners, and clustering setup can be tricky. I rate it 8/10.
Sr Manager at a transportation company with 10,001+ employees	4.5	I use Apache Spark for real-time data processing and ETL tasks. It offers unparalleled features but faces limitations due to its in-memory implementation. Despite improvements in version 3.0, reducing costs and addressing memory issues would enhance it further.
Senior Developer at Infosys	3.5	My experience with Spark for large-scale distributed data transformations is positive due to its speed and cost reduction. While setup is complex and scheduling needs external tools, I recommend it for big data processing.
Head of Data Science center of excellence at Ameriabank CJSC	4.0	I use Apache Spark primarily for in-memory processing of big data, which is valuable for tasks like running ML algorithms. Although its Pandas UDF support is advantageous, the Java overhead and performance issues suggest alternatives may be preferable.

Devindra Weerasooriya

Data Architect at Devtech

Nov 20, 2025

Provides a consistent framework for building data integration and access solutions with reliable performance

What is our primary use case?

I am not just an end customer of Apache Spark; I use it for the solutions that I produce, primarily data integration solutions and data access solutions based on Apache Spark, but it may depend on the situation to align with other tools present in various customer locations, such as Informatica.

Essentially, my main reason for using Apache Spark is data integration, and the two major use cases are data access and data integration.

I do use Apache Spark for event analysis.

Apache Spark, specifically PySpark and the tools available there, has been quite helpful in my event analysis work.

What is most valuable?

The in-memory computation feature is certainly helpful for my processing tasks.

It is helpful because, when structures can be held in memory rather than stored during the computation, I choose the in-memory option. There are limitations to holding data in memory that need to be addressed, but I still prefer in-memory computation.

The solution is beneficial in that it provides a base-level long-held understanding of the framework that is not variant day by day, which is very helpful in my prototyping activity as an architect trying to assess Apache Spark, Great Expectations, and Vault-based solutions versus those proposed by clients like TIBCO or Informatica.

What needs improvement?

The obvious area for improvement is ease of use, though there are limits to what can be done there. Tools like Informatica, TIBCO, or Talend offer specific advantages in this respect, but their licensing can be costly. I prefer to work this way, which does not imply being anti-tooling, but since your focus was on my technology, these will continue to be my technologies.

For how long have I used the solution?

I have been using Apache Spark for approximately four years.

What do I think about the stability of the solution?

Without a doubt, we have had some crashes because each situation is different, and while the prototype in my environment is stable, we do not know everything at other customer sites, and we have not identified a way of assessing the sanity of the whole environment, so there have been some crashes.

What do I think about the scalability of the solution?

As long as one knows how to scale the environment, scalability is not a problem at all and is not difficult to manage.

How are customer service and support?

I have had experience with technical support from Apache, mainly through newsgroups, which have been reasonably forthcoming when required.

I think it would be farcical to compare technical support from Apache Spark to what I would receive under an Informatica license, but I have received support via newsgroups or guidance on specific discussions, which is what I would expect in an open-source situation.

How would you rate customer service and support?

Neutral

How was the initial setup?

I think things are becoming easier for installation and deployment, and I have cleaned up the process over a number of years to minimize pain, although the process of installing a commercial tool might be easier.

What other advice do I have?

API management is not my interest at the moment, so I do not remember reading information about API management products or Enterprise Service Bus (ESB) in the past, and I am not using any solutions like that.

I am working much more in data science and data engineering.

I can discuss my experience with data engineering tools, and I am using some now.

I work primarily with open-source tools, and that is the way I work instead of big data solutions like Informatica.

The open-source big data stacks I use include Apache Spark, Hive, and tools such as Vault for protection and Great Expectations for data quality, which are my primary tools along with the PostgreSQL database for database support.

I have been using Apache Spark for quite a long time.

Real-time data analytics is an area of interest for me, but I have not had to do it in most of my work. It is increasingly becoming something I am looking at seriously, using the Apache streaming API to see if it fits my use cases instead of going down the Kafka path, but I do not know it very well.

Very often in my experiments, the data set has had to be partitioned, and there have been issues in handling very large data sets. Most of my work is done using Python machine learning libraries, which requires chunking. Speed of prediction has been a concern in some experiments, where we have had to shut down processes due to CPU requirements and then restart with different Apache configurations. If I were to name one constraint on running machine learning experiments, resourcing support is a major determinant.
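
The chunking mentioned above can be sketched in plain Python. This is only an illustration under stated assumptions: `chunks` and `predict_in_chunks` are hypothetical helper names, and `predict_fn` stands in for whichever Python machine learning model is in use, which the review does not name.

```python
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def predict_in_chunks(predict_fn, rows, size):
    """Call `predict_fn` on one chunk at a time.

    Only one input chunk is materialized at once; the predictions
    themselves are still collected into a single list.
    """
    results = []
    for batch in chunks(rows, size):
        results.extend(predict_fn(batch))
    return results
```

A smaller `size` lowers peak memory per call at the cost of more calls, which is the kind of trade-off that resourcing constraints force.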

Apache Spark is basically free compared to other similar products like Informatica and Talend.

Sometimes, we have to use cloud resources due to insufficient on-premise resources for certain types of computing, so it is hybrid.

Our builds have all been based on the Azure Apache marketplace.

While I have worked with Hive and not really on Talend, I confirm I indeed work mainly with Apache Spark and Hive.

In the past, Hive has been a kind of a default part of the process, but that is not the case anymore in recent times.

I have given this review a rating of 9 out of 10.

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Microsoft Azure

Michael Lierheimer

Consultant, Chief Engineer, Teamleiter at infoteam Software AG

Feb 27, 2026

Prototype work has improved fast data access, transformation, and reliable cross-site exchange

What is our primary use case?

I attempted to use Apache Spark in one of our customer projects, but after the initial test, our customer moved to another technology and another database system. I do not have any final remarks on what would be better or what is not as good because we only tried it for the customer and made a running technical prototype. After we presented it to our customer, he moved to another system, which was preferred by one of his consultants. I do not know exactly what was the reason to move away from Apache Spark or the underlying database system, but it was simply a decision driven by the customer.

What is most valuable?

The best features in Apache Spark that I appreciate are the fast database access, the data transformation, and the data exchange.

Very good integration with other platforms, including interfaces that can connect to other vendors and technologies and integration of the MCP protocol used by AI systems, would be an interesting direction for me personally and for us as a company when integrating the technology into customer projects. The most important part is that everything can be connected, and that data exchange across overseas connections is fast and reliable.

What needs improvement?

Regarding the current licensing cost, I would say it is in the medium range. However, because I do not have a licensed project for our customer, I do not know if it would be too high for our customers if they have to buy the license for themselves. For me, compared to other things, the licensing was acceptable.

For how long have I used the solution?

I have been working with Apache Spark for two years.

How are customer service and support?

I would rate the technical support of Apache Spark an eight because when we had questions, we found solutions, and it was straightforward.

How would you rate customer service and support?

Positive

How was the initial setup?

For me, the implementation of Apache Spark was straightforward. I cannot give a specific duration for the deployment. It was done step by step over approximately a week, in parallel with other topics I have to maintain.

What about the implementation team?

I did it all by myself.

What other advice do I have?

As for real-time data analytics with Apache Spark, I do not know if I would use inbuilt analytics or other options. Most of our customers have external AI systems polling the data from different data sources and doing their analytics in their preferred tool. I do not know if I would use integrated systems in Apache Spark itself, because we are working with customers who have a huge toolchain and have dedicated tools for nearly every task, and they are choosing their task for AI, predictive maintenance, and all the different features. I gave this review a rating of eight.

Which deployment model are you using for this solution?

On-premises

Omar Khaled

Data Engineer at a tech company with 10,001+ employees

Aug 12, 2025

Empowering data consolidation and fast decision-making with efficient big data processing

What is our primary use case?

I don't use a big data solution such as a Data Lake. I have used Apache Spark code on a lot of big data. We don't have a Data Lake and these new technologies, but we have used Apache Spark code to get data from big data, CRM, and Siebel. Siebel is the application that exists in Teleco branches in Egypt, and the customers communicate with it. We have a lot of data from the finance team, CRM, Siebel, and other sources. We consolidate all of this data and perform transformations.

With Apache Spark, we can perform various transformations. For instance, when a customer calls their mother and consumes many minutes, we should consolidate all of them and calculate the net minutes. For the monthly invoice, we should determine how much the customer should pay for ADSL, phones, and family phones. We take all of this information, transform it, and use it to generate the invoice.
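
The consolidation described above can be illustrated in plain Python. This shows the logic only, not the reviewer's actual Spark job, where the same step would typically be a grouped aggregation. The record layout, the per-minute rates, and the flat ADSL fee are all hypothetical.

```python
from collections import defaultdict

# Hypothetical tariffs; the review gives no real prices.
RATES = {"phone": 0.10, "family_phone": 0.05}
ADSL_MONTHLY_FEE = 20.0


def net_minutes(calls):
    """Sum call minutes per (customer, service) pair.

    `calls` is an iterable of (customer, service, minutes) tuples.
    """
    totals = defaultdict(int)
    for customer, service, minutes in calls:
        totals[(customer, service)] += minutes
    return dict(totals)


def monthly_invoice(calls, adsl_customers):
    """Combine usage charges and the flat ADSL fee into one amount per customer."""
    invoice = defaultdict(float)
    for (customer, service), minutes in net_minutes(calls).items():
        invoice[customer] += minutes * RATES[service]
    for customer in adsl_customers:
        invoice[customer] += ADSL_MONTHLY_FEE
    return {customer: round(amount, 2) for customer, amount in invoice.items()}
```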

We enhance the data processing by using Apache Spark SQL. I haven't used Apache Spark machine learning, but I've used Apache Spark SQL because we have data in HDFS tables. We take the Apache Spark code, get the data, and then get the aggregate. For example, we can request aggregated data about a customer's consumption since last week. You can use Apache Spark code or Python code itself, and if you know SQL, you can type SQL code within the same script and output it to any table or Excel file. It's durable and easy.
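
Writing SQL inside a Python script, as described above, might look like the following sketch. The table and column names (`customer_id`, `call_date`, `minutes`) are hypothetical, and `weekly_usage_query` and `weekly_usage` are illustrative helpers rather than part of Spark. The only Spark call used is `spark.sql`, invoked on a SparkSession that the caller passes in.

```python
from datetime import date, timedelta


def weekly_usage_query(table, customer_id, today, days=7):
    """Build a Spark SQL query totalling one customer's minutes over the last `days` days.

    Values are interpolated directly for readability; real code should
    validate or parameterize inputs instead of formatting them into SQL.
    """
    since = today - timedelta(days=days)
    return (
        f"SELECT customer_id, SUM(minutes) AS total_minutes "
        f"FROM {table} "
        f"WHERE customer_id = '{customer_id}' "
        f"AND call_date >= DATE '{since.isoformat()}' "
        f"GROUP BY customer_id"
    )


def weekly_usage(spark, table, customer_id, today):
    """Run the query through an existing SparkSession and return the resulting DataFrame."""
    return spark.sql(weekly_usage_query(table, customer_id, today))
```

With PySpark installed, a session obtained from `SparkSession.builder.getOrCreate()` could be passed as `spark`, and calling `.show()` on the returned DataFrame would print the aggregate.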

When a customer has an issue with their phone number and can't call or access the internet, they visit a branch and speak with an agent. We need to take action based on the data, so we need real-time data processing to get aggregated data for this customer from the last week or month. All data solutions serve the customer and business needs, which is what I appreciate about data solutions.

What is most valuable?

Apache Spark itself is similar to the Python programming language. Python is a language with many libraries for mathematics and machine learning. Apache Spark is the solution, and within it, you have PySpark, which is the API for Apache Spark to write and run Python code. Within it, there are many APIs, including SQL APIs, allowing you to write SQL code within a Python function in Apache Spark. You can also use Apache Spark Structured Streaming and machine learning APIs.

What needs improvement?

Regarding Apache Spark, I have only used Apache Spark Structured Streaming, not the machine learning components. I am uncertain about specific improvements needed today. However, after five years, there will be many new cloud providers, connectors, and solutions. Every year, there should be some enhancement to remain competitive in the market. As new solutions emerge frequently, the basic improvement would be to have integration with these solutions.

For how long have I used the solution?

I have used Apache Spark for three years. However, I have only used Apache Spark Structured Streaming, this specific library in Apache Spark, for one year.

What do I think about the stability of the solution?

I don't have any issues with Apache Spark that are better handled in Hadoop. I have experienced the opposite. I have used Hadoop, MapReduce, and Apache Spark, and I see many problems in MapReduce that Apache Spark resolves. Every new solution comes to solve problems from previous solutions. Apache Spark resolves many problems in the MapReduce solution and Hadoop, such as the inability to run effective Python or machine learning algorithms. MapReduce needs to perform numerous disk input and output operations, while Apache Spark can use memory to store and process data.

I have used it daily for one year without issues. I also have colleagues with five years of experience who haven't encountered any problems.

Which solution did I use previously and why did I switch?

I have used the Hadoop ecosystem, Hadoop HDFS, Impala and Hive. Hive is similar to SQL but within the Hadoop ecosystem. It queries data that HDFS has stored in folders. I have used Python, Apache Spark, and Apache Spark Structured Streaming from Apache Spark. I have also used MapReduce, which is a processing engine similar to Apache Spark. To use MapReduce, you need to be familiar with the Java programming language. Since not everyone knows Java, Apache Spark is considered better because users can use Python and SQL within it.

Which other solutions did I evaluate?

I prefer to use Apache Spark as an open-source solution if you already have experienced data engineers who use Apache Spark. If you don't have this expertise, you need to hire additional employees. However, if you have a Databricks solution that includes Apache Spark, you have customer service and may need to hire fewer employees.

What other advice do I have?

I am unable to suggest specific improvements at this time. This review has a rating of 10 out of 10.

Which deployment model are you using for this solution?

On-premises

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other

Dunstan Matekenya

Data Scientist at a financial services firm with 10,001+ employees

Jul 10, 2024

Open-source solution for data processing with portability

What is our primary use case?

Most of my use cases involve data processing. For example, someone tried to run sentiment analysis on Databricks using Apache Spark. They had to handle data from many countries and languages, which presented some challenges. Besides that, I primarily use Apache Spark for data processing tasks. I work with mobile phone datasets, around one terabyte in size. This involves extracting and analyzing data before building any models.

What is most valuable?

Apache Spark is known for its ease of use. Compared to other available data processing frameworks, it is user-friendly. While many choices now exist, Spark remains easy to use, particularly with Python. You can utilize familiar programming styles similar to Pandas in Python, including object-oriented programming.

Another advantage is its portability. I can prototype and perform some initial tasks on my laptop using Spark without needing to be on Databricks or any cloud platform. I can transfer it to Databricks or other platforms, such as AWS. This flexibility allows me to improve processing even on my laptop. For instance, if I'm processing large amounts of data and find my laptop becoming slow, I can quickly switch to Spark. It handles small and large datasets efficiently, making it a versatile tool for various data processing needs.

What needs improvement?

Apache Spark lacks built-in support for geospatial data.

For how long have I used the solution?

I started using Apache Spark in 2015.

How are customer service and support?

Apache Spark is open-source software. The documentation is pretty decent. I have used the online resources and mailing list but have never received a positive response.

How was the initial setup?

The initial setup is very easy, and it is simple to get started. Setting up Apache Spark has become more accessible, even when creating a new environment. It is available in the Python package ecosystem, so you can install it using pip. However, you still need to set up the Java environment for it to work properly.
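As a rough sketch of the two prerequisites mentioned above, the following check uses only the Python standard library to report whether the `pyspark` package and a `java` executable are visible from the current environment. The function name `check_spark_prerequisites` is made up for illustration, and the check is deliberately shallow: it does not verify which Java version is installed.

```python
import importlib.util
import shutil


def check_spark_prerequisites():
    """Report whether PySpark and a Java runtime appear to be available."""
    return {
        # True if the pyspark package (e.g. from `pip install pyspark`) is importable
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
        # True if a `java` executable is on PATH; Spark runs on the JVM
        "java_on_path": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_spark_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If either entry comes back as "no", that is the piece of the setup still missing on the machine.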

Apache Spark is different. I can use it on my own computer. My laptop is not a super high-performance machine, but it has nine cores and 40 GB of memory, which is sufficient to run Spark without much setup.

We also have a cluster of much larger computers with hundreds of cores where we run Spark. Managing Spark on this cluster requires more setup. Initially, we used HDP, but it has since changed to something else. Setting up Spark on a cluster takes some time because Spark doesn't necessarily run independently in a cluster environment. You need a cluster management platform or framework like Hadoop or Kubernetes to manage and run Spark efficiently.

So, while setting up Spark on my laptop is straightforward, setting it up on a custom cluster involves more effort and configuration.

I rate the initial setup an eight or nine out of ten, where one is difficult, and ten is easy.

What other advice do I have?

Apache Spark is my go-to solution for processing large-scale datasets. I would recommend it 100%. One of the main reasons is its ease of use. You can start using it on your laptop without any extra infrastructure, and then you can take that same code and run it anywhere else, including on the cloud. You're not locked in by any vendor, which is a significant advantage.

Overall, I rate the solution a nine out of ten as a big data processing engine.

Madhan Potluri

Head Of Data at Ekar

Aug 5, 2024

Offers user-friendliness, clarity and flexibility

What needs improvement?

The only issue I faced with the tool was that I had to choose the compute device to support parallel processing, when it should be more about scaling out horizontally. The tool should be more scalable, not in terms of increasing the CPU, but in terms of the number of units. If two units are not enough, a third or fourth unit should be able to come into the picture.

From my perspective, the only thing that needs improvement is the interface, as it is not easily understandable. Sometimes I get an RDD-related error, and it is difficult to understand where things went wrong. When I work with datasets using the Pandas library in Python, I can apply a function to each column and get a transformed column back. When I try to do the same thing with Apache Spark, it works, but it is not straightforward, and I need to approach it a little differently. Even then, Spark sometimes throws an error saying that it is looping back to the same place, whereas I never got those kinds of errors in Pandas.

In future updates, the tool should be made more user-friendly. I want to run fifty parallel processes rather than one, and I want to pick particular columns to partition by. If the tool is user-friendly and offers clarity and flexibility, that will be good.
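To make "partition by a particular column" concrete, here is a minimal pure-Python sketch of hash partitioning. It is not Spark code: the column name `unit_id`, the sample rows, and the use of CRC32 are assumptions for illustration, and Spark uses its own hash function internally. In PySpark the comparable request would be along the lines of `df.repartition(50, "unit_id")`.

```python
import zlib
from collections import defaultdict


def partition_by_column(rows, column, num_partitions):
    """Group rows into partitions by hashing the value of one column."""
    partitions = defaultdict(list)
    for row in rows:
        # CRC32 is a stand-in hash so the result is stable across runs.
        key = str(row[column]).encode("utf-8")
        partitions[zlib.crc32(key) % num_partitions].append(row)
    return dict(partitions)


# Hypothetical sample data
rows = [
    {"unit_id": "A", "value": 1},
    {"unit_id": "B", "value": 2},
    {"unit_id": "A", "value": 3},
    {"unit_id": "C", "value": 4},
]
parts = partition_by_column(rows, "unit_id", num_partitions=50)

# Every row for a given unit_id ends up in exactly one partition.
for index, members in sorted(parts.items()):
    print(index, [r["unit_id"] for r in members])
```

The point of the sketch is the guarantee, not the hash: all rows sharing a key land together, so each partition can be processed independently.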

For how long have I used the solution?

I have been using Apache Spark for four years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution a nine out of ten. The only issues with the tool revolve around user interaction and user flexibility.

What do I think about the scalability of the solution?

It is a scalable solution. Scalability-wise, I rate the solution an eight out of ten.

Around five people in my company use the tool.

How are customer service and support?

The solution's technical support is helpful, but most of the problems I faced were generic issues. If I face any non-generic problems, I get help from the tool's team. For generic issues, I get answers mainly from the forums where the problem was already resolved. When it comes to an unknown problem or one specific to my work, support takes time. I rate the technical support a seven out of ten.

How would you rate customer service and support?

Neutral

Which solution did I use previously and why did I switch?

I only work with Apache Spark.

How was the initial setup?

The product's initial setup phase was easy.

I managed the product's installation phase, both locally and on the cloud.

The solution is deployed on-premises.

The solution can be deployed in two to three hours.

What was our ROI?

Apache Spark has helped save 50 percent of the operational costs. Time was reduced with the use of the tool, but the computing part increased. Overall, I can see that the tool's use has led to a 50 percent reduction in costs.

What's my experience with pricing, setup cost, and licensing?

I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten.

Which other solutions did I evaluate?

Previously, I was more of a Python full-stack developer, and I was happy dealing with PySpark libraries, which gave me an edge in continuing the work with Apache.

What other advice do I have?

Speaking about Apache Spark's use in our company's data processing workflows, it matters most when we deal with large datasets. Without Spark, processing a data frame consisting of one year of data used to take me 45 minutes to an hour, and sometimes I got out-of-memory errors. Such issues were avoided the moment I started using Apache Spark, as I was able to get the whole processing done in less than five minutes with no memory issues.

For big data processing, the tool's parallel processing and speed have been helpful. When I want to apply a function, I can write the code once and apply it directly to the data. Basically, I used Apache Spark to forecast multiple units at the same time. Without Apache Spark, I would be doing that one by one, a serial process that used to take me around five hours. At the moment, we use Apache Spark for parallel processing, where the computations run in parallel, and the time they take is cut down by at least 90 percent. It significantly reduces the time needed for operations.
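The serial-versus-parallel pattern described above can be sketched on a single machine with the standard library. This is only a stand-in for what Spark does across a cluster: the `forecast_unit` function, the unit names, and the histories are all hypothetical, and Python threads illustrate the shape of the fan-out rather than a real speedup for CPU-bound work. In PySpark, a comparable per-group pattern is `groupBy(...).applyInPandas(...)`.

```python
from concurrent.futures import ThreadPoolExecutor


def forecast_unit(history):
    """Hypothetical forecast: the mean of the last three observations."""
    recent = history[-3:]
    return sum(recent) / len(recent)


# Made-up history per unit
histories = {
    "unit_1": [10, 12, 14, 16],
    "unit_2": [5, 5, 5, 5],
    "unit_3": [1, 2, 3],
}

# Serial version: one unit after another.
serial = {unit: forecast_unit(h) for unit, h in histories.items()}

# Parallel version: the same function, fanned out across workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = dict(zip(histories, pool.map(forecast_unit, histories.values())))

print(parallel)
```

The function is written once and the runtime decides how to spread it over the units, which is the property the reviewer is describing.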

The tool's real-time processing is an area that I have not tried to use much. When it comes to real-time processing of my data, I use Kafka.

I am handling data governance using Databricks Unity Catalog.

When I try to apply an ML model, I am unable to get that model done on a table partitioned by a particular column; it makes me get the job done in a reduced number of partitions. If I go with five partitions, I am able to get at least three to four times the benefits in a lesser amount of time.

Regular maintenance exists, but it is not like I have to sit week by week and upgrade a patch or something like that. The maintenance is done mostly in about six months to a year.

I take care of the tool's maintenance.

I recommend the tool to others.

I rate the tool an eight out of ten.

Which deployment model are you using for this solution?

On-premises

reviewer2534727

Manager Data Analytics at an outsourcing company with 5,001-10,000 employees

Aug 12, 2024

A flexible solution with real-time processing capabilities

What is our primary use case?

We use the solution to extract data from our sensors. We have lots of data streaming into our system, which used to get overwhelmed. We use Apache Spark to handle real-time streaming and do machine learning to predict supply and demand in the market and adjust operations.

What is most valuable?

I like Apache Spark's flexibility the most. Before, we had one server that would choke up. With the solution, we can easily add more nodes when needed. The machine learning models are also really helpful. We use them to predict energy theft and find infrastructure problems.

The tool's real-time processing has had a big impact. We used to get data from sensors after a month. Now we get it in less than 10 minutes, which helps us take quick action.

We use Apache Spark to map our data pipelines using MapReduce technology. We're also working on integrating tools like Hive with Apache Spark to distribute our data processing. We can also integrate other tools like Apache Kafka and Hadoop.

We faced some challenges when integrating the solution into our existing system, but good documentation helped solve them.

What needs improvement?

For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial.

For how long have I used the solution?

I have been working with the product for five years.

What do I think about the stability of the solution?

Apache Spark is stable.

What do I think about the scalability of the solution?

We're a big company with about 4 million consumers. We handle huge amounts of data—around 30,000 sensors send data every 15 minutes, which adds up to 5-10 terabytes per day.
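A back-of-the-envelope calculation helps put those figures in proportion. The sketch below assumes every sensor transmits exactly once per 15-minute interval and that a terabyte means 10^12 bytes; neither assumption is stated in the review.

```python
SENSORS = 30_000
INTERVAL_MINUTES = 15
TB = 10**12  # decimal terabyte (assumption)

readings_per_sensor_per_day = 24 * 60 // INTERVAL_MINUTES   # 96
messages_per_day = SENSORS * readings_per_sensor_per_day    # 2,880,000

for daily_tb in (5, 10):
    bytes_per_message = daily_tb * TB / messages_per_day
    print(f"{daily_tb} TB/day -> {bytes_per_message / 1e6:.2f} MB per transmission")
```

Under these assumptions each transmission averages roughly 1.7 to 3.5 MB, which would suggest that a transmission carries a batch of measurements rather than a single reading, if the quoted figures are taken at face value.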

Which solution did I use previously and why did I switch?

Before Apache Spark, we had a different solution - a traditional system with one server handling everything, more like a data warehouse. We switched to Apache Spark because we needed real-time visibility in our operations.

How was the initial setup?

The initial setup process was challenging. We tried to do it ourselves at first, but we weren't used to distributed computing systems, creating nodes, and distributing data. Later, we engaged consulting groups that specialized in it. This is why there's a specific learning curve—it would be challenging for a company to start alone.

The initial deployment took us about six to eight months. We started with three people involved in the deployment process and later increased to five. From a maintenance point of view, it's pretty smooth now. It's not difficult to maintain and doesn't require much maintenance.

What was our ROI?

The tool has helped us reduce costs that run into billions of dollars yearly. The ROI is very significant for us.

Which other solutions did I evaluate?

We did evaluate other options. We started by looking at open-source Hadoop deployment, thinking we'd bring data into HDFS and do machine learning separately. But that would have been a hassle, so Apache Spark was a better fit.

What other advice do I have?

I rate the overall solution a seven out of ten. I would recommend Apache Spark to other users, but it depends on their use cases. I advise new users to get an expert involved from the start.

Which deployment model are you using for this solution?

On-premises

KamleshPant

Senior Software Architect at USEReady

Apr 24, 2025

Handles both batch and streaming data efficiently for real-time processing

What is our primary use case?

I use Apache Spark for data engineering work. I handle some computation processes where it is necessary to process big data.

What is most valuable?

Apache Spark's ability to handle both batch and streaming data seamlessly is the most valuable feature for me. It is beneficial for consuming real-time data, and its solid real-time processing capability makes managing data analytics more efficient.

What needs improvement?

There is complexity when it comes to understanding the whole ecosystem, especially for beginners. I find it quite complex to understand how a Spark job is initiated, the roles of driver nodes, worker nodes, stages, and tasks. Additionally, clustering may be a bit complex to set up.

For how long have I used the solution?

I have been using Apache Spark for about two and a half years now.

What was my experience with deployment of the solution?

Clustering may be a bit complex to set up, but it depends on the experience that the person involved has.

What do I think about the stability of the solution?

I find Apache Spark to be fine and stable.

What do I think about the scalability of the solution?

The scalability of Apache Spark depends on the number of machines being used. By adjusting the worker and driver nodes, scaling can be leveraged.
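One way to picture "adjusting the worker and driver nodes" is through the resource flags passed when a job is submitted. The sketch below only assembles a `spark-submit` command line and does not run it. The job file `job.py` and all sizes are hypothetical, and `--num-executors` applies on cluster managers such as YARN; other managers use different settings.

```python
def build_spark_submit(app, num_executors, executor_cores, executor_memory, driver_memory):
    """Assemble a spark-submit command line as a list of arguments."""
    return [
        "spark-submit",
        "--num-executors", str(num_executors),    # how many executor processes
        "--executor-cores", str(executor_cores),  # cores per executor
        "--executor-memory", executor_memory,     # memory per executor, e.g. "4g"
        "--driver-memory", driver_memory,         # memory for the driver
        app,
    ]


# Hypothetical job file and sizes; scaling out means raising num_executors.
small = build_spark_submit("job.py", 2, 2, "4g", "2g")
large = build_spark_submit("job.py", 8, 2, "4g", "2g")
print(" ".join(large))
```

The two command lines differ only in the executor count, which is the horizontal scaling knob; the per-executor and driver sizes are the vertical ones.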

How are customer service and support?

I haven't tried Apache Spark's official support. I mostly use ChatGPT for assistance.

How would you rate customer service and support?

Neutral

Which solution did I use previously and why did I switch?

Before Spark, I used solutions like Storm and Flume, and on AWS, Kinesis. Both Kinesis and Spark have their own ways of managing data ingestion and compute.

How was the initial setup?

In the public cloud, it comes with built-in services, but for on-premises, I have to spin up my own cluster using my ecosystem.

What about the implementation team?

A single person can handle installation if they are capable enough.

What was our ROI?

Timing depends on the cluster being used, such as how many compute nodes and the kind of data there is. So, it's not straightforward to specify the percentage of time and money saved.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is open-source, so it doesn't incur any charges.

Which other solutions did I evaluate?

I used Storm, Flume, and AWS Kinesis.

What other advice do I have?

I rate my overall experience with Apache Spark as eight out of ten. I suggest leveraging AI capabilities to enhance performance or check for anomalies.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other

Sachin Shukre

Sr Manager at a transportation company with 10,001+ employees

Dec 6, 2023

Offers real-time and near-real-time data processing

What is our primary use case?

We use it for real-time and near-real-time data processing. We use it for ETL purposes as well as for implementing the full transformation pipelines.

What is most valuable?

No other platform can challenge its features, apart from the restrictions that come with its in-memory implementation.

What needs improvement?

The restrictions that come with its in-memory implementation need improvement. It has been improved significantly up to version 3.0, which is currently in use.

Once I get those insights, I can let you know if the restrictions have been overcome. For example, there is an issue with heap memory getting full in version 1.6. There are other improvements in 3.0, so I will check those.

In future releases, I would like to see the cost reduced.

For how long have I used the solution?

We have been using this solution for 11 to 12 years. Now, it is deployed on the cloud or on-premises. Previously, it was on-premises when the version was below 1.6.

Since version 1.6, it has been on the cloud. I have used it on all the major cloud providers: AWS, GCP, and Azure.

What do I think about the stability of the solution?

It is a stable solution, but when it comes to patch updates and different reports being updated, it can be a headache.

When an application is built on top of certain reports with specific versions, and then the version changes, it can lead to a lot of things needing to be adjusted. This is something that definitely needs improvement.

How are customer service and support?

I contacted customer service and support six or seven years ago when Spark was still on version 1.6.

We were struggling with memory limitations and the need for a lift and shift mechanism in a hybrid cloud mode. I contacted one or two people at that time.

How would you rate customer service and support?

Neutral

What's my experience with pricing, setup cost, and licensing?

It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project.

If I propose using Spark for a project, one of the first questions I get from management is about the cost of Databricks Spark on the cloud platform we're using, whether it's Azure, GCP, or AWS. If we could reduce the collection, system conversion, and transformation network costs by even just 2% to 3%, it would be a significant benefit for us.

What other advice do I have?

If your use case involves real-time applications frequently changing columns or data frames, then Spark is a fantastic option for you.

However, if you have a batch process and don't have a structural data analysis, I would suggest avoiding it. The high cost of cloud infrastructure combined with Apache Spark can be a significant burden in such scenarios.

Overall, I would rate the solution a nine out of ten.

Which deployment model are you using for this solution?

Public Cloud

Bharghava Raghavendra Beesa

Senior Developer at Infosys

Jan 21, 2025

Faster data transformations achieved but scheduling dependencies require external solutions

What is our primary use case?

I have some hands-on experience with Spark, roughly six months to one year. We use it for faster processing, especially compute.

Spark is used for transformations on large volumes of data, and its distributed nature is useful. We receive data from various sources and need to transform it. The data is enormous, in terabytes, and often from specific databases. We perform transformations, aggregations, and deduplication.

We meet business requirements by computing data, minimizing it, aggregating it, or performing other operations. We typically write to Hive downstream.

What is most valuable?

Spark is faster and distributed. Previously, everything relied on MapReduce, which was slower. With Spark, multiple computations and transformations are held in memory for faster processing.

Real-time communication is possible, connecting with platforms like Kafka for real-time data import and compute. We implemented Spark and NiFi for integration. Spark replaced other costly products, reducing costs by thirty-eight percent.

What needs improvement?

Spark could improve in scheduling tasks and managing dependencies. Spark alone cannot handle sequential tasks, so it requires external tools such as the Airflow scheduler or scripts. For instance, one task should trigger another based on completion; however, Spark can't manage these dependent loads. We focus on specific compute tasks that we can deliver.
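The kind of dependency handling that has to live outside Spark can be sketched with Python's standard `graphlib` module. This is not Airflow's API, just an illustration of ordering tasks so that each one runs after the tasks it depends on. The task names and the `run` function are hypothetical stand-ins for real Spark job submissions.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key runs only after all of its dependencies.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "deduplicate": {"transform"},
    "aggregate": {"transform"},
    "write_to_hive": {"deduplicate", "aggregate"},
}


def run(task):
    """Stand-in for launching one Spark job, e.g. through spark-submit."""
    return f"ran {task}"


# static_order() yields the tasks in an order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())
results = [run(task) for task in order]
print(order)
```

A scheduler such as Airflow adds retries, timing, and monitoring on top of this basic ordering, which is why it is typically paired with Spark rather than replaced by it.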

For how long have I used the solution?

I have six months to one year of experience working with Spark.

What do I think about the stability of the solution?

Spark is stable; however, efficient use is necessary for running jobs seamlessly.

What do I think about the scalability of the solution?

Spark is scalable.

Which solution did I use previously and why did I switch?

I didn't work on any AI build projects for Spark; however, it supports external AI capabilities.

How was the initial setup?

The initial setup is complex. Logging methods require configuration, and it depends on matching with the cluster. Communicating within the node and setting up external logging supported by Spark are challenging.

Which other solutions did I evaluate?

On the compute side, I worked on Snowflake as well.

What other advice do I have?

I recommend Spark for working with large-scale big data. It is crucial to have skilled technicians. Overall product rating: seven out of ten.

Aleksandr Motuzov

Head of Data Science center of excellence at Ameriabank CJSC

Sep 23, 2024

Enhanced data processing with good support and helpful integration with Pandas syntax in distributed mode

What is our primary use case?

The primary use case for Apache Spark is processing big data in memory with a distributed engine. It is used for various tasks such as running the association rules algorithm in Spark ML, running XGBoost in parallel using the Spark engine, and preparing data for online machine learning using Spark Streaming.

How has it helped my organization?

The most significant cost savings come from the operational side because Spark is very widely used in operations. There are many experts available in the market to operate Spark, making it easier to find the right personnel. It is quite mature, which reduces operational costs.

What is most valuable?

The most significant advantage of Spark 3.0 is its support for Pandas UDFs on DataFrames. This allows running Pandas code in a distributed fashion on the Spark engine, which is a crucial feature. The integration with Pandas syntax in distributed mode, along with the user-defined functions in PySpark, is particularly valuable.

What needs improvement?

The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary. Consequently, alternatives like Doc DB are preferable. Additionally, Spark's performance is slower in some cases, with alternatives running two to five times faster.

For how long have I used the solution?

I have more than ten years of experience using Spark, starting from when it was first introduced.

What do I think about the stability of the solution?

Spark is very stable for our needs. It offers amazing stability.

What do I think about the scalability of the solution?

Scalability depends on how infrastructure is organized. Better balance and network considerations are necessary. However, Spark is very stable when scaled appropriately.

How are customer service and support?

Customer support for Apache Spark is very good. There is a lot of documentation and forums available, making it easier to find solutions. The Databricks team also does a lot to support Spark.

How would you rate customer service and support?

Positive

How was the initial setup?

The initial setup of Spark can take about a week, assuming the right infrastructure is already in place.

What about the implementation team?

A few technicians are typically required for installation and configuration. SRE engineers or operational guys handle the setup, as they need to understand the details about installation and configuration. Maintenance usually requires just an SRE engineer or operational guy.

What was our ROI?

The main benefit in terms of ROI comes from the operation side. Spark’s operational costs are lower due to the availability of experts and its maturity. However, performance costs might be higher due to the need for more memory and infrastructure.

What's my experience with pricing, setup cost, and licensing?

Compared to other solutions like Doc DB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud solutions like Databricks can simplify the process, they may also be less cost-efficient.

What other advice do I have?

I'd rate the solution eight out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other

Title	Rating	Mindshare	Recommending	Interviews
Spot by Flexera	4.3	N/A	100%	6
Spring Boot	4.2	N/A	93%	44
