Our main use cases for Spark are Spark SQL and, occasionally, Spark Streaming for processing streaming data.
In most of our solutions, the data came from SAP or an Azure data warehouse, assuming the client was on Azure Cloud. From there we received relational data, or sometimes semi-structured data such as JSON files.
We process that data with Spark, writing the code in PySpark (Spark's Python API) to build DataFrames and shape the data into a Tableau-ready format.
We then load it into a database such as SQL Server, and from there the business data scientists and analysts pick it up. The sources themselves vary widely, for example e-commerce sites.
Previously, we worked mostly with structured data, stored in SAP, mainframes, Oracle, or other systems, and delivered in structured formats like CSV.
Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data: customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark.
Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It has become a crucial tool, widely adopted across the industry over the past decade.
Before Spark, there was MapReduce, but it was much slower: even running the same query a second time was time-consuming, because every stage read from and wrote back to disk. Spark was introduced to address these issues, offering processing speeds up to a hundred times faster than MapReduce; it originated at UC Berkeley's AMPLab and is now developed as an Apache project.
So, in response to the evolving needs of the industry, Spark has proven to be the solution, efficiently handling the processing requirements we face today.
Spark supports both batch and real-time data processing. If you have immediate data, like chat messages that need to be processed as they arrive, Spark Streaming is used; data that can be evaluated later is handled with batch processing.
In our organization, batch processing is used most of the time, and for streaming workloads we typically integrate tools like Kafka.
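A sketch of what that Kafka integration looks like with Structured Streaming (the broker address and topic name are placeholders; running this also requires a Kafka cluster and the spark-sql-kafka connector on the classpath, so treat it as an illustration rather than a runnable job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Subscribe to a Kafka topic; "broker:9092" and "chat-events"
# are hypothetical names.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "chat-events")
    .load()
)

# Kafka delivers the payload as bytes; cast it to a string before
# any parsing or sentiment scoring.
messages = events.select(F.col("value").cast("string").alias("text"))

# Continuously write results out; to the console here for illustration.
query = (
    messages.writeStream
    .outputMode("append")
    .format("console")
    .start()
)
# query.awaitTermination()  # block until the stream is stopped
```

The same DataFrame operations used in batch jobs apply to the streaming DataFrame, which is what makes mixing the two modes practical.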
In-memory processing is what gives Spark its large performance advantage over the disk-based MapReduce approach. On top of that, optimization techniques like caching, broadcasting, and partitioning help tune queries for faster processing.