Apache Spark Reviews

Name: Apache Spark
Brand: Apache
Rating: 4.2 (66 reviews)

Vendor: Apache

90% willing to recommend

1,098 followers

What is Apache Spark?

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
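The linear dataflow described above can be illustrated with a minimal single-machine sketch in plain Python. It does not use Spark or Hadoop; the function names `map_phase` and `reduce_phase` and the sample input are assumptions made for illustration only. The last lines hint at the working-set idea: the mapped data stays in memory and is reused by a second computation instead of being re-read from disk.

```python
from collections import Counter
from functools import reduce


def map_phase(lines):
    """Map: emit one (word, 1) pair per word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    def merge(acc, pair):
        word, count = pair
        acc[word] += count
        return acc
    return reduce(merge, pairs, Counter())


# Stand-in for input that a real MapReduce job would read from disk.
lines = ["spark keeps data in memory", "mapreduce writes data to disk"]

# Linear dataflow: read -> map -> reduce -> (store the result).
word_counts = reduce_phase(map_phase(lines))

# Working-set idea: keep the mapped pairs in memory and run two
# different computations over them without re-reading the input.
pairs = map_phase(lines)
total_words = sum(count for _, count in pairs)
distinct_words = len(reduce_phase(pairs))
```

In a real cluster the pairs would be spread across machines and rebuilt after a failure; this sketch only shows the shape of the computation.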

Apache Spark is the #1 ranked solution in top Hadoop solutions, the #2 ranked solution in top Java Frameworks, and the #4 ranked solution in top Compute Service solutions. PeerSpot users give Apache Spark an average rating of 8.4 out of 10. Apache Spark is most commonly compared to Spring Boot. Apache Spark is popular among the large enterprise segment, which accounts for 71% of users researching this solution on PeerSpot. The top industry researching this solution is financial services, accounting for 27% of all views.

Featured Apache Spark reviews

Dunstan Matekenya

Data Scientist at a financial services firm with 10,001+ employees

Apache Spark is known for its ease of use. Compared to other available data processing frameworks, it is user-friendly. While many choices now exist, Spark remains easy to use, particularly with Python. You can utilize familiar programming styles similar to Pandas in Python, including object-oriented programming. Another advantage is its portability. I can prototype and perform some initial tasks on my laptop using Spark without needing to be on Databricks or any cloud platform. I can transfer it to Databricks or other platforms, such as AWS. This flexibility allows me to improve processing even on my laptop. For instance, if I'm processing large amounts of data and find my laptop becoming slow, I can quickly switch to Spark. It handles small and large datasets efficiently, making it a versatile tool for various data processing needs.

Bharghava Raghavendra Beesa

Senior Developer at Infosys

The Spark solution could improve in scheduling tasks and managing dependencies. Spark alone cannot handle sequential tasks; it requires an external tool such as the Airflow scheduler, or scripts. For instance, one task should trigger another based on completion; however, Spark can't manage these dependent loads. We focus on specific compute tasks that we can deliver.
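The kind of script the reviewer mentions might look like the following minimal sketch, which runs tasks in dependency order using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). It is not Airflow and does not call Spark; the helper `run_pipeline` and the task names are hypothetical, and each task is a placeholder for whatever would actually submit a job.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task only after every task it depends on has finished.

    tasks: maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of task names it waits for.
    Returns the order in which the tasks ran.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()  # placeholder: in practice this might submit a Spark job
        completed.append(name)
    return completed


# Hypothetical three-step pipeline; the names are illustrative only.
log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = run_pipeline(tasks, dependencies)
```

Because an exception in one task stops the loop, dependent tasks never start after a failure. A dedicated scheduler adds retries, timing, and monitoring on top of this basic ordering.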

Madhan Potluri

Head of Data at an energy/utilities company with 51-200 employees

The only issue I faced with the tool was that I had to choose the compute device to support parallel processing, when it should be more about scaling horizontally. The tool should be more scalable, not in terms of increasing the CPU, but in terms of units: if two units are not enough, a third or fourth unit should be able to come into the picture. From my perspective, the only thing that needs improvement is the interface, as it was not easily understandable. Sometimes I get an error saying that it is an RDD-related error, and it becomes difficult to understand where it went wrong. When I deal with datasets using the Pandas library in Python, I can apply functions to each column and get a transformation of the column. When I try to do the same thing with Apache Spark, it works, but it is not straightforward; I need to approach it a little differently. Even then, it sometimes throws an error saying that it is looping back to the same, and I was not getting that kind of error in Pandas. In future updates, the tool should be made more user-friendly. I want to run fifty parallel processes rather than one, and I want to pick particular columns to partition by, so it would be good if the tool were user-friendly and offered clarity and flexibility.
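Splitting data by chosen columns into a fixed number of partitions can be sketched in plain Python as below. This is a conceptual illustration, not Spark code: the function `partition_rows` and the sample rows are hypothetical, and Spark's own partitioner uses a different hash function. In Spark itself, the DataFrame `repartition` method accepts a partition count and column names for this purpose.

```python
import zlib
from collections import defaultdict


def partition_rows(rows, key_columns, num_partitions):
    """Assign each row (a dict) to a partition based on its key columns."""
    partitions = defaultdict(list)
    for row in rows:
        key = "|".join(str(row[col]) for col in key_columns)
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return dict(partitions)


# Hypothetical sample data; the column names are illustrative only.
rows = [
    {"region": "north", "meter": 1, "kwh": 12.5},
    {"region": "south", "meter": 2, "kwh": 7.0},
    {"region": "north", "meter": 3, "kwh": 9.1},
]

# Ask for fifty partitions keyed on the "region" column.
partitions = partition_rows(rows, ["region"], 50)
```

Rows that share the same key values always land in the same partition, which is what lets each partition be processed independently and in parallel.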

Apache Spark mindshare

Product category: Hadoop

As of July 2025, the mindshare of Apache Spark in the Hadoop category stands at 18.3%, down from 20.4% the previous year, according to calculations based on PeerSpot user engagement data.

PeerResearch reports based on Apache Spark reviews

Type	Title	Date
Category	Hadoop	Jul 4, 2025
Product	Reviews, tips, and advice from real users	Jul 4, 2025
Comparison	Apache Spark vs Cloudera Distribution for Hadoop	Jul 4, 2025
Comparison	Apache Spark vs Amazon EMR	Jul 4, 2025
Comparison	Apache Spark vs HPE Ezmeral Data Fabric	Jul 4, 2025

Title	Rating	Mindshare	Recommending	Interviews
Spring Boot	4.2	N/A	95%	38
Jakarta EE	3.7	N/A	66%	3

Valuable Features

"Spark is used for transformations from large volumes of data, and it is usefully distributed."
"The most significant advantage of Spark 3.0 is its support for DataFrame UDF Pandas UDF features."
"I like Apache Spark's flexibility the most. Before, we had one server that would choke up. With the solution, we can easily add more nodes when needed. The machine learning models are also really helpful. We use them to predict energy theft and find infrastructure problems."

Room for Improvement

"The Spark solution could improve in scheduling tasks and managing dependencies."
"The main concern is the overhead of Java when distributed processing is not necessary."
"For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial."

ROI

Apache Spark delivers high returns by enabling significant cost savings and enhanced performance, particularly through efficient use of resources and expertise. Many see a 50 percent reduction in operational expenses and time savings, despite increased computing demands. Organizations benefit from reduced startup time and lower overall operational costs. However, performance costs might rise due to additional memory and infrastructure needs. The adoption of Spark for machine learning analytics enhances overall value when leveraging customer data.

Pricing

"I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten."
"Considering the product version used in my company, I feel that the tool is not costly since the product is available for free."
"The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks."

Service and Support

Apache Spark's customer service and technical support vary widely. Many rely on community forums and documentation as there is no official support with the open-source platform. Some enterprises use third-party vendors like Cloudera for assistance, receiving timely responses. Databricks users appreciate support quality. Developers often find solutions independently or through community interactions, frequently with good success. Paid support offers efficient responses, while community support provides satisfactory help. Commercial support availability differs among vendors.

Scalability

Apache Spark is highly scalable, especially in cloud environments. Users highlight its flexibility and effectiveness in large-scale applications. Many organizations, while needing technical expertise for optimization, report successful scaling with minimal issues. Some suggest that adding new nodes may require time, but once properly set up, it efficiently handles significant data loads. The tool's capacity to accommodate diverse user bases and data requirements proves beneficial for extensive usage scenarios.

Stability

Apache Spark is widely recognized for its stability, though some users encounter configuration-related challenges. While older versions occasionally had issues, many have been resolved in recent updates. Users report stable performance, with exceptions during high data volume or improperly managed resources. Most users rate its stability positively, even amid complex data operations. Organizations like Facebook and Netflix leverage Spark globally, underscoring its robust reputation. Stability ratings often range between eight and ten out of ten.

These insights are based on the in-depth reviews provided by peers to help you make a better buying decision.

Top industries (by visitors reading reviews)

Financial Services Firm: 27%
Computer Software Company: 12%

Other industries listed (shares not shown): Manufacturing Company, Comms Service Provider, University, Retailer, Government, Educational Organization, Insurance Company, Healthcare Company, Real Estate/Law Firm, Media Company, Construction Company, Hospitality Company, Non Profit, Recreational Facilities/Services Company, Performing Arts, Legal Firm, Pharma/Biotech Company, Energy/Utilities Company, Transportation Company, Outsourcing Company, Consumer Goods Company

Apache Spark customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions