Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD): a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
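To make the contrast with MapReduce concrete, here is a minimal single-process sketch in plain Python. `ToyRDD` and its methods are hypothetical names invented for this illustration; it is not Spark's API, and it ignores distribution across machines and fault tolerance entirely. It only shows the two properties described above: the collection is a read-only multiset, and a mapped result can be kept as a working set and reused by several reductions without a write to disk in between.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical, in-memory stand-in for an RDD (not Spark's API)."""

    def __init__(self, items):
        # A tuple is immutable, so the dataset is read-only once created.
        # Duplicates are kept, which makes it a multiset rather than a set.
        self._items = tuple(items)

    def map(self, fn):
        # A transformation returns a new ToyRDD; the original is unchanged.
        return ToyRDD(fn(x) for x in self._items)

    def reduce(self, fn):
        # Combine all items pairwise into a single result.
        return reduce(fn, self._items)

    def collect(self):
        # Return the items as an ordinary list for inspection.
        return list(self._items)


readings = ToyRDD([3, 1, 4, 1, 5])        # note the duplicate 1
squares = readings.map(lambda x: x * x)   # working set held in memory

# Two different reductions reuse the same working set. In the linear
# MapReduce dataflow, each would be a separate job reading from disk.
total = squares.reduce(lambda a, b: a + b)
largest = squares.reduce(max)

print(total, largest)
```

In real Spark the corresponding operations run across a cluster and the data is partitioned; the sketch above only mirrors the shape of the programming model.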



| Product | Market Share (%) |
|---|---|
| Apache Spark | 19.0 |
| Cloudera Distribution for Hadoop | 21.9 |
| HPE Ezmeral Data Fabric | 14.4 |
| Other | 44.7 |

| Title | Rating | Mindshare | Recommending | Interviews |
|---|---|---|---|---|
| Spring Boot | 4.2 | N/A | 95% | 41 |
| Jakarta EE | 3.7 | N/A | 66% | 3 |

| Company Size | Count |
|---|---|
| Small Business | 24 |
| Midsize Enterprise | 13 |
| Large Enterprise | 25 |

| Company Size | Count |
|---|---|
| Small Business | 119 |
| Midsize Enterprise | 48 |
| Large Enterprise | 384 |

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

| Author info | Rating | Review Summary | 
|---|---|---|
| Data Engineer at a tech company with 10,001+ employees | 5.0 | I use Apache Spark for real-time data processing and transformation across multiple sources like CRM and Siebel. It's reliable, fast, and improves our decision-making, though I see future needs for better integration with emerging cloud solutions. | 
| Senior Developer at Infosys | 3.5 | No summary available | 
| Senior Software Architect at USEReady | 4.0 | No summary available | 
| Sr Manager at a transportation company with 10,001+ employees | 4.5 | I use Apache Spark for real-time data processing and ETL tasks. It offers unparalleled features but faces limitations due to its in-memory implementation. Despite improvements in version 3.0, reducing costs and addressing memory issues would enhance it further. | 
| Data Scientist at a financial services firm with 10,001+ employees | 4.5 | I primarily use Apache Spark for data processing tasks involving large datasets, appreciating its ease of use and portability. While it's efficient for both small and large datasets, the lack of support for geospatial data is a limitation. | 
| Data engineer at Cocos pt | 4.5 | We use Apache Spark primarily for Spark SQL and occasionally Spark Streaming, processing data from sources like SAP and Azure Data Warehouse. Its in-memory processing significantly outperforms Hadoop, offering faster data handling and enhanced query optimization. | 
| Head of Data at an energy/utilities company with 51-200 employees | 4.0 | Apache Spark significantly reduced operational costs by 50%, and although it supports parallel processing, it needs improvements in scalability and user-friendliness. Working with datasets isn't as straightforward as with Pandas, though it's flexible and functional. |
| Director Product Development at Mycom Osi | 4.0 | In my company, we use Apache Spark for topology engines and chains. While it is a valuable tool, finding skilled developers is challenging. The deployment phase sometimes requires manual interventions, especially with large datasets, indicating areas for improvement. |