Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
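
The contrast described above can be sketched in plain Python. This is a conceptual illustration only, not Spark's or Hadoop's API: `mapreduce_job`, `WorkingSet`, and `demo` are hypothetical names invented for this example, and everything runs in a single process with no cluster, distribution, or fault tolerance. It shows a MapReduce-style job going through disk on every pass, while a read-only in-memory working set is loaded once and reused.

```python
import json
import os
import tempfile
from functools import reduce


def mapreduce_job(in_path, out_path, map_fn, reduce_fn, initial):
    """One MapReduce-style pass: read from disk, map, reduce, write to disk."""
    with open(in_path) as f:
        records = json.load(f)                    # read input data from disk
    mapped = map(map_fn, records)                 # map a function across the data
    result = reduce(reduce_fn, mapped, initial)   # reduce the results of the map
    with open(out_path, "w") as f:
        json.dump(result, f)                      # store the reduction result on disk
    return result


class WorkingSet:
    """Toy read-only in-memory working set (hypothetical; not Spark's RDD class)."""

    def __init__(self, items):
        self._items = tuple(items)  # stored as a tuple so it cannot be mutated

    def map(self, fn):
        # Transformations return a new working set; the original is unchanged.
        return WorkingSet(fn(x) for x in self._items)

    def reduce(self, fn, initial):
        return reduce(fn, self._items, initial)


def add(a, b):
    return a + b


def demo():
    with tempfile.TemporaryDirectory() as d:
        in_path = os.path.join(d, "input.json")
        with open(in_path, "w") as f:
            json.dump([1, 2, 3, 4], f)

        # MapReduce style: each job rereads the input from disk.
        total = mapreduce_job(in_path, os.path.join(d, "sum.json"),
                              lambda x: x, add, 0)
        squares = mapreduce_job(in_path, os.path.join(d, "squares.json"),
                                lambda x: x * x, add, 0)

        # Working-set style: load once, then run several computations in memory.
        with open(in_path) as f:
            ws = WorkingSet(json.load(f))
        ws_total = ws.reduce(add, 0)
        ws_squares = ws.map(lambda x: x * x).reduce(add, 0)

    return total, squares, ws_total, ws_squares


if __name__ == "__main__":
    print(demo())
```

Both styles compute the same sums here. The difference is that the second and later computations on the working set avoid another round trip to disk, which is the kind of reuse the paragraph above attributes to RDDs.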
Product | Market Share (%) |
---|---|
Apache Spark | 19.2 |
Cloudera Distribution for Hadoop | 23.3 |
HPE Ezmeral Data Fabric | 14.8 |
Other | 42.7 |
Title | Rating | Mindshare | Recommending | Interviews |
---|---|---|---|---|
Spring Boot | 4.2 | N/A | 95% | 38 |
Jakarta EE | 3.7 | N/A | 66% | 3 |
Company Size | Count |
---|---|
Small Business | 24 |
Midsize Enterprise | 13 |
Large Enterprise | 25 |
Company Size | Count |
---|---|
Small Business | 131 |
Midsize Enterprise | 57 |
Large Enterprise | 470 |
NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Author info | Rating | Review Summary |
---|---|---|
Data Engineer at a tech company with 10,001+ employees | 5.0 | I use Apache Spark for real-time data processing and transformation across multiple sources like CRM and Siebel. It's reliable, fast, and improves our decision-making, though I see future needs for better integration with emerging cloud solutions. |
Senior Developer at Infosys | 3.5 | No summary available |
Head of Data at an energy/utilities company with 51-200 employees | 4.0 | Apache Spark reduced operational costs by 50%. Although it supports parallel processing, it needs improvements in scalability and user-friendliness. Working with datasets isn't as straightforward as with Pandas, though it's flexible and functional. |
Senior Software Architect at USEReady | 4.0 | No summary available |
Head of Data Science center of excellence at Ameriabank CJSC | 4.0 | I use Apache Spark primarily for in-memory processing of big data, which is valuable for tasks like running ML algorithms. Although its Pandas UDF support is advantageous, the Java overhead and performance issues suggest alternatives may be preferable. |
Sr Manager at a transportation company with 10,001+ employees | 4.5 | I use Apache Spark for real-time data processing and ETL tasks. It offers unparalleled features but faces limitations due to its in-memory implementation. Despite improvements in version 3.0, reducing costs and addressing memory issues would enhance it further. |
Data Scientist at a financial services firm with 10,001+ employees | 4.5 | I primarily use Apache Spark for data processing tasks involving large datasets, appreciating its ease of use and portability. While it's efficient for both small and large datasets, the lack of support for geospatial data is a limitation. |
Data engineer at Cocos pt | 4.5 | We use Apache Spark primarily for Spark SQL and occasionally Spark Streaming, processing data from sources like SAP and Azure Data Warehouse. Its in-memory processing significantly outperforms Hadoop, offering faster data handling and enhanced query optimization. |