
Apache Spark vs Cloudera Data Platform comparison

 

Comparison Buyer's Guide

Executive Summary
Updated on Apr 1, 2025

Review summaries and opinions

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Categories and Ranking

Apache Spark
Average Rating: 8.4
Reviews Sentiment: 7.7
Number of Reviews: 66
Ranking in other categories: Hadoop (1st), Compute Service (5th), Java Frameworks (2nd)

Cloudera Data Platform
Average Rating: 8.0
Reviews Sentiment: 6.4
Number of Reviews: 27
Ranking in other categories: Cloud Master Data Management (MDM) Solutions (10th), Data Management Platforms (DMP) (7th)
 

Featured Reviews

Ilya Afanasyev - PeerSpot reviewer
Reliable, able to expand, and handle large amounts of data well
We use batch processing. It works well with our formats and file versions, and there is a lot of functionality. Each hour, our pipeline copies changes from MongoDB to a specific file. Previously, the pipeline copied all of the data on every run, including tables that had not changed. The tables hold a lot of data, and the latest MongoDB version makes it possible to read only the changed data. This reduced the cost and configuration of the cluster, and we saved about $150,000. The solution is scalable. It's a stable product.
Miodrag-Stanic - PeerSpot reviewer
Distributed computing improves data processing while upgrade complexity needs addressing
There are challenges with upgrading or updating various services like Spark, Impala, and Hive on on-premise and bare metal solutions. We aim to address these issues with a Kubernetes-based platform that will simplify the task of upgrading services. We also wish to implement lakehouse capabilities with Iceberg or Delta Lake frameworks.

Quotes from Members

 

Pros

"Apache Spark can do large volume interactive data analysis."
"The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics."
"One of Apache Spark's most valuable features is that it supports in-memory processing; the execution of jobs is very fast compared to traditional tools."
"One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them."
"The most valuable feature of Apache Spark is its flexibility."
"The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations."
"The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
"It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
"Hortonworks should not be expensive at all to those looking into using it."
"The upgrades and patches must come from Hortonworks."
"Ambari Web UI: user-friendly."
"Integration with other tools works well for us and we successfully scaled the solution after two to three years without any issues."
"The scalability is the key reason why we are on this platform."
"Now, using this solution, it is much cheaper to have all of the data available for searching, not in real-time, but whenever there is a pending request."
"Distributed computing, secure containerization, and governance capabilities are the most valuable features."
"It is a scalable platform."
 

Cons

"The solution needs to optimize shuffling between workers."
"The main concern is the overhead of Java when distributed processing is not necessary."
"Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors."
"Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing."
"There could be enhancements in optimization techniques, as there are some limitations in this area that could be addressed to further refine Spark's performance."
"In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."
"Apache Spark lacks geospatial data."
"It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework."
"It would also be nice if there were less coding involved."
"Since Cloudera acquired HDP, it's been bundled with CDH and HDP. However, the biggest challenge is cloud storage integration with Azure, GCP, and AWS."
"Hive performance. If Hive performance increased, Hadoop would replace (not everywhere) traditional databases."
"Deleting any service requires a lot of clean up, unlike Cloudera."
"The version control of the software is also an issue."
"The cost of the solution is high and there is room for improvement."
"Security and workload management need improvement."
"The initial setup may take several hours or days, depending on the challenges faced during installation. It's not always a smooth process due to potential complexities."
 

Pricing and Cost Advice

"The cloud model can be expensive, as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing."
"They provide an open-source license for the on-premise version."
"The product is expensive, considering the setup."
"Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure. If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources."
"Apache Spark is an expensive solution."
"Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera."
"It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project."
"Apache Spark is an open-source tool."
"It is priced well and it is affordable."
"Currently, we are using the product in a sandbox environment, and there is no licensing. We might choose a licensing option once we get the results."
851,471 professionals have used our research since 2012.
 

Top Industries

By visitors reading reviews
Financial Services Firm: 27%
Computer Software Company: 13%
Manufacturing Company: 8%
Comms Service Provider: 6%
No data available
 

Company Size

By reviewers: Large Enterprise, Midsize Enterprise, Small Business
 

Questions from the Community

What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Apache Spark is open-source, so it doesn't incur any charges.
What needs improvement with Apache Spark?
There is complexity when it comes to understanding the whole ecosystem, especially for beginners. I find it quite complex to understand how a Spark job is initiated, the roles of driver nodes, work...
What do you like most about Hortonworks Data Platform?
Distributed computing, secure containerization, and governance capabilities are the most valuable features.
What is your experience regarding pricing and costs for Hortonworks Data Platform?
The pricing model for Cloudera Data Platform is complex and has increased significantly compared to CDH. Initially, CDH had a straightforward pricing model based on nodes, but CDP includes factors ...
What needs improvement with Hortonworks Data Platform?
There are challenges with upgrading or updating various services like Spark, Impala, and Hive on on-premise and bare metal solutions. We aim to address these issues with a Kubernetes-based platform...
 

Sample Customers

Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Cloudera Data Platform: Information Not Available
Find out what your peers are saying about Apache, Cloudera, Amazon Web Services (AWS) and others in Hadoop. Updated: May 2025.