
Apache Spark vs Cloudera Data Platform comparison

 

Comparison Buyer's Guide

Executive Summary
Updated on Apr 1, 2025

 

Categories and Ranking

Apache Spark
Average Rating: 8.4
Reviews Sentiment: 7.7
Number of Reviews: 66
Ranking in other categories: Hadoop (1st), Compute Service (5th), Java Frameworks (2nd)

Cloudera Data Platform
Average Rating: 8.0
Reviews Sentiment: 6.4
Number of Reviews: 27
Ranking in other categories: Cloud Master Data Management (MDM) Solutions (10th), Data Management Platforms (DMP) (7th)
 

Featured Reviews

Ilya Afanasyev - PeerSpot reviewer
Reliable, able to expand, and handle large amounts of data well
We use batch processing. It works well with our formats and file versions. There's a lot of functionality. Each hour, our pipeline copies the changes from MongoDB to a specific file. Previously, each run copied all of the data from every table, whether or not it had changed. The tables hold a lot of data, and the latest MongoDB version makes it possible to read only the changed data. This reduced the cost and configuration of the cluster, and we saved about $150,000. The solution is scalable. It's a stable product.
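The review does not say how the changed data is read. As a rough illustration only, the sketch below contrasts a full copy with an incremental copy, assuming each record carries a hypothetical `updated_at` field and the pipeline keeps a watermark between hourly runs. The names `Record`, `full_copy`, and `incremental_copy` are invented for this example and are not MongoDB or Spark APIs.

```python
from dataclasses import dataclass


@dataclass
class Record:
    doc_id: str
    updated_at: int  # hypothetical change timestamp, e.g. epoch seconds
    payload: str


def full_copy(table):
    """Copy every record on every run, changed or not."""
    return list(table)


def incremental_copy(table, watermark):
    """Copy only records changed after the watermark.

    Returns the changed records and the watermark for the next run.
    """
    changed = [r for r in table if r.updated_at > watermark]
    new_watermark = max((r.updated_at for r in changed), default=watermark)
    return changed, new_watermark


table = [
    Record("a", 100, "v1"),
    Record("b", 105, "v1"),
    Record("c", 230, "v2"),
]

# First hourly run: nothing has been copied yet, so every record counts as changed.
batch1, mark = incremental_copy(table, watermark=0)

# One record changes before the second hourly run.
table[0] = Record("a", 300, "v2")
batch2, mark = incremental_copy(table, watermark=mark)

print(len(full_copy(table)), len(batch1), len(batch2))
```

In this toy setup the second incremental run moves one record where a full copy would move all three, which is the kind of saving the reviewer attributes to reading only changed data.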
Miodrag-Stanic - PeerSpot reviewer
Distributed computing improves data processing while upgrade complexity needs addressing
There are challenges with upgrading or updating various services like Spark, Impala, and Hive on on-premise and bare metal solutions. We aim to address these issues with a Kubernetes-based platform that will simplify the task of upgrading services. We also wish to implement lakehouse capabilities with Iceberg or Delta Lake frameworks.

Quotes from Members

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Pros

"The product's initial setup phase was easy."
"With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware."
"One of Apache Spark's most valuable features is that it supports in-memory processing; the execution of jobs is very fast compared to traditional tools."
"The fault tolerant feature is provided."
"The solution is scalable."
"The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast."
"It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
"The product's deployment phase is easy."
"The data platform is pretty neat. The workflow is also really good."
"Distributed computing, secure containerization, and governance capabilities are the most valuable features."
"Integration with other tools works well for us and we successfully scaled the solution after two to three years without any issues."
"The scalability is the key reason why we are on this platform."
"The product offers a fairly easy setup process."
"Ranger for security; with Ranger we can manage users' permissions/access controls very easily."
"We use it for data science activities."
"Ambari Web UI: user-friendly."
 

Cons

"It requires overcoming a significant learning curve due to its robust and feature-rich nature."
"I would like to see integration with data science platforms to optimize the processing capability for these tasks."
"Spark could be improved by adding support for other open-source storage layers than Delta Lake."
"More ML based algorithms should be added to it, to make it algorithmic-rich for developers."
"At the initial stage, the product provides no container logs to check the activity."
"Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."
"We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data."
"For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial."
"The cost of the solution is high and there is room for improvement."
"The version control of the software is also an issue."
"I work a lot with banking, IT and communications customers. Hortonworks must improve or must upgrade their services for these sectors."
"Deleting any service requires a lot of clean up, unlike Cloudera."
"I would like to see more support for containers such as Docker and OpenShift."
"For on-premise use, I would not recommend Cloudera Data Platform as it is expensive and complicated to upgrade."
"It's at end of life and no longer will there be improvements."
"More information could be there to simplify the process of running the product."
 

Pricing and Cost Advice

"Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises."
"Since we are using the Apache-licensed version of Spark, not the Databricks version, support and bug resolution are delayed. The Apache license is free."
"The cloud model can be expensive, as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing."
"Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."
"We are using the free version of the solution."
"It is an open-source solution, it is free of charge."
"Considering the product version used in my company, I feel that the tool is not costly since the product is available for free."
"They provide an open-source license for the on-premise version."
"It is priced well and it is affordable."
"Currently, we are using the product in a sandbox environment, and there is no licensing. We might choose a licensing option once we get the results."
 

Top Industries

By visitors reading reviews
Financial Services Firm: 26%
Computer Software Company: 13%
Manufacturing Company: 8%
Comms Service Provider: 6%
No data available
 

Company Size

By reviewers
Large Enterprise
Midsize Enterprise
Small Business
 

Questions from the Community

What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Apache Spark is open-source, so it doesn't incur any charges.
What needs improvement with Apache Spark?
There is complexity when it comes to understanding the whole ecosystem, especially for beginners. I find it quite complex to understand how a Spark job is initiated, the roles of driver nodes, work...
What do you like most about Hortonworks Data Platform?
Distributed computing, secure containerization, and governance capabilities are the most valuable features.
What is your experience regarding pricing and costs for Hortonworks Data Platform?
The pricing model for Cloudera Data Platform is complex and has increased significantly compared to CDH. Initially, CDH had a straightforward pricing model based on nodes, but CDP includes factors ...
What needs improvement with Hortonworks Data Platform?
There are challenges with upgrading or updating various services like Spark, Impala, and Hive on on-premise and bare metal solutions. We aim to address these issues with a Kubernetes-based platform...
 


Sample Customers

Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Cloudera Data Platform: Information Not Available