
Apache Spark vs Google Cloud Dataflow comparison

 

Comparison Buyer's Guide

Executive Summary

Review summaries and opinions

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Categories and Ranking

Apache Spark
Average Rating: 8.4
Reviews Sentiment: 7.7
Number of Reviews: 66
Ranking in other categories: Hadoop (1st), Compute Service (5th), Java Frameworks (2nd)

Google Cloud Dataflow
Average Rating: 8.0
Reviews Sentiment: 7.3
Number of Reviews: 13
Ranking in other categories: Streaming Analytics (6th)
 

Mindshare comparison

Apache Spark and Google Cloud Dataflow aren’t in the same category and serve different purposes. Apache Spark is ranked in the Hadoop category and holds a mindshare of 17.8%, down 21.4% compared to last year.
Google Cloud Dataflow, on the other hand, is ranked in the Streaming Analytics category and holds a mindshare of 7.1%, down 7.2% compared to last year.
 

Featured Reviews

Ilya Afanasyev - PeerSpot reviewer
Reliable, able to expand, and handle large amounts of data well
We use batch processing. It works well with our formats and file versions, and there's a lot of functionality. Each hour, our pipeline copies changes from MongoDB to a specific file. Previously, the pipeline copied all of the data from every table on each run, even when nothing had changed. The tables hold a lot of data, and the latest MongoDB version makes it possible to read only changed data. This reduced the cost and configuration of the cluster, and we saved about $150,000. The solution is scalable. It's a stable product.
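The saving described in this review comes from copying only the data that changed since the previous run instead of every table on every run. The following is a minimal sketch of that idea in plain Python, not the reviewer's pipeline and not the MongoDB or Spark API. The `updated_at` field, the function names, and the sample records are all hypothetical.

```python
from datetime import datetime, timezone

def select_changed(records, last_checkpoint):
    """Keep only records modified after the previous run's checkpoint."""
    return [r for r in records if r["updated_at"] > last_checkpoint]

def run_hourly_copy(records, last_checkpoint):
    """Return the changed records and the checkpoint for the next run."""
    changed = select_changed(records, last_checkpoint)
    # If nothing changed, the checkpoint stays where it was.
    new_checkpoint = max((r["updated_at"] for r in changed), default=last_checkpoint)
    return changed, new_checkpoint

# Hypothetical sample data standing in for a source collection.
records = [
    {"_id": 1, "updated_at": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)},
    {"_id": 2, "updated_at": datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)},
    {"_id": 3, "updated_at": datetime(2024, 1, 1, 8, 15, tzinfo=timezone.utc)},
]
checkpoint = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)

changed, checkpoint = run_hourly_copy(records, checkpoint)
print([r["_id"] for r in changed])  # only the record modified after 10:00
```

A full copy would process all three records each hour. The incremental version processes one, which is where a smaller cluster and lower cost would come from.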
Jana Polianskaja - PeerSpot reviewer
Build Scalable Data Pipelines with Apache Beam and Google Cloud Dataflow
As a data engineer, I find several features of Google Cloud Dataflow particularly valuable. The ability to test solutions locally using Direct Runner is crucial for development, allowing me to validate pipelines without incurring the costs of full Dataflow jobs. The unified programming model for both batch and streaming processing is exceptional - requiring only minor code adjustments to optimize for either mode. This flexibility extends to language support, with robust implementations in both Java and Python, allowing teams to leverage their existing expertise. The platform's comprehensive monitoring capabilities are another standout feature. The intuitive interface, Grafana integration, and extensive service connectivity make troubleshooting and performance tracking highly efficient. Furthermore, seamless integration with Google Cloud Composer (managed Airflow) enables sophisticated orchestration of data pipelines.
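The unified batch and streaming model mentioned in this review can be illustrated with a small stand-in: one transform reused over a bounded input and an unbounded one. This is a plain Python sketch using generators, not the Apache Beam or Dataflow API. The event fields and function names are hypothetical. In Beam itself, the same idea is expressed by building one pipeline and choosing a runner, such as the Direct Runner for local tests.

```python
from itertools import count, islice

def parse_and_filter(events):
    """Shared transform: drop non-positive amounts, convert the rest to cents."""
    for e in events:
        if e["amount"] > 0:
            yield {"user": e["user"], "cents": round(e["amount"] * 100)}

# Bounded ("batch") input: a finite list, processed to completion.
batch_input = [
    {"user": "a", "amount": 1.5},
    {"user": "b", "amount": -2.0},
    {"user": "c", "amount": 0.25},
]
batch_output = list(parse_and_filter(batch_input))

# Unbounded ("streaming") input: an endless generator of hypothetical events.
def event_stream():
    for i in count(1):
        yield {"user": f"u{i}", "amount": i - 2}  # amounts: -1, 0, 1, 2, ...

# Take the first two results; the transform code itself is unchanged.
stream_output = list(islice(parse_and_filter(event_stream()), 2))

print(batch_output)
print(stream_output)
```

Only the source and the stopping condition differ between the two modes, which mirrors the reviewer's point that switching between batch and streaming needs only minor code adjustments.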

Quotes from Members

 

Pros

"The product’s most valuable features are lazy evaluation and workload distribution."
"We use it for ETL purposes as well as for implementing the full transformation pipelines."
"One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them."
"The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics."
"We use Spark to process data from different data sources."
"This solution provides a clear and convenient syntax for our analytical tasks."
"The solution has been very stable."
"I like that it can handle multiple tasks parallelly. I also like the automation feature. JavaScript also helps with the parallel streaming of the library."
"The most valuable features of Google Cloud Dataflow are the integration, it's very simple if you have the complete stack, which we are using. It is overall very easy to use, user-friendly, and cost-effective if you know how to use it. The solution is very flexible for programmers, if you know how to do scripts or program in Python or any other language, it's extremely easy to use."
"The integration within Google Cloud Platform is very good."
"The solution allows us to program in any language we desire."
"It is a scalable solution."
"Google Cloud Dataflow is useful for streaming and data pipelines."
"I would rate the overall solution a ten out of ten."
"The service is relatively cheap compared to other batch-processing engines."
"The support team is good and it's easy to use."
 

Cons

"Apache Spark should add some resource management improvements to the algorithms."
"The setup I worked on was really complex."
"When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."
"Apart from the restrictions that come with its in-memory implementation, it has been improved significantly up to version 3.0, which is currently in use."
"When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."
"For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial."
"It requires overcoming a significant learning curve due to its robust and feature-rich nature."
"When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."
"Promoting the technology more broadly would help increase its adoption."
"Google Cloud Dataflow should include a little cost optimization."
"When I deploy the product locally, a lot of errors pop up which are not always caught. The solution's error logging is bad. It can take a lot of time to debug the errors. It needs to have better logs."
"The deployment time could also be reduced."
"Occasionally, dealing with a huge volume of data causes failure due to array size."
"There are certain challenges regarding the Google Cloud Composer which can be improved."
"The technical support has slight room for improvement."
"They should do a market survey and then make improvements."
 

Pricing and Cost Advice

"Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure. If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources."
"The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks."
"Apache Spark is an open-source tool."
"Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera."
"On the cloud model can be expensive as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing."
"They provide an open-source license for the on-premise version."
"It is an open-source solution, it is free of charge."
"It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project."
"The tool is cheap."
"The solution is not very expensive."
"The price of the solution depends on many factors, such as how they pay for tools in the company and its size."
"The solution is cost-effective."
"Google Cloud is slightly cheaper than AWS."
"On a scale from one to ten, where one is cheap, and ten is expensive, I rate Google Cloud Dataflow's pricing a four out of ten."
"Google Cloud Dataflow is a cheap solution."
"On a scale from one to ten, where one is cheap, and ten is expensive, I rate the solution's pricing a seven to eight out of ten."
851,174 professionals have used our research since 2012.
 

Top Industries

By visitors reading reviews

Apache Spark
Financial Services Firm: 26%
Computer Software Company: 13%
Manufacturing Company: 8%
Comms Service Provider: 6%

Google Cloud Dataflow
Financial Services Firm: 17%
Manufacturing Company: 13%
Retailer: 11%
Computer Software Company: 10%
 

 

Questions from the Community

What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Apache Spark is open-source, so it doesn't incur any charges.
What needs improvement with Apache Spark?
There is complexity when it comes to understanding the whole ecosystem, especially for beginners. I find it quite complex to understand how a Spark job is initiated, the roles of driver nodes, work...
What do you like most about Google Cloud Dataflow?
The product's installation process is easy...The tool's maintenance part is somewhat easy.
What is your experience regarding pricing and costs for Google Cloud Dataflow?
Pricing is normal. It is part of a package received from Google, and they are not charging us too high.
What needs improvement with Google Cloud Dataflow?
I am not sure, as we built only one job, and it is running on a daily basis. Everything else is managed using BigQuery schedulers and Talend. However, occasionally, dealing with a huge volume of da...
 

Also Known As

Apache Spark: No data available
Google Cloud Dataflow: Google Dataflow
 


 

Sample Customers

Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Google Cloud Dataflow: Absolutdata, Backflip Studios, Bluecore, Claritics, Crystalloids, Energyworx, GenieConnect, Leanplum, Nomanini, Redbus, Streak, TabTale
Find out what your peers are saying about Apache, Cloudera, Amazon Web Services (AWS) and others in Hadoop. Updated: May 2025.