
Apache Spark vs Google Cloud Dataflow comparison

 

Comparison Buyer's Guide

Executive Summary

Review summaries and opinions

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

ROI

Sentiment score: 7.3
Apache Spark reduces operational costs by up to 50%, offering high ROI and efficient performance despite infrastructure expenses.
Sentiment score: 7.2
Google Cloud Dataflow offers substantial cost savings and efficiencies, with organizations experiencing 70% time savings and clear financial benefits.
 

Customer Service

Sentiment score: 6.1
Apache Spark support ranges from vibrant community help to paid vendor plans, with experiences varying based on user needs.
Sentiment score: 7.9
Google Cloud Dataflow customer support experiences vary from slow to effective, with proactive updates and dedicated managers enhancing service.
The fact that no interaction is needed shows their great support since I don't face issues.
Google's support team is good at resolving issues, especially with large data.
Whenever we have issues, we can consult with Google.
 

Scalability Issues

Sentiment score: 7.7
Apache Spark is scalable, efficiently manages large workloads, and is praised for stability, adaptability, and expansive capabilities.
Sentiment score: 7.3
Google Cloud Dataflow is highly rated for scalability, handling large data loads seamlessly and offering dynamic resource optimization.
Google Cloud Dataflow has auto-scaling capabilities, allowing me to add different machine types based on pace and requirements.
As a team lead, I'm responsible for handling five to six applications, but Google Cloud Dataflow seems to handle our use case effectively.
Google Cloud Dataflow can handle large data processing for real-time streaming workloads as they grow, making it a good fit for our business.
 

Stability Issues

Sentiment score: 7.5
Apache Spark is stable and reliable: newer versions have addressed earlier issues, and it is widely used by major tech companies.
Sentiment score: 8.2
Google Cloud Dataflow is reliable and stable, helped by automatic scaling, though minor errors can occur in complex, long-running tasks.
I have not encountered any issues with the performance of Dataflow, as it is stable and backed by Google services.
The job we built has not failed once over six to seven months.
The automatic scaling feature helps maintain stability.
 

Room For Improvement

Google Cloud Dataflow improves integrations, but faces challenges in SDK features, support, authentication, cost, and scalability.
Outside of Google Cloud Platform, it is problematic for others to use it and may require promotion as an actual technology.
I would like to see improvements in consistency and flexibility for schema design for NoSQL data stored in wide columns.
Dealing with a huge volume of data causes failure due to array size.
 

Setup Cost

Google Cloud Dataflow is cost-effective and competitive, with expenses aligned to usage, often cheaper than AWS.
It is part of a package received from Google, and they are not charging us too high.
 

Valuable Features

Google Cloud Dataflow offers seamless integration, flexibility, scalability, cost-effectiveness, and powerful event stream processing for real-time insights.
It supports multiple programming languages such as Java and Python, enabling flexibility without the need to learn something new.
The integration within Google Cloud Platform is very good.
Google Cloud Dataflow's features for event stream processing allow us to gain various insights like detecting real-time alerts.
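As a rough illustration of the kind of real-time alerting the reviewer describes, the sketch below counts events per fixed time window and emits an alert the moment a count reaches a threshold. It is plain Python, not Dataflow or Apache Beam code; the event shape, the `threshold_alerts` name, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, event type).
Event = Tuple[int, str]


def threshold_alerts(
    events: Iterable[Event], window_seconds: int, threshold: int
) -> Iterator[Tuple[str, int]]:
    """Yield (event_type, window_start) when a fixed window reaches the threshold."""
    counts: Counter = Counter()
    for timestamp, event_type in events:
        # Assign the event to the fixed window that contains its timestamp.
        window_start = timestamp - timestamp % window_seconds
        counts[(event_type, window_start)] += 1
        # Emit once, at the moment the count reaches the threshold.
        if counts[(event_type, window_start)] == threshold:
            yield event_type, window_start


# Made-up sample data: three failed logins inside the first 60-second window.
sample_events = [
    (0, "login_failed"),
    (10, "login_failed"),
    (20, "login_failed"),
    (65, "login_failed"),
    (70, "ok"),
]
alerts = list(threshold_alerts(sample_events, window_seconds=60, threshold=3))
print(alerts)  # [('login_failed', 0)]
```

Because the function is a generator, it also works on an endless event source. A production pipeline would need to expire old windows, which this sketch does not do.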
 

Categories and Ranking

Apache Spark
Average Rating: 8.4
Reviews Sentiment: 7.4
Number of Reviews: 66
Ranking in other categories: Hadoop (1st), Compute Service (4th), Java Frameworks (2nd)

Google Cloud Dataflow
Average Rating: 8.0
Reviews Sentiment: 7.3
Number of Reviews: 13
Ranking in other categories: Streaming Analytics (7th)
 

Mindshare comparison

Apache Spark and Google Cloud Dataflow aren’t in the same category and serve different purposes. Apache Spark is ranked in the Hadoop category, where it holds a mindshare of 18.3%, down from 20.4% last year.
Google Cloud Dataflow, on the other hand, is ranked in the Streaming Analytics category, where it holds a mindshare of 6.5%, down from 7.5% last year.
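To compare the two declines on a common footing, the arithmetic can be sketched as follows. This reads the second figure in each sentence as last year's mindshare, which is an assumption; the page does not spell it out. `mindshare_change` is a hypothetical helper.

```python
from typing import Tuple


def mindshare_change(current_pct: float, previous_pct: float) -> Tuple[float, float]:
    """Return (change in percentage points, relative change in percent)."""
    point_change = current_pct - previous_pct
    relative_change = point_change / previous_pct * 100
    return round(point_change, 1), round(relative_change, 1)


# Figures quoted in the mindshare comparison (assumed: current share, last year's share).
spark_change = mindshare_change(18.3, 20.4)
dataflow_change = mindshare_change(6.5, 7.5)

print(spark_change)     # (-2.1, -10.3)
print(dataflow_change)  # (-1.0, -13.3)
```

Under that reading, Spark lost more percentage points, while Dataflow lost a larger fraction of its previous share. The two figures are measured within different categories, so they are not directly comparable.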
 

Featured Reviews

Dunstan Matekenya - PeerSpot reviewer
Open-source solution for data processing with portability
Apache Spark is known for its ease of use. Compared to other available data processing frameworks, it is user-friendly. While many choices now exist, Spark remains easy to use, particularly with Python. You can utilize familiar programming styles similar to Pandas in Python, including object-oriented programming. Another advantage is its portability. I can prototype and perform some initial tasks on my laptop using Spark without needing to be on Databricks or any cloud platform. I can transfer it to Databricks or other platforms, such as AWS. This flexibility allows me to improve processing even on my laptop. For instance, if I'm processing large amounts of data and find my laptop becoming slow, I can quickly switch to Spark. It handles small and large datasets efficiently, making it a versatile tool for various data processing needs.
Jana Polianskaja - PeerSpot reviewer
Build Scalable Data Pipelines with Apache Beam and Google Cloud Dataflow
As a data engineer, I find several features of Google Cloud Dataflow particularly valuable. The ability to test solutions locally using Direct Runner is crucial for development, allowing me to validate pipelines without incurring the costs of full Dataflow jobs. The unified programming model for both batch and streaming processing is exceptional - requiring only minor code adjustments to optimize for either mode. This flexibility extends to language support, with robust implementations in both Java and Python, allowing teams to leverage their existing expertise. The platform's comprehensive monitoring capabilities are another standout feature. The intuitive interface, Grafana integration, and extensive service connectivity make troubleshooting and performance tracking highly efficient. Furthermore, seamless integration with Google Cloud Composer (managed Airflow) enables sophisticated orchestration of data pipelines.
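The "unified programming model" the reviewer mentions can be illustrated with a small, library-free sketch: one transform applied unchanged to a bounded (batch) input and to an unbounded (streaming) input. This is plain Python, not the Apache Beam API, and every name in it is hypothetical. It only shows why the same pipeline logic can serve both modes and be checked locally before a managed job is run.

```python
import itertools
from typing import Iterable, Iterator, Tuple


def parse_positive(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Shared transform: parse 'key,value' lines and keep positive values."""
    for line in records:
        key, raw_value = line.split(",")
        value = int(raw_value)
        if value > 0:
            yield key, value


# Bounded ("batch") input: a finite list that is fully consumed.
batch_input = ["a,1", "b,-2", "c,3"]
batch_result = list(parse_positive(batch_input))


# Unbounded ("streaming") input: a generator that never ends.
def endless_source() -> Iterator[str]:
    for i in itertools.count(1):
        yield f"k{i},{i}"


# The same transform runs lazily; here only the first three results are taken.
stream_result = list(itertools.islice(parse_positive(endless_source()), 3))

print(batch_result)   # [('a', 1), ('c', 3)]
print(stream_result)  # [('k1', 1), ('k2', 2), ('k3', 3)]
```

In a real Beam pipeline, the mode-specific adjustments the reviewer refers to would typically involve the source and windowing configuration rather than the transform itself.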
860,592 professionals have used our research since 2012.
 

Top Industries

By visitors reading reviews

Apache Spark
Financial Services Firm: 27%
Computer Software Company: 12%
Manufacturing Company: 7%
Comms Service Provider: 6%

Google Cloud Dataflow
Financial Services Firm: 18%
Manufacturing Company: 12%
Retailer: 11%
Computer Software Company: 10%
 

Company Size

By reviewers
Large Enterprise
Midsize Enterprise
Small Business
 

Questions from the Community

What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Apache Spark is open-source, so it doesn't incur any charges.
What needs improvement with Apache Spark?
There is complexity when it comes to understanding the whole ecosystem, especially for beginners. I find it quite complex to understand how a Spark job is initiated, the roles of driver nodes, work...
What do you like most about Google Cloud Dataflow?
The product's installation process is easy...The tool's maintenance part is somewhat easy.
What is your experience regarding pricing and costs for Google Cloud Dataflow?
Pricing is normal. It is part of a package received from Google, and they are not charging us too high.
What needs improvement with Google Cloud Dataflow?
I am not sure, as we built only one job, and it is running on a daily basis. Everything else is managed using BigQuery schedulers and Talend. However, occasionally, dealing with a huge volume of da...
 

Also Known As

Apache Spark: No data available
Google Cloud Dataflow: Google Dataflow
 

Overview

 

Sample Customers

Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Google Cloud Dataflow: Absolutdata, Backflip Studios, Bluecore, Claritics, Crystalloids, Energyworx, GenieConnect, Leanplum, Nomanini, Redbus, Streak, TabTale
Find out what your peers are saying about Apache, Cloudera, Amazon Web Services (AWS) and others in Hadoop. Updated: June 2025.