
AWS Lambda vs Apache Spark vs Google Cloud Dataflow comparison

 

Comparison Buyer's Guide

Executive Summary

Review summaries and opinions

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Mindshare comparison

Apache Spark: Hadoop
AWS Lambda: Compute Service
Google Cloud Dataflow: Streaming Analytics
 

Featured Reviews

Dunstan Matekenya - PeerSpot reviewer
Open-source solution for data processing with portability
Apache Spark is known for its ease of use. Compared to other available data processing frameworks, it is user-friendly. While many choices now exist, Spark remains easy to use, particularly with Python. You can utilize familiar programming styles similar to Pandas in Python, including object-oriented programming. Another advantage is its portability. I can prototype and perform some initial tasks on my laptop using Spark without needing to be on Databricks or any cloud platform. I can transfer it to Databricks or other platforms, such as AWS. This flexibility allows me to improve processing even on my laptop. For instance, if I'm processing large amounts of data and find my laptop becoming slow, I can quickly switch to Spark. It handles small and large datasets efficiently, making it a versatile tool for various data processing needs.
Andrew-Wong - PeerSpot reviewer
Convenience in deployment process with room for code preview improvement
Having a better preview would be helpful. Sometimes, if my Lambda code is too big, it can be inconvenient as I'm unable to see my code when it exceeds a certain size. AWS has a limit, like a three-megabyte limit, beyond which I cannot view or edit the code easily.
Jana Polianskaja - PeerSpot reviewer
Build Scalable Data Pipelines with Apache Beam and Google Cloud Dataflow
As a data engineer, I find several features of Google Cloud Dataflow particularly valuable. The ability to test solutions locally using Direct Runner is crucial for development, allowing me to validate pipelines without incurring the costs of full Dataflow jobs. The unified programming model for both batch and streaming processing is exceptional - requiring only minor code adjustments to optimize for either mode. This flexibility extends to language support, with robust implementations in both Java and Python, allowing teams to leverage their existing expertise. The platform's comprehensive monitoring capabilities are another standout feature. The intuitive interface, Grafana integration, and extensive service connectivity make troubleshooting and performance tracking highly efficient. Furthermore, seamless integration with Google Cloud Composer (managed Airflow) enables sophisticated orchestration of data pipelines.

Quotes from Members

 

Pros

"With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware."
"The main feature that we find valuable is that it is very fast."
"DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort."
"Apache Spark provides a very high-quality implementation of distributed data processing."
"Spark is used for transformations from large volumes of data, and it is usefully distributed."
"One of Apache Spark's most valuable features is that it supports in-memory processing, the execution of jobs compared to traditional tools is very fast."
"Apache Spark is known for its ease of use. Compared to other available data processing frameworks, it is user-friendly."
"It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
"The most valuable features are event-based triggers. They're really good for a reactive style when you want things to happen as soon as something else happens."
"We have no issues with the technical support."
"It's a fairly easy solution to learn."
"The main features of this solution are the ability to integrate multiple AWS applications or external applications very quickly and organize all of them. Additionally, it is easy to use and you can run various programming languages, such as Python, Go, and Java."
"The initial setup is pretty easy."
"I can use the solution to configure and set up all the requirements for testing the application and test code."
"The most valuable feature is that there is no need to implement it in a server because it is a service."
"The most valuable features of AWS Lambda are a serverless and event-driven architecture."
"The support team is good and it's easy to use."
"The solution allows us to program in any language we desire."
"It allows me to test solutions locally using runners like Direct Runner without having to start a Dataflow job, which can be costly."
"The integration within Google Cloud Platform is very good."
"The best feature of Google Cloud Dataflow is its practical connectedness."
"I would rate the overall solution a ten out of ten."
"The most valuable features of Google Cloud Dataflow are scalability and connectivity."
"The most valuable features of Google Cloud Dataflow are the integration, it's very simple if you have the complete stack, which we are using. It is overall very easy to use, user-friendly friendly, and cost-effective if you know how to use it. The solution is very flexible for programmers, if you know how to do scripts or program in Python or any other language, it's extremely easy to use."
 
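Several AWS Lambda reviewers above single out event-based triggers and the serverless, event-driven model. As a rough illustration only, the sketch below shows a Python handler reacting to an S3 upload notification. The bucket name, object key, and sample event are hypothetical, and the event is trimmed to the few fields the handler reads; a real notification carries more.

```python
import json
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Entry point that Lambda calls once for each incoming event."""
    processed = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become "+").
        key = unquote_plus(s3_info["object"]["key"])
        processed.append(f"{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps({"processed": processed})}


# Local invocation with a hand-written, hypothetical sample event.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "reports/2025+Q1.csv"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
print(result["body"])
```

Because the handler is an ordinary function, it can be exercised locally with a sample event before it is wired to a real trigger.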

Cons

"The Spark solution could improve in scheduling tasks and managing dependencies."
"When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."
"The solution needs to optimize shuffling between workers."
"Apache Spark lacks geospatial data."
"The solution must improve its performance."
"The product could improve the user interface and make it easier for new users."
"The solution’s integration with other platforms should be improved."
"Apache Spark provides very good performance The tuning phase is still tricky."
"The metrics and reporting for this solution could be improved."
"If you're running a new application with a significant load, you need to be prepared for potential bottlenecks."
"The 60 seconds limitation with the consumption of the service is really restrictive for a service and the solution can be improved by eliminating that."
"We need to better understand Lambda for different scenarios. We need some joint effort between Amazon and the users to have the users identify how they can really leverage Lambda. It's not about Lambda itself; it's about the practice, the guidance. There needs to be very good documentation. From the user perspective, what exists now is not always enough."
"The runtime could be improved. There are certain use cases where I need a Lambda function to run longer."
"Lambda could be improved in the sense that some of the things done with Lambda function take some time. So the performance could be better and faster."
"It could be cheaper."
"The first time Lambda is started up, it takes some time to spin up an instance for serving the consumer requests. AWS has been trying to solve this in a variety of ways but have not yet managed to do so."
"When I deploy the product in local errors, a lot of errors pop up which are not always caught. The solution's error logging is bad. It can take a lot of time to debug the errors. It needs to have better logs."
"I would like to see improvements in consistency and flexibility for schema design for NoSQL data stored in wide columns."
"The authentication part of the product is an area of concern where improvements are required."
"Google Cloud Dataflow should include a little cost optimization."
"Promoting the technology more broadly would help increase its adoption."
"The solution's setup process could be more accessible."
"Google Cloud Data Flow can improve by having full simple integration with Kafka topics. It's not that complicated, but it could improve a bit. The UI is easy to use but the experience could be better. There are other tools available that do a better job."
"The deployment time could also be reduced."
 
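One of the AWS Lambda complaints above concerns the delay the first time a function starts up. The sketch below does not remove that delay; it only illustrates, under simplified assumptions, the distinction behind it: module-level code runs once when an execution environment is created, while the handler body runs on every invocation. The `_CONFIG` dictionary and its table name are hypothetical stand-ins for expensive setup such as creating clients or loading configuration.

```python
# Module-level code runs once per execution environment (the cold start).
# Hypothetical stand-in for costly one-time setup.
_CONFIG = {"table": "example-table"}
_invocations = 0


def lambda_handler(event, context):
    """Runs on every invocation and reuses the module-level state above."""
    global _invocations
    _invocations += 1
    return {
        "cold_start": _invocations == 1,  # True only for the first call here
        "invocation": _invocations,
        "table": _CONFIG["table"],
    }


# Simulate two invocations landing on the same execution environment.
first = lambda_handler({}, None)
second = lambda_handler({}, None)
print(first["cold_start"], second["cold_start"])
```

Keeping one-time setup at module level means later invocations in the same environment skip it, which is why only the first request pays the startup cost the reviewer describes.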

Pricing and Cost Advice

"Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."
"I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten."
"Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure. If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources."
"It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project."
"It is an open-source platform. We do not pay for its subscription."
"Considering the product version used in my company, I feel that the tool is not costly since the product is available for free."
"Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera."
"It is an open-source solution, it is free of charge."
"Price-wise, AWS Lambda is very cheap. It's not free, but it's not that expensive."
"AWS Lambda license is paid on a monthly basis."
"The pricing is on-demand and based on runs or times that are billed out monthly."
"It costs maybe less than $10 per month in my use case."
"For licensing, we pay a yearly subscription."
"The price is expensive and is based on usage. The more users you have the higher the cost."
"It computes by the cycle, and it's very cheap."
"I think the price is okay. However, if they add more functionality, they can have better prices. In fact, they should have better and more flexible packages for clients who have greater consumption of Lambda."
"The price of the solution depends on many factors, such as how they pay for tools in the company and its size."
"On a scale from one to ten, where one is cheap, and ten is expensive, I rate the solution's pricing a seven to eight out of ten."
"The solution is not very expensive."
"On a scale from one to ten, where one is cheap, and ten is expensive, I rate Google Cloud Dataflow's pricing a four out of ten."
"Google Cloud Dataflow is a cheap solution."
"The solution is cost-effective."
"The tool is cheap."
"Google Cloud is slightly cheaper than AWS."
858,649 professionals have used our research since 2012.
 

Top Industries

By visitors reading reviews
Apache Spark: Financial Services Firm 27%, Computer Software Company 13%, Manufacturing Company 7%, Comms Service Provider 6%
AWS Lambda: Educational Organization 51%, Financial Services Firm 11%, Computer Software Company 7%, Manufacturing Company 5%
Google Cloud Dataflow: Financial Services Firm 18%, Manufacturing Company 12%, Retailer 11%, Computer Software Company 10%
 

Company Size

By reviewers
Large Enterprise
Midsize Enterprise
Small Business
 

Questions from the Community

What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Apache Spark is open-source, so it doesn't incur any charges.
What needs improvement with Apache Spark?
There is complexity when it comes to understanding the whole ecosystem, especially for beginners. I find it quite com...
Which is better, AWS Lambda or Batch?
AWS Lambda is a serverless solution. It doesn’t require any infrastructure, which allows for cost savings. There is n...
What do you like most about AWS Lambda?
The tool scales automatically based on the number of incoming requests.
What is your experience regarding pricing and costs for AWS Lambda?
The pricing of AWS Lambda is reasonable. It's beneficial and cost-effective for users regardless of the number of ins...
What do you like most about Google Cloud Dataflow?
The product's installation process is easy...The tool's maintenance part is somewhat easy.
What is your experience regarding pricing and costs for Google Cloud Dataflow?
Pricing is normal. It is part of a package received from Google, and they are not charging us too high.
What needs improvement with Google Cloud Dataflow?
I am not sure, as we built only one job, and it is running on a daily basis. Everything else is managed using BigQuer...
 

Also Known As

Apache Spark: No data available
AWS Lambda: No data available
Google Cloud Dataflow: Google Dataflow
 

Overview

 

Sample Customers

Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
AWS Lambda: Netflix
Google Cloud Dataflow: Absolutdata, Backflip Studios, Bluecore, Claritics, Crystalloids, Energyworx, GenieConnect, Leanplum, Nomanini, Redbus, Streak, TabTale
Find out what your peers are saying about Apache, Cloudera, Amazon Web Services (AWS) and others in Hadoop. Updated: June 2025.