Apache Hadoop vs Apache Spark comparison

Cancel
You must select at least 2 products to compare!
Apache Logo
2,630 views|2,223 comparisons
89% willing to recommend
Apache Logo
2,498 views|1,884 comparisons
89% willing to recommend
Comparison Buyer's Guide
Executive Summary

We performed a comparison between Apache Hadoop and Apache Spark based on real PeerSpot user reviews.

Find out what your peers are saying about Snowflake Computing, Oracle, Teradata and others in Data Warehouse.
To learn more, read our detailed Data Warehouse Report (Updated: April 2024).
768,857 professionals have used our research since 2012.
Q&A Highlights
Question: Which is the best RDMBS solution for big data?
Answer: I haven't used SQream personally. However, if you are only considering GPU based rdbms's please check the following https://hackernoon.com/which-gpu-database-is-right-for-me-6ceef6a17505
Featured Review
Quotes From Members
We asked business professionals to review the solutions they use.
Here are some excerpts of what they said:
Pros
"It's good for storing historical data and handling analytics on a huge amount of data.""​​Data ingestion: It has rapid speed, if Apache Accumulo is used.""What I like about Apache Hadoop is that it's for big data, in particular big data analysis, and it's the easier solution. I like the data processing feature for AI/ML use cases the most because some solutions allow me to collect data from relational databases, while Hadoop provides me with more options for newer technologies.""Hadoop is designed to be scalable, so I don't think that it has limitations in regards to scalability.""The solution is easy to expand. We haven't seen any issues with it in that sense. We've added 10 servers, and we've added two nodes. We've been expanding since we started using it since we started out so small. Companies that need to scale shouldn't have a problem doing so.""Most valuable features are HDFS and Kafka: Ingestion of huge volumes and variety of unstructured/semi-structured data is feasible, and it helps us to quickly onboard a new Big Data analytics prospect.""The scalability of Apache Hadoop is very good.""The tool's stability is good."

More Apache Hadoop Pros →

"I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten.""With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware.""The solution is very stable.""The deployment of the product is easy.""Spark can handle small to huge data and is suitable for any size of company.""Apache Spark provides a very high-quality implementation of distributed data processing.""AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI.""It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."

More Apache Spark Pros →

Cons
"The solution needs a better tutorial. There are only documents available currently. There's a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning.""It could be more user-friendly.""General installation/dependency issues were there, but were not a major, complex issue. While migrating data from MySQL to Hive, things are a little challenging, but we were able to get through that with support from forums and a little trial and error.""We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it.""I mentioned it definitely, and this is probably the only feature we can improve a little bit because the terminal and coding screen on Hadoop is a little outdated, and it looks like the old C++ bio screen. If the UI and UX can be improved slightly, I believe it will go a long way toward increasing adoption and effectiveness.""The solution is not easy to use. The solution should be easy to use and suitable for almost any case connected with the use of big data or working with large amounts of data.""The integration with Apache Hadoop with lots of different techniques within your business can be a challenge.""Real-time data processing is weak. This solution is very difficult to run and implement."

More Apache Hadoop Cons →

"In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that.""If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of D allocation.""The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate.""Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn.""Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.""The logging for the observability platform could be better.""The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive.""It requires overcoming a significant learning curve due to its robust and feature-rich nature."

More Apache Spark Cons →

Pricing and Cost Advice
  • "Do take into consider that data storage and compute capacity scale differently and hence purchasing a "boxed" / 'all-in-one" solution (software and hardware) might not be the best idea."
  • "​There are no licensing costs involved, hence money is saved on the software infrastructure​."
  • "This is a low cost and powerful solution."
  • "The price of Apache Hadoop could be less expensive."
  • "If my company can use the cloud version of Apache Hadoop, particularly the cloud storage feature, it would be easier and would cost less because an on-premises deployment has a higher cost during storage, for example, though I don't know exactly how much Apache Hadoop costs."
  • "We don't directly pay for it. Our clients pay for it, and they usually don't complain about the price. So, it is probably acceptable."
  • "The price could be better. Hortonworks no longer exists, and Cloudera killed the free version of Hadoop."
  • "We just use the free version."
  • More Apache Hadoop Pricing and Cost Advice →

  • "Since we are using the Apache Spark version, not the data bricks version, it is an Apache license version, the support and resolution of the bug are actually late or delayed. The Apache license is free."
  • "Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."
  • "We are using the free version of the solution."
  • "Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera."
  • "Apache Spark is an expensive solution."
  • "Spark is an open-source solution, so there are no licensing costs."
  • "On the cloud model can be expensive as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing."
  • "It is an open-source solution, it is free of charge."
  • More Apache Spark Pricing and Cost Advice →

    report
    Use our free recommendation engine to learn which Data Warehouse solutions are best for your needs.
    768,857 professionals have used our research since 2012.
    Answers from the Community
    Anonymous User
    Yuval Klein - PeerSpot reviewerYuval Klein
    Real User

    SQreamDB is a GPU DB. It is not suitable for real-time oltp of course.

    Cassandra is best suited for OLTP database use cases, when you need a scalable database (instead of SQL server, Postgres)
    SQream is a GPU database suited for OLAP purposes. It's the best suite for a very large data warehouse, very large queries needed mass parallel activity since GPU is great in massive parallel workload.

    Also, SQream is quite cheap since we need only one server with a GPU card, the best GPU card the better since we will have more CPU activity. It's only for a very big data warehouse, not for small ones.

    Tristan Bergh - PeerSpot reviewerTristan Bergh
    Real User

    Your best DB for 40+ TB is Apache Spark, Drill and the Hadoop stack, in the cloud.

    Use the public cloud provider's elastic store (S3, Azure BLOB, google drive) and then stand up Apache Spark on a cluster sized to run your queries within 20 minutes. Based on my experience (Azure BLOB store, Databricks, PySpark) you may need around 500 32GB nodes for reading 40 TB of data.

    Costs can be contained by running your own clusters but Databricks manage clusters for you.

    I would recommend optimizing your 40TB data store into the Databricks delta format after an initial parse.

    Russell Rothstein - PeerSpot reviewerRussell Rothstein (PeerSpot)
    Vendor

    Morten, the most popular comparisons of SQream can be found here: www.itcentralstation.com
    The top ones include Cassandra, MemSQL, MongoDB, and Vertica.

    Questions from the Community
    Top Answer:Tools like Apache Hadoop are knowledge-intensive in nature. Unlike other tools in the market currently, we cannot understand knowledge-intensive products straight away. To use Apache Hadoop, a person… more »
    Top Answer:We use Spark to process data from different data sources.
    Top Answer:In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, and do the transformation in a subsecond
    Ranking
    5th
    out of 34 in Data Warehouse
    Views
    2,630
    Comparisons
    2,223
    Reviews
    11
    Average Words per Review
    532
    Rating
    8.0
    1st
    out of 22 in Hadoop
    Views
    2,498
    Comparisons
    1,884
    Reviews
    25
    Average Words per Review
    432
    Rating
    8.7
    Comparisons
    Learn More
    Overview
    The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

    Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflowstructure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory

    Sample Customers
    Amazon, Adobe, eBay, Facebook, Google, Hulu, IBM, LinkedIn, Microsoft, Spotify, AOL, Twitter, University of Maryland, Yahoo!, Cornell University Web Lab
    NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
    Top Industries
    REVIEWERS
    Financial Services Firm38%
    Comms Service Provider25%
    Hospitality Company6%
    Consumer Goods Company6%
    VISITORS READING REVIEWS
    Financial Services Firm27%
    Computer Software Company10%
    Comms Service Provider6%
    University6%
    REVIEWERS
    Computer Software Company30%
    Financial Services Firm15%
    University9%
    Marketing Services Firm6%
    VISITORS READING REVIEWS
    Financial Services Firm25%
    Computer Software Company13%
    Manufacturing Company7%
    Comms Service Provider6%
    Company Size
    REVIEWERS
    Small Business34%
    Midsize Enterprise23%
    Large Enterprise43%
    VISITORS READING REVIEWS
    Small Business15%
    Midsize Enterprise11%
    Large Enterprise75%
    REVIEWERS
    Small Business40%
    Midsize Enterprise19%
    Large Enterprise40%
    VISITORS READING REVIEWS
    Small Business17%
    Midsize Enterprise12%
    Large Enterprise71%
    Buyer's Guide
    Data Warehouse
    April 2024
    Find out what your peers are saying about Snowflake Computing, Oracle, Teradata and others in Data Warehouse. Updated: April 2024.
    768,857 professionals have used our research since 2012.

    Apache Hadoop is ranked 5th in Data Warehouse with 32 reviews while Apache Spark is ranked 1st in Hadoop with 60 reviews. Apache Hadoop is rated 7.8, while Apache Spark is rated 8.4. The top reviewer of Apache Hadoop writes "A file system for data collection that contains needed information and files". On the other hand, the top reviewer of Apache Spark writes "Reliable, able to expand, and handle large amounts of data well". Apache Hadoop is most compared with Azure Data Factory, Microsoft Azure Synapse Analytics, Oracle Exadata, Snowflake and Teradata, whereas Apache Spark is most compared with Spring Boot, AWS Batch, Spark SQL, SAP HANA and Cloudera Distribution for Hadoop.

    We monitor all Data Warehouse reviews to prevent fraudulent reviews and keep review quality high. We do not post reviews by company employees or direct competitors. We validate each review for authenticity via cross-reference with LinkedIn, and personal follow-up with the reviewer when necessary.