Powerful data ingestion and consolidation tools prepare the data for predictive analytics
Pros and Cons
- "The most valuable features are powerful tools for ingestion, as data is in multiple systems."
- "It would be helpful to have more information on how to best apply this solution to smaller organizations, with less data, and grow the data lake."
What is our primary use case?
The primary use is as a data lake.
How has it helped my organization?
Using this solution has allowed us to consolidate the data. It has made it possible to write data science-based algorithms for predictive analytics.
What is most valuable?
The most valuable features are powerful tools for ingestion, as data is in multiple systems.
What needs improvement?
It would be helpful to have more information on how to best apply this solution to smaller organizations, with less data, and grow the data lake.
Buyer's Guide
Apache Hadoop
June 2025

Learn what your peers think about Apache Hadoop. Get advice and tips from experienced pros sharing their opinions. Updated: June 2025.
857,028 professionals have used our research since 2012.
For how long have I used the solution?
I have been using Apache Hadoop for two years.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Technical Architect at RBSG Internet Operations
Good database and highly scalable, with good plug and play analytics tools
Pros and Cons
- "The most valuable feature is the database."
- "It would be good to have more advanced analytics tools."
What is our primary use case?
We are primarily dumping all the prior payment transaction data into a Hadoop system and then we use some of the plug and play analytics tools to translate it.
What is most valuable?
The most valuable feature is the database.
What needs improvement?
We're finding vulnerabilities in running it 24/7. We're experiencing some downtime that affects the data.
It would be good to have more advanced analytics tools.
For how long have I used the solution?
I've been using the solution for five years.
What do I think about the scalability of the solution?
The solution is scalable. From a payments perspective, we're using the solution on a large scale.
How are customer service and technical support?
We've never contacted technical support.
Which solution did I use previously and why did I switch?
We didn't previously use a different solution.
How was the initial setup?
The initial setup was complex. There was a lot of data that we had to bring over from various sources and it was quite a long process.
What about the implementation team?
We did have some assistance with the implementation.
What other advice do I have?
We use the on-premises deployment model.
We're more inclined towards an operational data source to fill our customer's needs. Hadoop is good for analytics and some reporting requirements.
It's a good solution for those needing something for the purposes of management reporting.
I'd rate the solution eight out of ten.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Practice Lead (BI/ Data Science) at a tech services company with 11-50 employees
Good for managing and replication of big data but needs a better user interface
Pros and Cons
- "It's good for storing historical data and handling analytics on a huge amount of data."
- "The solution could use a better user interface. It needs a more effective GUI in order to create a better user environment."
What is most valuable?
The solution is perfect for when you have big data. It's good for managing and replication.
It's good for storing historical data and handling analytics on a huge amount of data.
What needs improvement?
It could be because the solution is open source, and therefore not funded like bigger companies, but we find the solution runs slow.
The solution isn't as mature as SQL or Oracle and therefore lacks many features.
The solution could use a better user interface. It needs a more effective GUI in order to create a better user environment.
For how long have I used the solution?
I've been using the solution for seven years.
What do I think about the stability of the solution?
The solution is stable.
What other advice do I have?
I've used the solution under cloud, hybrid and on-premises deployment models.
I'd recommend the solution, but it depends on the company's requirements. If you don't have huge amounts of data, you probably don't need Hadoop. If you need a completely private environment, and you have lots of big data, consider Hadoop. You don't even need to invest in the infrastructure as you can just use a cloud deployment.
I'd rate the solution seven out of ten. I'd rate it higher if it had a better user interface.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
CEO at AM-BITS LLC
Good stability and scalability but the visualization isn't good
Pros and Cons
- "The ability to add multiple nodes without any restriction is the solution's most valuable aspect."
- "There is a lack of visualization and presentation layers, so you can't take it and implement it like a ready-made solution."
What is our primary use case?
We primarily use the solution for the enterprise data hub and big data warehouse extension.
What is most valuable?
The ability to add multiple nodes without any restriction is the solution's most valuable aspect.
What needs improvement?
What needs improvement depends on the customer and the use case. The classical Hadoop, for example, we consider an old variant. Most now work with flash data.
There is a very wide application for this solution, but in enterprise companies, if you work with classical BI systems, it would be good to include an additional presentation layer for BI solutions.
There is a lack of visualization and presentation layers, so you can't take it and implement it like a ready-made solution.
For how long have I used the solution?
We've been working with the solution for three to four years.
What do I think about the stability of the solution?
The solution is stable. It has very good disaster stability and multi-rack configuration.
What do I think about the scalability of the solution?
It is possible to scale the solution. We work with companies that have hundreds of users.
How was the initial setup?
The initial setup might not be straightforward for our customers, but it's easy enough for us to handle. However, if we don't build a proof of concept for the company first, it may take some time and be quite complex. Pilot projects take about three months to deploy and full-spec projects take up to a year because we have to work in all the requirements for data governance, security, etc.
What's my experience with pricing, setup cost, and licensing?
We originally built on Hortonworks tech which didn't require any licensing, but that is getting discontinued in 2022, so it's been proposed we move to Cloudera which will have licensing costs associated with it.
What other advice do I have?
We use the on-premises deployment model. It's a requirement for the company we work with, which is a bank. Often customers demand we work with on-premises deployment models.
I'd rate the solution seven out of ten. In terms of the ability to build middleware and offer scalability, it would be 10 out of 10 from me. However, if you take into account only the visualization, I'd only rate it at three or four out of ten.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Data Engineer at BBD
Good standard features, but a small local-machine version would be useful
Pros and Cons
- "What comes with the standard setup is what we mostly use, but Ambari is the most important."
- "In the next release, I would like to see Hive more responsive for smaller queries and to reduce the latency."
What is our primary use case?
The primary use case of this solution is data engineering and data files.
The deployment model we are using is private, on-premises.
What is most valuable?
We don't use many of the Hadoop features, like Pig or Sqoop, but what I like most is using the Ambari feature. You have to use Ambari; otherwise it is very difficult to configure.
What comes with the standard setup is what we mostly use, but Ambari is the most important.
What needs improvement?
Hadoop itself is quite complex, especially if you want it running on a single machine, so to get it set up is a big mission.
It seems that Hadoop is on its way out and Spark is the way to go. You can run Spark on a single machine and it's easier to set up.
In the next release, I would like to see Hive more responsive for smaller queries and to reduce the latency. I don't know whether this is viable, but if it is possible, I would like lower latency on smaller Hive queries for analysis and analytics.
I would like a smaller version that can be run on a local machine. There are installations that do that but are quite difficult, so I would say a smaller version that is easy to install and explore would be an improvement.
For how long have I used the solution?
I have been using this solution for one year.
What do I think about the stability of the solution?
This solution is stable, but sometimes starting up can be quite a mission. With a full proper setup, it's fine, but it's a lot of work to look after, and to start up and shut down.
What do I think about the scalability of the solution?
This solution is scalable, and I can scale it almost indefinitely.
We have approximately two thousand users: half of them use it directly and another thousand use the products and systems running on it. Fifty are data engineers, fifteen direct appliances, and the rest are business users.
How are customer service and technical support?
There are several forums on the web, and Google search works fine. There is a lot of information available and it often works.
They also have good support in regards to the implementation.
I am satisfied with the support. Generally, there is good support.
Which solution did I use previously and why did I switch?
We used the more traditional database solutions such as SAP IQ and data marts, but now it's changing more towards data science and big data.
We are a smaller infrastructure, so that's how we are set up.
How was the initial setup?
The initial setup is quite complex if you have to set it up yourself. Ambari makes it much easier, but on the cloud or local machines, it's quite a process.
It took at least a day to set it up.
What about the implementation team?
I did not use a vendor. I implemented it myself on the cloud with my local machine.
Which other solutions did I evaluate?
There was an evaluation, but the decision was to implement with a data lake and the Hortonworks Data Platform.
What other advice do I have?
It's good for what it is meant to do, which is handling a lot of big data, but it's not as good for low-latency applications.
If you have to perform quick queries on Hive or analytics, it can be frustrating.
It can be useful for what it was intended to be used for.
I would rate this solution a seven out of ten.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
IT Expert at a tech services company with 1,001-5,000 employees
An inexpensive and flexible suite that helps users integrate varied legacy systems
Pros and Cons
- "The best thing about this solution is that it is very powerful and very cheap."
- "The upgrade path should be improved because it is not as easy as it should be."
What is our primary use case?
We primarily use this product to integrate legacy systems.
How has it helped my organization?
It helps us work with older products and more easily create solutions.
What is most valuable?
The most valuable thing about this program for us is that it is very powerful and very cheap. We're using a lot of the program's modules and features because we're using software and hardware that can be difficult to integrate. For example, we're using supersets and a lot of old products from difficult systems. We love having the various options and features that allow us to work with flexibility.
What needs improvement?
We are using HDTM circuit boards, and I worry about the future of this product and compatibility with future releases. It's a concern because, for now, we do not have a clear path to upgrade. The Hadoop product is in version three and we'd like to upgrade to the third version. But as far as I know, it's not a simple thing.
There are a lot of features in this product that are open-source. If something isn't included with the distribution we are not limited. We can take things from the internet and integrate them. As far as I know, we are using Presto which isn't included in HDP (Hortonworks Data Platform) and it works fine. Not everything has to be included in the release. If something is outside of HDP and it works, that is good enough for me. We have the flexibility to incorporate it ourselves.
For how long have I used the solution?
We have been using the product for about five years.
What do I think about the stability of the solution?
The product is well tested and very stable. We have no problems with the stability of it at all. Really we just install it and forget about fussing with it. We just use the features it offers to be productive.
What do I think about the scalability of the solution?
This is a scalable solution and we like what it does. It is currently serving about 100 users at our organization and it seems like it can handle more easily.
How are customer service and technical support?
We actually have not used technical support. For everything we needed a solution for, we just used Google, and it was enough for us. Sometimes we do have issues, but not often. The issues are mainly to do with the terminals because it's a bit complicated to integrate these other systems. We have managed to solve all the problems up till now.
Which solution did I use previously and why did I switch?
We had a very old version of Hadoop which was already installed by another company and we upgraded it. We didn't really switch we just upgraded what was here.
How was the initial setup?
The initial setup wasn't very easy because of the incredible security, but we have managed to get by that. It's sort of simple, in my opinion, once you get past that part. I think, in all, it took about half of a year. But it wasn't a new deployment, it's an upgrade and the bigger challenge was moving the data. We pretty much just supported the existing product and moved to HDP.
What about the implementation team?
We have everything on-premises and we did the deployment and maintenance.
It took four people. We want to increase usage of Hadoop and we are thinking about it very heavily. We're actually in the process of doing it. At the same time, we are integrating things from other systems to Hadoop.
What other advice do I have?
I would give this product a rating of eight out of ten. It would not be a ten out of ten because of some problems we are having with the upgrade to the newer version. It would have been better for us if these problems were not holding us back. I think eight is good enough.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Works
Reduces cost, saves time, and provides insight into our unstructured data
Pros and Cons
- "The most valuable features are the ability to process the machine data at a high speed, and to add structure to our data so that we can generate relevant analytics."
- "We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it."
What is our primary use case?
We use this solution for our Enterprise Data Lake.
How has it helped my organization?
Using this solution has reduced the overall TCO. It has also improved data processing time for the machine and provides greater insight into our unstructured data.
What is most valuable?
The most valuable features are the ability to process the machine data at a high speed, and to add structure to our data so that we can generate relevant analytics.
What needs improvement?
We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it.
For how long have I used the solution?
More than four years.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Analytics Platform Manager at a consultancy with 10,001+ employees
Parallel processing allows us to get jobs done, but the platform needs more direct integration of visualization applications
Pros and Cons
- "Two valuable features are its scalability and parallel processing. There are jobs that cannot be done unless you have massively parallel processing."
- "I would like to see more direct integration of visualization applications."
What is our primary use case?
We use it as a data lake for streaming analytical dashboards.
How has it helped my organization?
There is a lot of difference. I think the best case is that we are able to drill down to transactional records and really build a root-cause analysis for various issues that might arise, on demand. Because we're able to process in parallel, we don't have to wait for the big data warehouse engine. We process down what the data is and then build it up to an answer, and we can have an answer in an hour rather than 10 hours.
What is most valuable?
- Scalability
- Parallel processing
There are jobs that cannot be done unless you have massively parallel processing; for instance, processing call-detail records for telecom.
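As a rough illustration of the pattern the reviewer describes (not the reviewer's actual pipeline), the sketch below splits a set of call-detail records into partitions, aggregates each partition independently, and merges the partial results. The record layout, the function names, and the use of local threads in place of cluster nodes are all assumptions made for the example; Hadoop itself distributes the map and reduce steps across machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical call-detail records: (caller_id, duration_seconds).
RECORDS = [
    ("alice", 120), ("bob", 30), ("alice", 45),
    ("carol", 300), ("bob", 15), ("carol", 60),
]

def split(records, n_parts):
    """Partition the input, loosely analogous to splitting a file into blocks."""
    return [records[i::n_parts] for i in range(n_parts)]

def map_partition(partition):
    """Map step: total call duration per caller within one partition."""
    totals = Counter()
    for caller, seconds in partition:
        totals[caller] += seconds
    return totals

def reduce_totals(partials):
    """Reduce step: merge the per-partition totals into one result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)

def total_duration_by_caller(records, n_parts=3):
    """Run the map step on each partition concurrently, then reduce."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_partition, split(records, n_parts)))
    return reduce_totals(partials)

if __name__ == "__main__":
    print(total_duration_by_caller(RECORDS))
```

Because each partition is aggregated without reference to the others, the same logic can in principle be spread over as many workers as there are partitions, which is what makes very large record sets tractable.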
What needs improvement?
In general, Hadoop has a lot of different component parts to the platform - things like Hive and HBase - and they're all moving somewhat independently and somewhat in parallel. I think as you look to platforms in the cloud or into walled-garden concepts, like Cloudera or Azure, you see that the third party can make sure all the components work together before they are used for business purposes. That reduces a layer of administration, configuration, and technical support.
I would like to see more direct integration of visualization applications.
For how long have I used the solution?
More than five years.
What do I think about the stability of the solution?
In general, stability can be a challenge. It's hard to say what stability means. You're in an environment that's before production-line manufacturing, where none of the parts relate together exactly as they should. So that can create some instability.
To realize the benefit of these kinds of open-source, big-data environments, you want to use as many different tools as you can get. That brings with it all this overhead of making them work together. It's kind of a blessing and a curse, at the same time: There's a tool for everything.
How are customer service and technical support?
Apache is the open-source foundation that Cloudera and Hortonworks contribute code and some work to. I don't know that there is actually support and structure, per se, for Apache.
We have had premium support at various times with various companies. From the three dominant companies I've worked with - Cloudera, Hortonworks, and MapR - there is a premium support package, but that still only covers their base distribution, not necessarily all the add-ons that are on top of it. That is really a big challenge: to get everything to work together.
Which solution did I use previously and why did I switch?
There are the older relational database technologies: Netezza, SQL Server, MySQL, Oracle, Teradata. All have some advantages and some disadvantages. Most notably, they are all significantly more expensive in terms of the capital expense, rather than the operational expense. They are "walled gardens," so to speak: curated, with a distinct set of tools that work with them, and without the bleeding-edge ingenuity that comes with an open-source platform.
Data warehousing is 30 years old, at least. Big data, in its current form, has only been around for four or five years.
How was the initial setup?
There are capacities in which I have been responsible for setup, administration, and building the applications on those environments. Each of the components is relatively straightforward. The complexity comes from all the different components.
What other advice do I have?
Implement for defined use cases. Don't expect it to all just work very easily.
I would rate this platform a seven out of 10. On the one hand, it's the only place you can use certain functions, and on the other hand, it's not going to put any of the other ones out of business. It's really more of a complement. There is no fundamental battle between relational databases and Hadoop.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
