Real User
We are able to ingest huge volumes/varieties of data, but it needs a data visualization tool and enhanced Ambari for management
Pros and Cons
  • "Initially, with RDBMS alone, we had a lot of work and few servers running on-premise and on cloud for the PoC and incubation. With the use of Hadoop and ecosystem components and tools, and managing it in Amazon EC2, we have created a Big Data "lab" which helps us to centralize all our work and solutions into a single repository. This has cut down the time in terms of maintenance, development and, especially, data processing challenges."
  • "Since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done."
  • "Most valuable features are HDFS and Kafka: Ingestion of huge volumes and variety of unstructured/semi-structured data is feasible, and it helps us to quickly onboard a new Big Data analytics prospect."
  • "Based on our needs, we would like to see a tool for data visualization and enhanced Ambari for management, plus a pre-built IoT hub/model. These would reduce our efforts and the time needed to prove to a customer that this will help them."
  • "General installation/dependency issues were there, but were not a major, complex issue. While migrating data from MySQL to Hive, things are a little challenging, but we were able to get through that with support from forums and a little trial and error."

What is our primary use case?

Big Data analytics, customer incubation. 

We host our Big Data analytics "lab" on Amazon EC2. Customers are new to Big Data analytics so we do proofs of concept for them in this lab. Customers bring historical, structured data, or IoT data, or a blend of both. We ingest data from these sources into the Hadoop environment, build the analytics solution on top, and prove the value and define the roadmap for customers.

How has it helped my organization?

Initially, with an RDBMS alone, we had a lot of work spread across a few servers running on-premises and in the cloud for PoCs and incubation. With the use of Hadoop and its ecosystem components and tools, managed in Amazon EC2, we have created a Big Data "lab" which helps us centralize all our work and solutions into a single repository. This has cut down the time spent on maintenance, development and, especially, data processing challenges.

We were using MySQL and PostgreSQL for these engagements, and scaling and processing were not as easy when compared to Hadoop. Also, customers who are embarking on a big journey with semi-structured information prefer to use Hadoop rather than a RDBMS stack. This gives them clarity on the requirements.

In addition, since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done.

Flexibility, ease of data processing, and reduced cost and effort are the three key improvements for us.

What is most valuable?

HDFS and Kafka: Ingestion of huge volumes and variety of unstructured/semi-structured data is feasible, and it helps us to quickly onboard a new Big Data analytics prospect.
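
As a rough sketch of this kind of ingestion (the broker address, topic name, and sample record below are placeholders, not our actual setup), a producer built with the kafka-python client might look like this:

    # Minimal sketch: publishing a semi-structured JSON record to a Kafka topic
    # that a downstream consumer later lands into HDFS. Broker, topic, and the
    # record layout are illustrative placeholders.
    import json
    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers="broker1:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("iot-ingest", {"device_id": "sensor-42", "temperature": 21.7,
                                 "ts": "2018-06-01T10:00:00Z"})
    producer.flush()  # block until the buffered record has been delivered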

What needs improvement?

Based on our needs, we would like to see a tool for data visualization and enhanced Ambari for management, plus a pre-built IoT hub/model. These would reduce our efforts and the time needed to prove to a customer that this will help them.


For how long have I used the solution?

Less than one year.

What do I think about the stability of the solution?

We have a three-node cluster running on cloud by default, and it has been stable so far without any stoppages due to Hadoop or other ecosystem components.

What do I think about the scalability of the solution?

Since this is primarily for customer incubation, there is a need to process huge volumes of data, based on the proof of value engagement. During these processes, we scale the number of instances on demand (using Amazon spot instances), use them for a defined period, and scale down when the PoC is done. This gives us good flexibility and we pay only for usage.
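
As a rough sketch of this scale-up/scale-down pattern with boto3 (the AMI ID, instance type, key pair, and bid price below are placeholders, not our actual configuration):

    # Sketch of the on-demand scale-up/scale-down flow using boto3. All IDs,
    # the instance type, key pair, and bid price are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Scale up: request extra worker nodes as Spot instances for the PoC.
    response = ec2.request_spot_instances(
        SpotPrice="0.10",
        InstanceCount=3,
        LaunchSpecification={
            "ImageId": "ami-0123456789abcdef0",      # hypothetical worker-node AMI
            "InstanceType": "m5.xlarge",
            "KeyName": "bigdata-lab",
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
        },
    )
    request_ids = [r["SpotInstanceRequestId"] for r in response["SpotInstanceRequests"]]

    # Scale down when the PoC is done: cancel the Spot requests (the running
    # instances are then terminated separately once the nodes are decommissioned).
    ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=request_ids)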

How are customer service and support?

Since this is mostly community driven, we get a lot of input from the forums and our in-house staff who are skilled in doing the job. So far, most of the issues we have had during setup or scaling have primarily been on the infrastructure side and not on the stack. For most of the problems we get answers from the community forums.

How was the initial setup?

We didn't have any major issues except for knowledge, so we hired the right person who had hands-on experience with this stack, and worked with the cloud provider to get the right mechanism for handling the stack.

There were general installation/dependency issues, but nothing major or complex. While migrating data from MySQL to Hive, things were a little challenging, but we were able to get through that with support from forums and a little trial and error. In addition, the old PoCs which were migrated had issues connecting directly to Hive; we had to build some user-defined functions to handle that.
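
For reference, MySQL-to-Hive transfers of this kind are typically driven with Sqoop; a hypothetical invocation, wrapped in Python, might look like the following (connection string, credentials, and table names are placeholders):

    # Hypothetical Sqoop invocation for the MySQL-to-Hive step, driven from
    # Python. Connection string, credentials, and table names are placeholders.
    import subprocess

    subprocess.run(
        [
            "sqoop", "import",
            "--connect", "jdbc:mysql://mysql-host/poc_db",
            "--username", "etl_user",
            "--password-file", "/user/etl/.mysql_password",  # stored in HDFS
            "--table", "customer_orders",
            "--hive-import",                 # create and load a matching Hive table
            "--hive-table", "poc.customer_orders",
            "--num-mappers", "4",
        ],
        check=True,
    )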

What's my experience with pricing, setup cost, and licensing?

We normally do not suggest any specific distribution. When it comes to the cloud, our suggestion would be to choose among the different instance types offered by AWS; as we are technology partners of Amazon, this helps with cost savings. For all our PoCs, we stick to the default distribution.

Which other solutions did I evaluate?

None, as this stack is familiar to us and we were sure it could be used for such engagements without much hassle. Our primary criteria were the ability to migrate our existing RDBMS-based PoCs, connectivity via our ETL and visualization tools, and support for semi-structured data for ELT. All three of these criteria were a fit with this stack.

What other advice do I have?

Our general suggestion to any customer is not to blindly compare different options. Rather, list the exact business needs - current and future - then prepare a matrix of product capabilities and evaluate costs and other compliance factors for that specific enterprise.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Software Architect at a tech services company with 10,001+ employees
Real User
Gives us high throughput and low latency for KPI visualization
Pros and Cons
  • "High throughput and low latency. We start with data mashing on Hive and finally use this for KPI visualization."

    What is our primary use case?

    Data aggregation for KPIs. The sources of data come in all forms, so the data is unstructured. We needed high-volume storage and background aggregation of that data.

    How has it helped my organization?

    We start with data mashing on Hive and finally use this for KPI visualization. This intermediate step not only mashes data into the form we want through data cube slicing, but also helps us save states as snapshots for multiple time frames.

    Without this, we would have had to plan another data source for only this purpose. Moving this step closer to processing worked better than keeping it at visualization. Although we can't completely avoid using data stores/snapshots at visualization, this step proved to be promising for getting data ready for better analytics and insights.
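
    A rough sketch of this mash-then-snapshot step, assuming a hypothetical raw KPI table and a partitioned snapshot table queried through PyHive (all table, column, and partition names below are illustrative):

        # Sketch of the mash-then-snapshot step via PyHive. The raw table, the
        # partitioned snapshot table, and all column names are hypothetical;
        # the CUBE grouping is what produces the sliced views.
        from pyhive import hive

        conn = hive.Connection(host="hive-server", port=10000, username="etl")
        cur = conn.cursor()
        cur.execute("""
            INSERT OVERWRITE TABLE kpi_snapshots PARTITION (time_frame='2018-W23')
            SELECT region, product, SUM(revenue) AS revenue
            FROM kpi_events
            WHERE event_week = '2018-W23'
            GROUP BY region, product WITH CUBE
        """)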

    What is most valuable?

    High throughput and low latency. We start with data mashing on Hive and finally use this for KPI visualization.

    What needs improvement?

    At the beginning, the MapReduce jobs behind Hive made me think we should drop down to raw Hadoop MapReduce to have better control of the data. But later, Hive as a platform matured very well. I still think a Spark-type layer on top gives you an edge over having only Hive.
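
    For context, a "Spark-type layer on top" of the same Hive tables can be as small as a PySpark session with Hive support enabled; the table name below is illustrative:

        # Minimal PySpark session with Hive support, so the same Hive tables can
        # be queried while Spark controls execution. The table name is illustrative.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("kpi-aggregation")
            .enableHiveSupport()   # share the metastore the Hive jobs already use
            .getOrCreate()
        )
        spark.sql("SELECT region, SUM(revenue) FROM kpi_events GROUP BY region").show()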

    For how long have I used the solution?

    Less than one year.

    What other advice do I have?

    I rate it an eight out of 10. It's huge, complex, and slow, but it does what it is meant for.

    Disclosure: My company does not have a business relationship with this vendor other than being a customer.
    Database/Middleware Consultant (Currently at U.S. Department of Labor) at a tech services company with 51-200 employees
    Consultant
    There are no licensing costs involved, hence money is saved on software infrastructure
    Pros and Cons
    • "​​Data ingestion: It has rapid speed, if Apache Accumulo is used."
    • "It needs better user interface (UI) functionalities."

    What is our primary use case?

    • Content management solution
    • Unified Data solution
    • Apache Hadoop running on Linux

    What is most valuable?

    • Data ingestion: It has rapid speed, if Apache Accumulo is used.
    • Data security
    • Inexpensive

    What needs improvement?

    It needs better user interface (UI) functionalities.

    For how long have I used the solution?

    Three to five years.

    What's my experience with pricing, setup cost, and licensing?

    There are no licensing costs involved, hence money is saved on the software infrastructure.

    Disclosure: My company does not have a business relationship with this vendor other than being a customer.
    PeerSpot user
    Senior Associate at a financial services firm with 10,001+ employees
    Real User
    Relatively fast when reading data into other platforms but can't handle queries with insufficient memory
    Pros and Cons
    • "As compared to Hive on MapReduce, Impala on MPP returns results of SQL queries in a fairly short amount of time, and is relatively fast when reading data into other platforms like R."
    • "The key shortcoming is its inability to handle queries when there is insufficient memory. This limitation can be bypassed by processing the data in chunks."

    What is most valuable?

    Impala. As compared to Hive on MapReduce, Impala on MPP returns results of SQL queries in a fairly short amount of time, and is relatively fast when reading data into other platforms like R (for further data analysis) or QlikView (for data visualisation).

    How has it helped my organization?

    The quick access to data enabled more frequent data-backed decisions.

    What needs improvement?

    The key shortcoming is its inability to handle queries when there is insufficient memory. This limitation can be bypassed by processing the data in chunks.
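
    A hedged sketch of that chunking workaround, using the impyla client to run one partition-sized query at a time (host, table, and column names are placeholders):

        # Sketch of the chunking workaround with impyla: one partition-sized
        # query at a time instead of a single large query. Names are placeholders.
        from impala.dbapi import connect

        conn = connect(host="impala-daemon", port=21050)
        cur = conn.cursor()

        results = []
        for month in ("2018-01", "2018-02", "2018-03"):
            cur.execute(
                "SELECT account_id, SUM(amount) FROM transactions "
                f"WHERE trx_month = '{month}' GROUP BY account_id"
            )
            results.extend(cur.fetchall())   # each chunk fits in memory on its own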

    For how long have I used the solution?

    Two-plus years.

    What do I think about the stability of the solution?

    Typically instability is experienced due to insufficient memory, either due to a large job being triggered or multiple concurrent small requests.

    What do I think about the scalability of the solution?

    No. This is by default a cluster-based setup and hence scaling is just a matter of adding on new data nodes.

    How are customer service and technical support?

    Not applicable to Cloudera. We have a separate onsite vendor to manage the cluster.

    Which solution did I use previously and why did I switch?

    No. Two years ago this was a new team and hence there were no legacy systems to speak of.

    How was the initial setup?

    Complex. The Cloudera stack itself was insufficient. Integration with other tools like R and QlikView was required, and in-house programs had to be built to create an automated data pipeline.

    What's my experience with pricing, setup cost, and licensing?

    There is not much advice to give, as pricing and licensing are handled at an enterprise level.

    However, do take into consideration that data storage and compute capacity scale differently, and hence purchasing a "boxed"/"all-in-one" solution (software and hardware) might not be the best idea.

    Which other solutions did I evaluate?

    Yes. Oracle Exadata and Teradata.

    What other advice do I have?

    Try open-source Hadoop first, but be aware of the greater implementation complexity. If open-source Hadoop is "too" complex, then consider a vendor-packaged Hadoop solution like Hortonworks, Cloudera, etc.

    Disclosure: My company does not have a business relationship with this vendor other than being a customer.
    Big Data Engineer at a tech vendor with 5,001-10,000 employees
    Vendor
    HDFS allows you to store large data sets optimally. After switching to big data pipelines, our query performance improved a hundred times.

    What is most valuable?

    HDFS allows you to store large data sets optimally.
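
    As a small illustration of landing data in HDFS programmatically, here is a sketch with the HdfsCLI (WebHDFS) client; the NameNode address, user, and paths are placeholders:

        # Sketch of landing a file in HDFS with the HdfsCLI (WebHDFS) client.
        # NameNode address, user, and paths are placeholders.
        from hdfs import InsecureClient

        client = InsecureClient("http://namenode:9870", user="etl")
        client.upload("/data/raw/events/2017-06-01.json",   # destination in HDFS
                      "/local/exports/2017-06-01.json")     # local source file
        print(client.list("/data/raw/events"))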

    How has it helped my organization?

    After switching to big data pipelines, our query performance improved a hundred times.

    What needs improvement?

    Rolling restarts of data nodes could be optimized further. Also, I/O operations could be optimized for better performance.

    For how long have I used the solution?

    I have used Hadoop for over three years.

    What do I think about the stability of the solution?

    Once we had an issue with stability due to a complete shutdown of a cluster. Bringing the cluster back up took a lot of time because of the specific order that needed to be followed.

    What do I think about the scalability of the solution?

    We have not had scalability issues.

    How are customer service and technical support?

    The community is very supportive and provides prompt replies and suggestions on JIRA tickets.

    Which solution did I use previously and why did I switch?

    We didn’t have a previous solution. It was a move from RDBMS to big data.

    How was the initial setup?

    Initial setup of a few nodes was simple, but as we increased the node count it became complex, as we needed to maintain rack topology, etc.
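
    For context, the rack topology mentioned here is usually supplied through a script referenced by net.topology.script.file.name; a hypothetical mapping script might look like this (the IP-to-rack table is made up):

        #!/usr/bin/env python
        # Illustrative rack-awareness topology script (the kind referenced by
        # net.topology.script.file.name): Hadoop passes IPs/hostnames as arguments
        # and expects one rack path per argument. The mapping below is made up.
        import sys

        RACK_MAP = {"10.0.1.": "/dc1/rack1", "10.0.2.": "/dc1/rack2"}
        DEFAULT_RACK = "/dc1/default-rack"

        for host in sys.argv[1:]:
            rack = next((r for prefix, r in RACK_MAP.items() if host.startswith(prefix)),
                        DEFAULT_RACK)
            print(rack)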

    What's my experience with pricing, setup cost, and licensing?

    It’s free and it is open source.

    What other advice do I have?

    I would suggest using this product. We were able to use this for petabytes of data.

    Disclosure: My company does not have a business relationship with this vendor other than being a customer.
    Infrastructure Engineer at Zirous, Inc.
    Real User
    The Distributed File System stores video, pictures, JSON, XML, and plain text all in the same file system.

    What is most valuable?

    The Distributed File System, which is the base of Hadoop, has been the most valuable feature with its ability to store video, pictures, JSON, XML, and plain text all in the same file system.

    How has it helped my organization?

    We do use the Hadoop platform internally, but mostly it is for R&D purposes. However, many of the recent projects that our IT consulting firm has taken on have deployed Hadoop as a solution to store high-velocity and highly variable data sizes and structures, and be able to process that data together quickly and efficiently.

    What needs improvement?

    Hadoop in and of itself stores data with 3x redundancy, and our organization has come to the conclusion that the default 3x results in too much wasted disk space. The user has the ability to change the replication factor, but I believe the Hadoop platform could eventually become more efficient in its redundant data replication. It is an organizational preference and nothing that would impede our organization from using it again, just a small thing I think could be improved.
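
    For reference, the replication tuning described above boils down to the dfs.replication setting in hdfs-site.xml for new data, plus hdfs dfs -setrep for existing paths; a sketch follows (the path and the target factor of 2 are examples only):

        # Illustration of the replication tuning: dfs.replication in hdfs-site.xml
        # sets the cluster default for new files, and existing paths can be changed
        # with "hdfs dfs -setrep". The path and the factor of 2 are examples only.
        import subprocess

        # Drop an archive directory from the default 3x to 2x; -w waits until
        # the re-replication has actually completed.
        subprocess.run(["hdfs", "dfs", "-setrep", "-w", "2", "/data/archive"], check=True)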

    For how long have I used the solution?

    This version was released in January 2016, but I have been working with the Apache Hadoop platform for a few years now.

    What was my experience with deployment of the solution?

    The only issues we found during deployment were errors originating from between the keyboard and the chair. I have set up roughly 20 Hadoop Clusters and mostly all of them went off without a hitch, unless I configured something incorrectly on the pre-setup.

    What do I think about the stability of the solution?

    We have not encountered any stability problems with this platform.

    What do I think about the scalability of the solution?

    We have scaled two of the clusters that we have implemented; one in the cloud, one on-premise. Neither ran into any problems, but I can say with certainty that it is much, much easier to scale in a cloud environment than it is on-premise.

    How are customer service and technical support?

    Customer Service:

    Apache Hadoop is open-source and thus customer service is not really a strong point, but the documentation provided is extremely helpful, more so than that of some of the Hadoop vendors such as MapR, Cloudera, or Hortonworks.

    Technical Support:

    Again, it's open source. There are no dedicated tech support teams that we've come across unless you look to vendors such as Hortonworks, Cloudera, or MapR.

    Which solution did I use previously and why did I switch?

    We started off using Apache Hadoop for our initial Big Data initiative and have stuck with it since.

    How was the initial setup?

    Initial setup was decently straightforward, especially when using Apache Ambari as a provisioning tool. (I highly recommend Ambari.)
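
    One reason Ambari helps after provisioning is that cluster state can be scripted against its REST API; a hypothetical health check might look like this (host, cluster name, and credentials are placeholders):

        # Hypothetical health check against the Ambari REST API once the cluster
        # is provisioned. Host, cluster name, and credentials are placeholders.
        import requests

        resp = requests.get(
            "http://ambari-server:8080/api/v1/clusters/lab_cluster/services",
            params={"fields": "ServiceInfo/state"},
            auth=("admin", "admin"),
        )
        resp.raise_for_status()
        for svc in resp.json()["items"]:
            print(svc["ServiceInfo"]["service_name"], svc["ServiceInfo"]["state"])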

    What about the implementation team?

    We are the implementers.

    What's my experience with pricing, setup cost, and licensing?

    It's open source.

    Which other solutions did I evaluate?

    We solely looked at Hadoop.

    What other advice do I have?

    Try, try, and try again. Experiment with MapReduce and YARN. Fine-tune your processes and you will see some insane processing power results.

    I would also recommend that you have at least a 12-node cluster: two master nodes, eight compute/data nodes, one Hive node (SQL), and one dedicated Ambari node.

    For the master nodes, I would recommend 4-8 cores, 32-64 GB RAM, and 8-10 TB HDD; for the data nodes, 4-8 cores, 64 GB RAM, and 16-20 TB RAID 10 HDD; the Hive node should be around 4 cores, 32-64 GB RAM, and 5-6 TB RAID 0 HDD; and the dedicated Ambari server should have 2-4 cores, 8-12 GB RAM, and 1-2 TB HDD storage.

    Disclosure: My company does not have a business relationship with this vendor other than being a customer.
    Comment from the reviewer (Infrastructure Engineer at Zirous, Inc.):

    We have since partnered with Hortonworks and are researching the Cloudera and MapR spaces right now as well. Though our strong suit is Hortonworks, we do have a good implementation team for any of the distributions.
    Senior Hadoop Engineer with 1,001-5,000 employees
    Vendor
    The heart of Big Data

    What is most valuable?

    • Storage
    • Processing (cost efficient)

    How has it helped my organization?

    With the increase in data size for the business, this horizontally scalable appliance has answered every business question in terms of storage and processing. The Hadoop ecosystem has not only provided a reliable distributed aggregation system but has also allowed room for analytics, which has resulted in great data insights.

    What needs improvement?

    The Apache team is doing a great job, releasing Hadoop versions well ahead of what we can even think of. Any room for improvement is addressed as soon as a new version is released by the ASF. Currently, Apache Oozie 4.0.1 has some compatibility issues with Hadoop 2.5.2.

    For how long have I used the solution?

    2.5 years

    What was my experience with deployment of the solution?

    We had no deployment issues at all.

    What do I think about the stability of the solution?

    We did when we started initially with Hadoop 1.x, which didn't have HA, but now we don't have any stability issues.

    What do I think about the scalability of the solution?

    Hadoop is known for its scalability. Yahoo stores approx. 455 PB in their Hadoop cluster.

    How are customer service and technical support?

    Customer Service:

    It depends on the Hadoop distributor. I would rate Hortonworks 9/10.

    Technical Support:

    I would rate Hortonworks 9/10.

    Which solution did I use previously and why did I switch?

    We previously used Netezza. We switched because our business required a highly scalable appliance like Hadoop.

    How was the initial setup?

    It's a bit complex in terms of building it out on commodity hardware, but it will ease up as the product matures.

    What about the implementation team?

    We used a vendor team who were 9/10.

    What was our ROI?

    Valuable storage and processing at a lower cost than before.

    What's my experience with pricing, setup cost, and licensing?

    The best pricing and licensing depends on the flavor, but remember it is only a good fit if you have very large data sets which cannot be handled by a traditional RDBMS.

    Which other solutions did I evaluate?

    Cloud options.

    What other advice do I have?

    First, understand your business requirements; second, evaluate the scalability and capability of your traditional RDBMS; and finally, if you have reached the tip of the iceberg (RDBMS), then yes, you definitely need an island (Hadoop) for your business. Feasibility checks are important and efficient for any business before you take any crucial step. I would also say, “Don’t always flow with the stream of a river, because sometimes it will lead you to a waterfall, so always research and analyze before you take a ride.”

    Disclosure: My company does not have a business relationship with this vendor other than being a customer.
    IT Expert at a tech services company with 1,001-5,000 employees
    Real User
    A robust open-source software library and framework with many useful tools
    Pros and Cons
    • "I liked that Apache Hadoop was powerful, had a lot of tools, and the fact that it was free and community-developed."
    • "The price could be better. I think we would use it more, but the company didn't want to pay for it. Hortonworks doesn't exist anymore, and Cloudera killed the free version of Hadoop."

    What is our primary use case?

    We used Apache Hadoop mainly for ETL and data analysis.

    What is most valuable?

    I liked that Apache Hadoop was powerful, had a lot of tools, and the fact that it was free and community-developed. 

    What needs improvement?

    The price could be better. I think we would use it more, but the company didn't want to pay for it. Hortonworks doesn't exist anymore, and Cloudera killed the free version of Hadoop.

    For how long have I used the solution?

    I worked with Apache Hadoop for about five years.

    What do I think about the scalability of the solution?

    Apache Hadoop is scalable. We had about 150 people using it at the organization. Some were data scientists, others were from the engineering side, and people from management because Apache Hadoop provided some reports.

    How was the initial setup?

    The initial setup was straightforward. However, it was challenging to make it secure. We managed to do that and implement Kerberos because it's the only way to make Hadoop safe. But it was easy and worked for a few years without any problems. Three people implemented this solution over three months.
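
    As a rough illustration of what the Kerberos setup implies for day-to-day use: authentication is switched on in core-site.xml, and every client needs a valid ticket before HDFS calls succeed (the principal, keytab, and path below are placeholders):

        # What the Kerberos setup means in practice: core-site.xml switches
        # authentication to Kerberos, e.g.
        #   hadoop.security.authentication = kerberos
        #   hadoop.security.authorization  = true
        # and each client must hold a valid ticket before HDFS calls succeed.
        # Principal, keytab, and path are placeholders.
        import subprocess

        subprocess.run(["kinit", "-kt", "/etc/security/keytabs/etl.keytab",
                        "etl@EXAMPLE.COM"], check=True)
        subprocess.run(["hdfs", "dfs", "-ls", "/data"], check=True)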

    What about the implementation team?

    We implemented this solution.

    What's my experience with pricing, setup cost, and licensing?

    The price could be better. Hortonworks no longer exists, and Cloudera killed the free version of Hadoop.

    What other advice do I have?

    On a scale from one to ten, I would give Apache Hadoop a nine.

    Which deployment model are you using for this solution?

    On-premises
    Disclosure: My company does not have a business relationship with this vendor other than being a customer.