Real User
We are able to ingest huge volumes/varieties of data, but it needs a data visualization tool and enhanced Ambari for management
Pros and Cons
  • "Initially, with RDBMS alone, we had a lot of work and few servers running on-premise and on cloud for the PoC and incubation. With the use of Hadoop and ecosystem components and tools, and managing it in Amazon EC2, we have created a Big Data "lab" which helps us to centralize all our work and solutions into a single repository. This has cut down the time in terms of maintenance, development and, especially, data processing challenges."
  • "Since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done."
  • "Most valuable features are HDFS and Kafka: Ingestion of huge volumes and variety of unstructured/semi-structured data is feasible, and it helps us to quickly onboard a new Big Data analytics prospect."
  • "Based on our needs, we would like to see a tool for data visualization and enhanced Ambari for management, plus a pre-built IoT hub/model. These would reduce our efforts and the time needed to prove to a customer that this will help them."
  • "General installation/dependency issues were there, but were not a major, complex issue. While migrating data from MySQL to Hive, things were a little challenging, but we were able to get through that with support from forums and a little trial and error."

What is our primary use case?

Big Data analytics, customer incubation. 

We host our Big Data analytics "lab" on Amazon EC2. Customers are new to Big Data analytics so we do proofs of concept for them in this lab. Customers bring historical, structured data, or IoT data, or a blend of both. We ingest data from these sources into the Hadoop environment, build the analytics solution on top, and prove the value and define the roadmap for customers.

How has it helped my organization?

Initially, with RDBMS alone, we had a lot of work and few servers running on-premise and on cloud for the PoC and incubation. With the use of Hadoop and ecosystem components and tools, and managing it in Amazon EC2, we have created a Big Data "lab" which helps us to centralize all our work and solutions into a single repository. This has cut down the time in terms of maintenance, development and, especially, data processing challenges. 

We were using MySQL and PostgreSQL for these engagements, and scaling and processing were not as easy when compared to Hadoop. Also, customers who are embarking on a big journey with semi-structured information prefer to use Hadoop rather than an RDBMS stack. This gives them clarity on the requirements.

In addition, since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done.

Flexibility, ease of data processing, and reduced cost and effort are the three key improvements for us.

What is most valuable?

HDFS and Kafka: Ingestion of huge volumes and variety of unstructured/semi-structured data is feasible, and it helps us to quickly onboard a new Big Data analytics prospect.

What needs improvement?

Based on our needs, we would like to see a tool for data visualization and enhanced Ambari for management, plus a pre-built IoT hub/model. These would reduce our efforts and the time needed to prove to a customer that this will help them.

Buyer's Guide
Apache Hadoop
April 2024
Learn what your peers think about Apache Hadoop. Get advice and tips from experienced pros sharing their opinions. Updated: April 2024.
769,334 professionals have used our research since 2012.

For how long have I used the solution?

Less than one year.

What do I think about the stability of the solution?

We have a three-node cluster running on cloud by default, and it has been stable so far without any stoppages due to Hadoop or other ecosystem components.

What do I think about the scalability of the solution?

Since this is primarily for customer incubation, there is a need to process huge volumes of data, based on the proof of value engagement. During these processes, we scale the number of instances on demand (using Amazon spot instances), use them for a defined period, and scale down when the PoC is done. This gives us good flexibility and we pay only for usage.
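The pay-for-usage point can be illustrated with a little arithmetic. The sketch below is only an illustration: the function name, node counts, hours, and hourly rate are hypothetical and are not taken from this review or from Amazon's price list, and a single flat rate is assumed for simplicity (real spot prices vary).

```python
def cluster_cost_cents(base_nodes, burst_nodes, burst_hours, total_hours, rate_cents):
    """Compare an always-on cluster with one that scales up only for a PoC.

    All inputs are hypothetical; rate_cents is an assumed flat hourly
    price per node, expressed in cents to keep the arithmetic exact.
    """
    # Always-on: base and burst nodes both run for the whole period.
    always_on = (base_nodes + burst_nodes) * total_hours * rate_cents
    # Elastic: base nodes run for the whole period, burst nodes only during the PoC.
    elastic = base_nodes * total_hours * rate_cents + burst_nodes * burst_hours * rate_cents
    return always_on, elastic


# Hypothetical month: 3 base nodes, 7 extra nodes for a 72-hour PoC,
# 720 hours in the month, 10 cents per node-hour.
always_on, elastic = cluster_cost_cents(3, 7, 72, 720, 10)
print(f"always-on: ${always_on / 100:.2f}, elastic: ${elastic / 100:.2f}")
```

Under these assumed numbers the elastic setup costs a little over a third of the always-on one; the actual ratio depends entirely on how long the extra nodes are needed.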

How are customer service and support?

Since this is mostly community driven, we get a lot of input from the forums and our in-house staff who are skilled in doing the job. So far, most of the issues we have had during setup or scaling have primarily been on the infrastructure side and not on the stack. For most of the problems we get answers from the community forums.

How was the initial setup?

We didn't have any major issues except for knowledge, so we hired the right person who had hands-on experience with this stack, and worked with the cloud provider to get the right mechanism for handling the stack.

General installation/dependency issues were there, but were not a major, complex issue. While migrating data from MySQL to Hive, things were a little challenging, but we were able to get through that with support from forums and a little trial and error. In addition, the old PoCs which were migrated had issues in directly connecting to Hive. We had to build some user functions to handle that.

What's my experience with pricing, setup cost, and licensing?

We normally do not suggest any specific distributions. When it comes to cloud, our suggestion would be to choose among the different instance types offered by Amazon for cost savings; we are technology partners of Amazon. For all our PoCs, we stick to the default distribution.

Which other solutions did I evaluate?

None, as this stack is familiar to us and we were sure it could be used for such engagements without much hassle. Our primary criteria were the ability to migrate our existing RDBMS-based PoC and connectivity via our ETL and visualization tool. On top of that, support for semi-structured data for ELT. All three of these criteria were a fit with this stack.

What other advice do I have?

Our general suggestion to any customer is not to blindly look and compare different options. Rather, list the exact business needs - current and future - and then prepare a matrix to see product capabilities and evaluate costs and other compliance factors for that specific enterprise.
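One way to build such a matrix is a simple weighted score per option. The sketch below is a minimal illustration, not a method described by the reviewer: the function name, criteria, weights, option names, and ratings are all hypothetical placeholders to be replaced with an enterprise's own needs.

```python
from typing import Dict


def score_options(weights: Dict[str, float],
                  ratings: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return a weighted average rating for each option.

    weights maps each criterion to its importance; ratings maps each
    option to its rating per criterion (here on an assumed 1-5 scale).
    """
    total_weight = sum(weights.values())
    scores = {}
    for option, option_ratings in ratings.items():
        weighted = sum(weights[c] * option_ratings[c] for c in weights)
        scores[option] = round(weighted / total_weight, 2)
    return scores


# Hypothetical criteria and ratings, for illustration only.
weights = {"scalability": 3, "cost": 2, "compliance": 1}
ratings = {
    "Option A": {"scalability": 5, "cost": 3, "compliance": 4},
    "Option B": {"scalability": 3, "cost": 5, "compliance": 4},
}
scores = score_options(weights, ratings)
best = max(scores, key=scores.get)
print(scores, best)
```

The weights are where the current and future business needs enter; changing them can change which option comes out on top, which is the point of listing needs before comparing products.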

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Business data analyst at RBSG Internet operations
Real User
Top 20
A low-cost solution that allows us to download data, but has latency issues when running queries
Pros and Cons
  • "One valuable feature is that we can download data."
  • "I think more of the solution needs to be focused around the parallel processing and retrieval of data."

What is our primary use case?

We use the solution as a data lake for our customer payment and SaaS information. We get data from various sources and then utilize and leverage that data.

What is most valuable?

One valuable feature is that we can download data. Another is that it is a low-cost solution. Hadoop has also made it feasible to have all the data available in one area.

What needs improvement?

We have plans to increase usage, and this is where we've realized that when we have all these clusters and we're running queries and analyzing, we face some latency issues. I think more of the solution needs to be focused around the parallel processing and retrieval of data.

For how long have I used the solution?

I have been using this solution for about seven or eight years. 

What do I think about the stability of the solution?

This is a stable product. 

What do I think about the scalability of the solution?

The scalability of the solution is good. Approximately 100 people are currently using this solution within our company. 

How are customer service and support?

I would rate the tech support as a four out of five. 

How would you rate customer service and support?

Positive

What other advice do I have?

I would recommend this product to others. I would rate it as an eight out of ten. 

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Vice President - Finance & IT at a consumer goods company with 1-10 employees
Real User
Great micro-partitions, helpful technical support and quite stable
Pros and Cons
  • "The solution is easy to expand. We haven't seen any issues with it in that sense. We've added 10 servers, and we've added two nodes. We've been expanding ever since we started using it, because we started out so small. Companies that need to scale shouldn't have a problem doing so."
  • "The solution needs a better tutorial. There are only documents available currently. There are a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning."

What is our primary use case?

As an example of a use case, when I was a contractor for Cisco, we were processing mobile network data and the volume was too big. RDBMS was not supporting anything. We started using the Hadoop framework to improve the process and get the results faster.

What is most valuable?

The data is stored in micro-partitions, which makes processing very fast compared to RDBMS systems. Apache Spark processes data in memory, and it's much better than MapReduce.

Micro-partitions and the HDFS are both excellent features.

What needs improvement?

I'm not sure if I have any ideas as to how to improve the product.

Every year, the solution comes out with new features. Spark is one new feature, for example. If they could continue to release new helpful features, it will continue to increase the value of the solution.

The solution could always improve performance. This is a consistent requirement. Whenever you run it, there is always room for improvement in terms of performance.

The solution needs a better tutorial. There are only documents available currently. There are a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning.

We would prefer it if users didn't just get pushed through to certification-based learning, as certifications are expensive. Maybe they could arrange for the certification to cost less. The certification cost is currently around $2,500 or thereabouts.

For how long have I used the solution?

I've been using the solution for four years.

What do I think about the stability of the solution?

We haven't had too many problems with stability. For the POC we used a small amount of data and we started with 10 nodes. We're gradually increasing it now to 40 nodes. We haven't seen any issues after the small teething period in the beginning. The configuration issues and the performance issues have subsided. Once we learned how to stack everything, it has been much better.

What do I think about the scalability of the solution?

The solution is easy to expand. We haven't seen any issues with it in that sense. We've added 10 servers, and we've added two nodes. We've been expanding ever since we started using it, because we started out so small. Companies that need to scale shouldn't have a problem doing so.

We are supporting a multitenancy model and we get the data on supporting the users. I would say, per organization, we have eight to 10 users and probably have a total of around 40 users across the board.

How are customer service and technical support?

We started on the solution as a POC. Once we got into production, we had some minor issues. We get great support. They shared advice and helped us tweak some things in terms of the configurations. We've been satisfied with the level of service we've been provided.

Which solution did I use previously and why did I switch?

We have only ever used Apache Hadoop, or a version of it. When we looked for the commercial tier, there was Cloudera and Hortonworks. We started with Hortonworks because at that time we felt it was cost-effective. However, Cloudera merged with Hortonworks, and now it's all basically the same solution.

How was the initial setup?

The initial setup was a little complex the first time around. We were new to the system, and we didn't have any expertise at that time. Once we got some support and insights into how to work everything properly, it went more smoothly.

First, we started with a POC - proof of concept. It took a couple of days in terms of understanding and configuring everything, etc. When we went to production, it was a couple of hours for deployment and we put into practice everything we learned from the POC.

There's definitely a learning curve. It's stable for us now. 

We have a team of developers doing multiple tasks on the solution, and a few of them are taking care of Hadoop, so we do have a few people handling maintenance.

What about the implementation team?

As we were new to the solution, we found we needed some outside assistance to guide us. However, that was for the POC. In the end, I did it myself. 

What other advice do I have?

We're just a customer. We don't have a business relationship with Hadoop. 

My day-to-day job is data modeling and architecting.

Originally we used it as an open-source solution. We downloaded it, then we went for a commercial version of it.

In terms of advice, I'd tell other potential users that whether the solution is right for them depends on a few items. If the data volume is too big, it's IoT data, or the stream of data is too much, this solution can handle it and I would definitely recommend Apache Hadoop. 

Recently, in the last 18 months, I've been working with Snowflake on a data lake project, and I am really impressed with it. I got a certification, and we started using Snowflake for our data lake environment.

I'd rate the solution eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Real User
Reduces cost, saves time, and provides insight into our unstructured data
Pros and Cons
  • "The most valuable features are the ability to process the machine data at a high speed, and to add structure to our data so that we can generate relevant analytics."
  • "We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it."

What is our primary use case?

We use this solution for our Enterprise Data Lake.

How has it helped my organization?

Using this solution has reduced the overall TCO. It has also improved processing time for machine data and provides greater insight into our unstructured data.

What is most valuable?

The most valuable features are the ability to process the machine data at a high speed, and to add structure to our data so that we can generate relevant analytics.

What needs improvement?

We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it.

For how long have I used the solution?

More than four years.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Senior Hadoop Engineer with 1,001-5,000 employees
Vendor
The heart of BigData

What is most valuable?

  • Storage
  • Processing (cost efficient)

How has it helped my organization?

With the increase in data size for the business, this horizontally scalable appliance has answered every business question in terms of storage and processing. The Hadoop ecosystem has not only provided a reliable distributed aggregation system but has also allowed room for analytics, which has resulted in great data insights.

What needs improvement?

The Apache team is doing a great job and releasing Hadoop versions well ahead of what we can think about. Every area for improvement is addressed as soon as a version is released by the ASF. Currently, Apache Oozie 4.0.1 has some compatibility issues with Hadoop 2.5.2.

For how long have I used the solution?

2.5 years

What was my experience with deployment of the solution?

We did not encounter any issues with deployment.

What do I think about the stability of the solution?

We had stability issues when we started with Hadoop 1.x, which didn't have HA, but now we don't have any stability issues.

What do I think about the scalability of the solution?

Hadoop is known for its scalability. Yahoo stores approx. 455 PB in their Hadoop cluster.

How are customer service and technical support?

Customer Service:

It depends on the Hadoop distributor. I would rate Hortonworks 9/10.

Technical Support:

I would rate Hortonworks 9/10.

Which solution did I use previously and why did I switch?

We previously used Netezza. We switched because our business required a highly scalable appliance like Hadoop.

How was the initial setup?

It's a bit complex in terms of build around for commodities, but soon it will ease up as the product matures.

What about the implementation team?

We used a vendor team who were 9/10.

What was our ROI?

Valuable storage and processing with a lower cost than previously.

What's my experience with pricing, setup cost, and licensing?

Pricing is among the best, and licensing depends on the distribution, but remember it is only worthwhile if you have a very large data set that cannot be handled by a traditional RDBMS.

Which other solutions did I evaluate?

Cloud options.

What other advice do I have?

First, understand your business requirement; second, evaluate the scalability and capability of a traditional RDBMS; and finally, if you have reached the tip of an iceberg (RDBMS), then yes, you definitely need an island (Hadoop) for your business. Feasibility checks are important for any business before it takes any crucial step. I would also say “Don’t always flow with the stream of a river, because sometimes it will lead you to a waterfall, so always research and analyze before you take a ride.”

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Partner at a tech services company with 11-50 employees
Real User
Highly elastic and stable, but it needs better security
Pros and Cons
  • "Hadoop is extensible — it's elastic."
  • "Hadoop's security could be better."

What is our primary use case?

There are several use cases for Hadoop. Sometimes it's used for data warehousing. Other times, it's analytics. And in some cases, it's used to do transformation. For example, I have one client using it to decompress, compress, or encrypt data on ingestion. So, he used it like an ETL engine.

What is most valuable?

Hadoop is extensible — it's elastic.

What needs improvement?

Hadoop's security could be better.

For how long have I used the solution?

I've been using Hadoop for about eight years. I'm not sure exactly.

What do I think about the stability of the solution?

Performance is one of the reasons people choose Hadoop.

What do I think about the scalability of the solution?

Scalability is one of Hadoop's strong suits.

How are customer service and support?

I've never had to use Hadoop support. 

How was the initial setup?

The complexity of Hadoop's setup depends on the customer and their needs. However, most of my customers wind up using Hadoop as a service, which makes it very easy. It doesn't need much maintenance. My staff maintains multiple systems, so it's not like there would ever be somebody dedicated to one, and Hadoop is not a high-touch platform.

What other advice do I have?

I rate Hadoop seven out of 10. It's very good, but it could always be better. To anyone considering Hadoop, I recommend that you be mindful of what you're trying to achieve.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: My company has a business relationship with this vendor other than being a customer: Implementer
PeerSpot user
IT Expert at a comms service provider with 1,001-5,000 employees
Real User
Top 20
A robust open source software library and framework with many useful tools
Pros and Cons
  • "I liked that Apache Hadoop was powerful, had a lot of tools, and the fact that it was free and community-developed."
  • "The price could be better. I think we would use it more, but the company didn't want to pay for it. Hortonworks doesn't exist anymore, and Cloudera killed the free version of Hadoop."

What is our primary use case?

We used Apache Hadoop mainly for ETL and data analysis.

What is most valuable?

I liked that Apache Hadoop was powerful, had a lot of tools, and the fact that it was free and community-developed. 

What needs improvement?

The price could be better. I think we would use it more, but the company didn't want to pay for it. Hortonworks doesn't exist anymore, and Cloudera killed the free version of Hadoop.

For how long have I used the solution?

I worked with Apache Hadoop for about five years.

What do I think about the scalability of the solution?

Apache Hadoop is scalable. We had about 150 people using it at the organization. Some were data scientists, others were from the engineering side, and some were from management, because Apache Hadoop provided some reports.

How was the initial setup?

The initial setup was straightforward. However, it was challenging to make it secure. We managed to do that and implement Kerberos because it's the only way to make Hadoop safe. But it was easy and worked for a few years without any problems. Three people implemented this solution over three months.

What about the implementation team?

We implemented this solution.

What's my experience with pricing, setup cost, and licensing?

The price could be better. Hortonworks no longer exists, and Cloudera killed the free version of Hadoop.

What other advice do I have?

On a scale from one to ten, I would give Apache Hadoop a nine.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
CEO at AM-BITS LLC
Real User
Top 10
Good stability and scalability but the visualization isn't good
Pros and Cons
  • "The ability to add multiple nodes without any restriction is the solution's most valuable aspect."
  • "There is a lack of visualization and presentation layers, so you can't take it and implement it like a ready-made solution."

What is our primary use case?

We primarily use the solution for the enterprise data hub and big data warehouse extension.

What is most valuable?

The ability to add multiple nodes without any restriction is the solution's most valuable aspect.

What needs improvement?

What needs improvement depends on the customer and the use case. The classical Hadoop, for example, we consider an old variant. Most now work with flash data.

There is a very wide application for this solution, but in enterprise companies, if you work with classical BI systems, it would be good to include an additional presentation layer for BI solutions.

There is a lack of visualization and presentation layers, so you can't take it and implement it like a ready-made solution.

For how long have I used the solution?

We've been working with the solution for three to four years.

What do I think about the stability of the solution?

The solution is stable. It has very good disaster stability and multi-rack configuration.

What do I think about the scalability of the solution?

It is possible to scale the solution. We work with companies that have hundreds of users.

How was the initial setup?

The initial setup might not be straightforward for our customers, but it's easy enough for us to handle. However, if we don't build a proof of concept for the company first it may take some time and be quite complex. Pilot projects take about three months to deploy and full spec projects take up to a year because we have to work in all requirements in data governance, security, etc.

What's my experience with pricing, setup cost, and licensing?

We originally built on Hortonworks tech which didn't require any licensing, but that is getting discontinued in 2022, so it's been proposed we move to Cloudera which will have licensing costs associated with it.

What other advice do I have?

We use the on-premises deployment model. It's a requirement for the company we work with, which is a bank. Often customers demand we work with on-premises deployment models.

I'd rate the solution seven out of ten. In terms of the ability to build middleware and offer scalability, it would be 10 out of 10 from me. However, if you take into account only the visualization, I'd only rate it at three or four out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user