What needs improvement with Apache Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect...

score 0 · Answer 1 · 2025-03-27T16:00:37Z

The problem with Apache Hadoop arose when the guys that originally set it up left the firm, and the group that later owned it didn't have enough technical resources to properly maintain it. This was part of the problem, apart from just aging out due to lack of resources.

Satya Raju Architect - software engineering at Innominds · Answer 2 · 2024-10-15T11:44:00Z

Hadoop lacks OLAP capabilities. I recommend adding a Delta Lake feature to make the data compatible with ACID properties. Also, video and audio streaming import issues could be improved to ensure proper data validation.

Kenechukwu Murphy Ezeoka IT Support Specialist at Convergys Corporation · Answer 3 · 2024-09-04T08:35:24Z

Improvements in security measures would be beneficial, given the large volumes of data handled. Robust security features are essential to prevent data leaks or breaches. Additionally, integrating advanced capabilities similar to those of other solutions would enhance the platform's functionality.

Madhan Potluri Head of Data at an energy/utilities company with 51-200 employees · Answer 4 · 2024-07-09T11:49:46Z

The product's availability of comprehensive training materials could be improved for faster onboarding and skill development among team members.

Sushil Arya Software developer at Fiserv · Answer 5 · 2024-06-25T04:40:43Z

When working with Kafka, I saw that the data arrived incrementally. Incremental data processing is still not very effective in Apache Hadoop. Data that is already there can be processed very effectively, but data that keeps coming in every second cannot. If you want to know the location of some data every second, for example, such data is not processed effectively in Apache Hadoop, so the tool should look into its capability for processing incremental data. Another area where improvements are required is the licensing cost of the tool. If the tool offered a pay-per-use licensing structure, similar to how AWS operates, organizations could get a look and feel for Apache Hadoop. The tool's setup process is also an area for improvement. It is not very simple, because during setup we need to do all the server settings, including port listing and firewall configurations. Other products on the market show that this can be made simpler. There are also shortcomings in the product's technical support: the time frame for resolution needs to be improved, and so does the overall communication from the technical support team.
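The port and firewall work mentioned in this answer can be sanity-checked with a small script. The following is only a minimal sketch, not part of the reviewer's setup: it assumes the stock Hadoop 3.x web UI ports (9870 for the NameNode, 8088 for the ResourceManager, 9864 for the DataNode), which distributions and site configurations often override, and the names `DEFAULT_PORTS`, `port_open`, and `check_services` are hypothetical helpers introduced for illustration.

```python
import socket

# Assumed defaults for a stock Hadoop 3.x install; many clusters override
# these in hdfs-site.xml / yarn-site.xml, so treat them as placeholders.
DEFAULT_PORTS = {
    "NameNode web UI": 9870,
    "ResourceManager web UI": 8088,
    "DataNode web UI": 9864,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host, ports=DEFAULT_PORTS, timeout=1.0):
    """Map each service name to whether its port accepts TCP connections."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}
```

Calling `check_services("localhost")` on a node returns a dictionary of service names to booleans. A `False` entry only says the port did not accept a connection; it cannot distinguish a firewall rule from a daemon that is not running.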

Akhilesh Chipre Senior Associate Consultant at Applied Materials · Answer 6 · 2024-04-11T06:44:03Z

Since it is an open-source product, there won't be much support. So, you have to have deeper knowledge. You need to improvise based on that.

Anand Viswanath Project Manager at Unimity Solutions · Answer 7 · 2024-03-21T12:55:43Z

Tools like Apache Hadoop are knowledge-intensive in nature. Unlike other tools in the market currently, we cannot understand knowledge-intensive products straight away. To use Apache Hadoop, a person needs intensive knowledge, which is something that not everybody can get familiarized with in a straightforward manner. It would be beneficial if navigating through tools like Apache Hadoop is made user-friendly. For non-technical users, if the tool is made easy to navigate, it will be easier to use, and one may not have to depend on experts. The load optimization capabilities of the product are an area of concern where improvements are required. The complex setup phase can be made easier in the future.

Syed Afroz Pasha Head Of Data Governance at Alibaba Group · Answer 8 · 2024-02-27T11:44:49Z

The tool provides functionalities to deal with data skewness or a diverse set of data. There are some configurations that it usually provides. In certain cases, the configurations for dealing with data skewness do not make any sense. We usually have to deal with it using a custom solution. Spark would deal with such cases efficiently. If Hadoop solves the issues the way Spark does, it can compete with Spark at the same level. Hive is a little slower than Spark. Spark is in-memory and parallel processing. Hive is not in-memory, but it is parallel processing.
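The answer does not say what the custom solution for skew looks like. One common approach, shown here purely as an illustrative sketch and not as the reviewer's method, is key salting: a hot key is split into several salted sub-keys so the first aggregation stage spreads across tasks, and a second stage merges the partial results. The function names and the `#` separator are assumptions made for this example, and the code runs in plain Python rather than on a cluster.

```python
import random
from collections import Counter


def salt_key(key, num_salts, rng=random):
    """Append a random salt so one hot key spreads over num_salts sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"


def unsalt_key(salted_key):
    """Recover the original key by dropping the trailing salt suffix."""
    return salted_key.rsplit("#", 1)[0]


def salted_count(records, num_salts=4, rng=random):
    """Two-stage count: aggregate per salted key, then merge per original key."""
    # Stage 1: partial counts on salted keys. On a cluster this is the step
    # that would be distributed, so no single task receives the whole hot key.
    partial = Counter(salt_key(key, num_salts, rng) for key in records)

    # Stage 2: strip the salts and combine the partial counts.
    total = Counter()
    for salted, count in partial.items():
        total[unsalt_key(salted)] += count
    return partial, total
```

With a skewed input such as 1,000 occurrences of one key and 10 of another, `partial` holds up to `num_salts` entries for each key, while `total` matches a plain count. The trade-off is the extra merge stage, which is why salting is usually applied only to keys known to be hot.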

score 0 · Answer 9 · 2023-12-29T12:05:06Z

The main thing is the lack of community support. If you want to implement a new API or create a new file system, you won't find easy support. And then there's the server issue. You have to create and maintain servers on your own, which can be hectic. Sometimes, the configurations in the documentation don't work, and without a strong community to turn to, you can get stuck. That's where cloud services play a vital role. In future releases, the community needs to be improved a lot. We need a better community, and the documentation should be more accurate for the setup process. Sometimes, we face errors even when following the documentation for server setup and configuration. We need better support. Even if we raise a ticket, it takes a long time to get addressed, and they don't offer online support. They ask for screenshots, which takes even more time, instead of direct screensharing or hopping on a call. But it's free, so we can't complain too much.

Miodrag Milojevic Senior Data Architect at Yettel · Answer 10 · 2023-08-01T14:27:00Z

Hadoop isn't so problematic. It deals with file storage and maintenance. It is a network of file operations. The stability of the solution needs improvement.

Yevgen Manzhulyanov CEO at AM-BITS LLC · Answer 11 · 2023-08-01T12:17:28Z

The solution is not easy to use. The solution should be easy to use and suitable for almost any case connected with the use of big data or working with large amounts of data.

Aria Amini Data Engineer at Behsazan Mellat · Answer 12 · 2023-07-26T11:57:00Z

It could be more user-friendly. Other platforms, such as Cloudera, used for big data, are more user-friendly and presented in a more straightforward way. They are also more flexible than Hadoop. Hadoop's scrollback is not easy to use, either.

score 0 · Answer 13 · 2022-09-29T11:28:03Z

In terms of processing speed, I believe that some of this software as well as the Hadoop-linked software can be better. While analyzing massive amounts of data, you also want it to happen quickly. Faster processing speed is definitely an area for improvement. I am not sure about the cloud's technical aspects, whether there are things that happen in the cloud architecture that essentially make it a little slow, but speed could be one. And, second, the Hadoop-linked programs and Hadoop-linked software that are available could do much more and much better in terms of UI and UX. This is probably the only feature we can improve a little bit, because the terminal and coding screen on Hadoop is a little outdated, and it looks like the old C++ BIOS screen. If the UI and UX can be improved slightly, I believe it will go a long way toward increasing adoption and effectiveness.

Yogesh Thakkar Business data analyst at RBSG Internet operations · Answer 14 · 2022-09-05T12:51:45Z

We have plans to increase usage, and this is where we've realized that when we have all these clusters and we're running queries and analyzing, we are facing some latency issues. I think more of the solution needs to be focused around the parallel processing and retrieval of data.

reviewer1040328 IT Expert at a tech services company with 1,001-5,000 employees · Answer 15 · 2022-07-21T16:29:00Z

The price could be better. I think we would use it more, but the company didn't want to pay for it. Hortonworks doesn't exist anymore, and Cloudera killed the free version of Hadoop.

Juliet Hoimonthi Manager at Robi Axiata Limited · Answer 16 · 2022-07-05T06:27:00Z

What could be improved in Apache Hadoop is its user-friendliness. It's not that user-friendly, but maybe it's because I'm new to it. Sometimes it feels so tough to use, but it could be because of two aspects: one is my incompetency, for example, I don't know about all the features of Apache Hadoop, or maybe it's because of the limitations of the platform. For example, my team is maintaining the business glossary in Apache Atlas, but if you want to change any settings at the GUI level, an advanced level of coding or programming needs to be done in the back end, so it's not user-friendly.

DulalMali Data Analytics Practice head at bse · Answer 17 · 2022-04-27T08:19:03Z

The integration of Apache Hadoop with lots of different techniques within your business can be a challenge.

Donghan Kim R&D Head, Big Data Adjunct Professor at SK Communications Co., Ltd. · Answer 18 · 2022-01-14T10:24:00Z

Apache Hadoop's real-time data processing is weak and is not enough to satisfy our customers, so we may have to pick other products. We are continuously researching other solutions and other vendors. Another weak point of this solution, technically speaking, is that it's very difficult to run and difficult to smoothly implement. Preparation and integration are important. The integration of this solution with other data-related products and solutions, and having other functions, e.g. API connectivity, are what I want to see in the next release.

reviewer901065 Partner at a tech services company with 11-50 employees · Answer 19 · 2021-10-05T18:57:00Z

Hadoop's security could be better.

reviewer1464630 Founder & CTO at a tech services company with 1-10 employees · Answer 20 · 2020-12-08T22:10:56Z

I don't have any concerns because each part of Hadoop has its use cases. To date, I haven't implemented a huge product or project using Hadoop, but on the level of POCs, it's fine. The community of Hadoop is now a cluster, and I think there is room for improvement in the ecosystem. From the Apache perspective or the open-source community, they need to add more capabilities to make life easier from a configuration and deployment perspective.

reviewer1433400 Technical Lead at a government with 201-500 employees · Answer 21 · 2020-10-19T09:33:27Z

For the visualization tools, we use Apache Hadoop and it is very slow. It lacks some query language. We have to use Apache Linux. Even so, the query language still has limitations with just a bit of documentation and many of the visualization tools do not have direct connectivity. They need something like BigQuery which is very fast. We need those to be available in the cloud and scalable. The solution needs to be powerful and offer better availability for gathering queries. The solution is very expensive.

score 0 · Answer 22 · 2020-07-14T08:15:56Z

I'm not sure if I have any ideas as to how to improve the product. Every year, the solution comes out with new features. Spark is one new feature, for example. If they could continue to release new helpful features, it will continue to increase the value of the solution. The solution could always improve performance. This is a consistent requirement. Whenever you run it, there is always room for improvement in terms of performance. The solution needs a better tutorial. There are only documents available currently. There's a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning. We would prefer it if users didn't just get pushed through to certification-based learning, as certifications are expensive. Maybe if they could arrange it so that the certification was at a lesser cost. The certification cost is currently around $2,500 or thereabout.

Abhik Ray Co-Founder at Quantic · Answer 23 · 2020-02-07T02:52:00Z

It would be helpful to have more information on how to best apply this solution to smaller organizations, with less data, and grow the data lake.

it_user1093134 Technical Architect at RBSG Internet Operations · Answer 24 · 2019-12-16T08:14:00Z

We're finding vulnerabilities in running it 24/7. We're experiencing some downtime that affects the data. It would be good to have more advanced analytics tools.

score 0 · Answer 25 · 2019-12-16T08:13:00Z

It could be because the solution is open source, and therefore not funded like bigger companies, but we find the solution runs slow. The solution isn't as mature as SQL or Oracle and therefore lacks many features. The solution could use a better user interface. It needs a more effective GUI in order to create a better user environment.

Yevgen Manzhulyanov CEO at AM-BITS LLC · Answer 26 · 2019-11-27T05:42:00Z

What needs improvement depends on the customer and the use case. The classical Hadoop, for example, we consider an old variant. Most now work with flash data. There is a very wide application for this solution, but in enterprise companies, if you work with classical BI systems, it would be good to include an additional presentation layer for BI solutions. There is a lack of virtualization and presentation layers, so you can't take it and implement it like a ready solution.

Lucas Dreyer Data Engineer at BBD · Answer 27 · 2019-09-29T07:27:00Z

Hadoop itself is quite complex, especially if you want it running on a single machine, so getting it set up is a big mission. It seems that Hadoop is on its way out and Spark is the way to go. You can run Spark on a single machine and it's easier to set up. In the next release, I would like to see Hive more responsive for smaller queries, with reduced latency. I don't think that this is viable, but if it is possible, I would like lower latency on smaller guide queries for analysis and analytics. I would also like a smaller version that can be run on a local machine. There are installations that do that, but they are quite difficult, so I would say a smaller version that is easy to install and explore would be an improvement.

reviewer1040328 IT Expert at a tech services company with 1,001-5,000 employees · Answer 28 · 2019-07-28T07:35:00Z

We are using HDTM circuit boards, and I worry about the future of this product and compatibility with future releases. It's a concern because, for now, we do not have a clear path to upgrade. The Hadoop product is in version three and we'd like to upgrade to the third version. But as far as I know, it's not a simple thing. There are a lot of features in this product that are open-source. If something isn't included with the distribution we are not limited. We can take things from the internet and integrate them. As far as I know, we are using Presto which isn't included in HDP (Hortonworks Data Platform) and it works fine. Not everything has to be included in the release. If something is outside of HDP and it works, that is good enough for me. We have the flexibility to incorporate it ourselves.

MahalingamShanmugam Works · Answer 29 · 2019-07-16T01:59:00Z

We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it.

score 0 · Answer 30 · 2018-08-14T07:42:00Z

In general, Hadoop has a lot of different component parts to the platform - things like Hive and HBase - and they're all moving somewhat independently and somewhat in parallel. I think as you look to platforms in the cloud or into walled-garden concepts, like Cloudera or Azure, you see that the third party can make sure all the components work together before they are used for business purposes. That reduces a layer of administration, configuration, and technical support. I would like to see more direct integration of visualization applications.