Please share with the community what you think needs improvement with Apache Hadoop.
What are its weaknesses? What would you like to see changed in a future version?
Hadoop's security could be better.
I don't have any concerns because each part of Hadoop has its use cases. To date, I haven't implemented a huge product or project using Hadoop, but on the level of POCs, it's fine. The community of Hadoop is now a cluster, and I think there is room for improvement in the ecosystem. From the Apache perspective or the open-source community, they need to add more capabilities to make life easier from a configuration and deployment perspective.
For the visualization tools, we use Apache Hadoop and it is very slow. It lacks some query language. We have to use Apache Linux. Even so, the query language still has limitations with just a bit of documentation and many of the visualization tools do not have direct connectivity. They need something like BigQuery which is very fast. We need those to be available in the cloud and scalable. The solution needs to be powerful and offer better availability for gathering queries. The solution is very expensive.
I'm not sure if I have any ideas as to how to improve the product. Every year, the solution comes out with new features. Spark is one new feature, for example. If they continue to release new helpful features, it will continue to increase the value of the solution. The solution could always improve performance. This is a consistent requirement. Whenever you run it, there is always room for improvement in terms of performance. The solution needs a better tutorial. There are only documents available currently. There are a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning. We would prefer it if users didn't just get pushed through to certification-based learning, as certifications are expensive. Maybe they could arrange for the certification to cost less. The certification cost is currently around $2,500 or thereabouts.
It would be helpful to have more information on how to best apply this solution to smaller organizations, with less data, and grow the data lake.
We're finding vulnerabilities in running it 24/7. We're experiencing some downtime that affects the data. It would be good to have more advanced analytics tools.
It could be because the solution is open source, and therefore not funded like bigger companies, but we find the solution runs slow. The solution isn't as mature as SQL or Oracle and therefore lacks many features. The solution could use a better user interface. It needs a more effective GUI in order to create a better user environment.
What needs improvement depends on the customer and the use case. The classical Hadoop, for example, we consider an old variant. Most now work with flash data. There is a very wide application for this solution, but in enterprise companies, if you work with classical BI systems, it would be good to include an additional presentation layer for BI solutions. There is a lack of virtualization and presentation layers, so you can't take it and implement it like a radio solution.
Hadoop itself is quite complex, especially if you want it running on a single machine, so getting it set up is a big mission. It seems that Hadoop is on its way out and Spark is the way to go. You can run Spark on a single machine and it's easier to set up. In the next release, I would like to see Hive more responsive for smaller queries, with reduced latency. I don't think that this is viable, but if it is possible, I would like lower latency on smaller queries for analysis and analytics. I would like a smaller version that can be run on a local machine. There are installations that do that, but they are quite difficult, so I would say a smaller version that is easy to install and explore would be an improvement.
We are using HDTM circuit boards, and I worry about the future of this product and compatibility with future releases. It's a concern because, for now, we do not have a clear path to upgrade. The Hadoop product is in version three and we'd like to upgrade to the third version. But as far as I know, it's not a simple thing. There are a lot of features in this product that are open-source. If something isn't included with the distribution we are not limited. We can take things from the internet and integrate them. As far as I know, we are using Presto which isn't included in HDP (Hortonworks Data Platform) and it works fine. Not everything has to be included in the release. If something is outside of HDP and it works, that is good enough for me. We have the flexibility to incorporate it ourselves.
We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it.
In general, Hadoop has a lot of different component parts to the platform - things like Hive and HBase - and they're all moving somewhat independently and somewhat in parallel. I think as you look to platforms in the cloud or into walled-garden concepts, like Cloudera or Azure, you see that the third party can make sure all the components work together before they are used for business purposes. That reduces a layer of administration, configuration, and technical support. I would like to see more direct integration of visualization applications.