What needs improvement with Data Hub?

Question

Please share with the community what you think needs improvement with Data Hub. What are its weaknesses? What would you like to see changed in a future version?

reviewer2847339 · Accepted Answer

I think Data Hub could be improved by better supporting the open source version. Many features have moved to the paid version, which makes it difficult for small companies to run Data Hub without paying, even though it started as an open source project; much of it is now essentially behind a paywall. Another needed improvement is stronger AI-powered metadata discovery. I understand Data Hub has been investing in AI, but the natural language processing behind Data Hub search is not very good, and search results are often inaccurate. The dbt developer experience could also be enhanced, for example by surfacing dbt test failures directly in lineage. Additionally, when we change a schema, some form of risk scoring would be beneficial. Lastly, automated cleanup recommendations would help, because managing orphaned data assets on Data Hub currently takes a lot of manual effort.

Shubham-Agarwal · Answer

In terms of improvements, Data Hub seems more useful for critical or large data pipelines, since small data architectures can be straightforward to understand without it. For complex projects, I have noticed that Data Hub sometimes does not provide a complete picture of the lineage, for example when we fetch data from an API into S3 and then load it into Snowflake. In those cases we have to review the metadata in Data Hub closely.

Henrique dos Anjos · Answer

I know that the integrations are not easy to do, and I believe that is because it's a customized solution. Software developers always need to be involved. It's complicated: every time we want to integrate new sources, we need to raise a ticket or a request to another department. When I worked with Atlan, for example, I was able to connect different sources in a very user-friendly way. I just needed to set up some configurations and connect to the source, without having to be a software developer or write any back-end code. It was simply a feature in the data catalog that let me connect to different kinds of sources. That is what I see as the disadvantage of having a customized solution. I do think Data Hub itself is a very good tool. Years ago I had the opportunity to work with the open-source solution, which had a clear interface and made it easy to connect sources. At Uber, we need to submit a request when we want to integrate new sources.

Regarding Data Hub's intuitiveness for analytics, I would say that some quality dimensions are available to us. For each field or column in a table, it's possible to see the frequency, how many values we have for a specific type or category, and whether there are new or null values or empty columns, along with some other metrics. These are data quality dimensions such as nullability. That is all we have in terms of features. When I was working with Atlan, there was a feature I liked very much: the ability to see a sample. When I clicked on a table, I could see a short sample without needing SQL skills; the data catalog would show a screen with some rows of the table, so I could see what the table represents. This feature was very good, but we don't have it in Data Hub the way it is implemented at Uber. I think it would be a very good feature for analytics.

The integration part could be better, but again, that's because it's a customized solution. I think if they used the native version of the tool, it would be simpler. Improving the integration part and the process of setting up new data quality rules would be important for data governance players like me.

Azhagarasan Annadorai · Answer

Data Hub can be improved with more automation. There are some built-in automations, such as using AI to document definitions of data elements, which is useful. I wonder if it could also automate the classification exercise, possibly using AI to auto-classify direct and indirect PII items.

reviewer2784462 · Answer

The impact is very positive, and there are many benefits for us in using Data Hub: it became easier to implement data governance, create centralized metadata management, improve data discoverability, and manage data in general. One area for improvement, in my opinion, is the initial setup and configuration, which can be complex without prior experience, especially in large-scale environments. The user experience for non-technical users could be further simplified, particularly around advanced metadata concepts. The out-of-the-box governance workflows, for example approvals and certification, could be more prescriptive for customers at early maturity stages. Operational monitoring could also benefit from more native dashboards and alerts. However, these are not blockers, but areas where additional guidance or product enhancements would further accelerate adoption.

reviewer2784771 · Answer

I do not have comments on how Acryl Data can be improved.

reviewer2784384 · Answer

The product cannot be improved in just one area. There are no points in support or documentation that require improvement. There are no improvements needed for Acryl Data that I have not mentioned yet.
