Data profiling, data quality reporting
Sometimes a project knows little about its data. IA is good at data profiling and data discovery: it can give insight into data types, formats, uniqueness, completeness, frequency distributions, etc. The other powerful feature of IA is its ability to check data against business rules; it can report how many records violate a rule.
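To give a sense of what that column analysis computes, the same statistics can be approximated in plain SQL. This is a hedged sketch, not IA's internal method; the CUSTOMER table and ACCOUNT_NUM column are hypothetical names.

```sql
-- Approximate column profile for a hypothetical CUSTOMER.ACCOUNT_NUM,
-- similar to what IA's column analysis reports.
SELECT
    COUNT(*)                    AS total_rows,
    COUNT(account_num)          AS non_null_rows,    -- completeness
    COUNT(DISTINCT account_num) AS distinct_values,  -- uniqueness
    MIN(LENGTH(account_num))    AS min_length,       -- format hints
    MAX(LENGTH(account_num))    AS max_length
FROM customer;

-- Frequency distribution: the ten most common values.
SELECT account_num, COUNT(*) AS freq
FROM customer
GROUP BY account_num
ORDER BY freq DESC
FETCH FIRST 10 ROWS ONLY;
```

IA produces these numbers per column automatically; the value of the tool is not having to hand-write such queries for every field.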
Data rules, column analysis, virtual tables
The interface is not the most user-friendly, and performance leaves something to be desired.
The following features are documented in the user guide but do not work:
1. Global Logical Variables (GLVs)
2. Migrating projects. Neither the internal Export/Import method nor the command line interface (CLI) method works 100%; both sometimes error out.
3. When you close a data rule that you opened but did not modify, IA still asks whether you want to save your changes. It is a bit unsettling: you know you did not change anything, yet you start to doubt what you think you know.
My wish list for new features:
1. Ability to use functions on data sources. I do not understand how IBM could miss this. Data source fields are not visible when coding custom expressions: if you have a field called CUSTOMER.ACCOUNT_NUM, you cannot code TRIM(ACCOUNT_NUM). Functions can only be applied to variables, not directly to fields, so my workaround is to create a variable in the rule definition and then bind it in the data rule. I have a rule that manipulates about a dozen fields (concatenate, substring, length, coalesce, etc.), and I had to add twelve lines to the definition that do nothing but refer to these variables, coding seemingly useless conditions like address1 = address1 just so I would have a variable to apply functions to. A huge oversight on IBM's part.
2. Ability to copy a data rule and modify the copy. Right now only rule definitions can be copied, not data rules. Sometimes I need two or more versions of the same rule, and IA forces me to build each one from scratch. This is annoying when version 2 differs only slightly from version 1: if the original took me an hour to code, the new version takes nearly as long, whereas copy-and-modify would take maybe five minutes.
3. The date of last modification. IA only shows the creation date, which is generally useless; the last-modification date is far more important and needs to be available and visible.
4. A file manager, a la Windows Explorer, so I can see the list of rules and sort them by modification date.
5. Enhanced dedup on output. Currently, IA can only exclude duplicates based on the entire record. It should allow deduping on a select set of columns.
6. Feature to select one record from multiple matches in a join. For instance, in Oracle SQL one can use FETCH FIRST 1 ROW ONLY or ROWNUM (TOP 1 is the SQL Server equivalent).
7. Ability to sort the output.
8. New virtual tables take a while to appear. You create one and it does not show up in the list; wait 15 minutes or so and maybe it will, or log out and log back in.
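Items 5 and 6 in the list above come down to the same SQL pattern that IA lacks. As a sketch in standard SQL (table and column names are hypothetical), numbering rows within a group and keeping the first covers both deduping on a subset of columns and picking one record from multiple join matches:

```sql
-- Dedup on a select set of columns / keep one record per match:
-- number rows within each customer_id group, then keep the first.
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
               PARTITION BY c.customer_id   -- dedup key: a subset of columns
               ORDER BY c.updated_at DESC   -- which duplicate survives
           ) AS rn
    FROM customer c
) t
WHERE rn = 1;
```

An equivalent of PARTITION BY with ROW_NUMBER() inside IA would make both wish-list items unnecessary.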
Since 2008.
The tool sometimes crashes or freezes, but the latest version, 11.7, is much more stable than previous ones.
Customer Service:
Scale of 1 to 10: 8. While IBM is excellent at responding to inquiries, it is slow to implement much-needed software fixes. While that is common in the industry, I would still like to see IBM fix software bugs sooner.
Technical Support:
Same as customer service.
Positive
No, I never had the chance.
I have not been involved in setup, but I understand it is very complex and not for the faint of heart.
Excellent!
I was not involved in the selection.
Get the latest version. Compare it with competing products. Know that there are not many experts in this product, and you may pay a premium to hire them.
I am working with one of our enterprise customers, managing their newly established cloud warehouse. They are using Snowflake and we are using dbt to manage all the transformation and views and tables in Snowflake. I am not currently working with Cribl, but I used to work with it for almost three years. Currently, I am working with dbt and Snowflake stack.
dbt is a tool that is basically SQL with a little bit of Python, so the barrier to entry is fairly low; many of the engineers can use it, as can the analysts, and multiple teams from the business side could use it as well if we allowed them. Performance-wise, it mainly depends on the platform that hosts it, whether that is Snowflake, Databricks, or BigQuery. There is not much complication. Of course, there are the benefits of having code: you get a software development lifecycle, and you can use version control, testing, and documentation.
I would say the best and most desirable feature of dbt is the ability to write everything in code. It is doing for data what Ansible and Terraform did for infrastructure as code. Now you can code the pipeline instead of using SSIS, Apache NiFi, or even Informatica PowerCenter. All of those are GUI-based tools: they have a low entry barrier, but you cannot really integrate them into a CI/CD pipeline, for example. With dbt, we can. More recently, with the advances in AI, LLMs, and code assistant agents, we can hugely leverage those with dbt because I can simply ask the agent to write the code or write the model; you cannot really ask an agent to draw an SSIS package or an Apache NiFi flow.
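For a taste of what "pipeline as code" means here, a dbt model is just a templated SELECT in a versioned file. A minimal sketch, with hypothetical model and source names:

```sql
-- models/staging/stg_orders.sql
-- A dbt model is a SELECT statement; dbt materializes it on the warehouse.
{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    amount
from {{ source('shop', 'raw_orders') }}  -- resolved by dbt at compile time
where order_id is not null
```

Because this is plain text, it diffs cleanly, lives in git branches, and runs in a CI/CD pipeline — none of which is practical with a drag-and-drop SSIS package.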
I think that dbt helps us quite a bit because it exposes a little bit of the functionality of Snowflake directly to us. We can use it with ease because we have some experience with Snowflake and we know what controls to adjust. Because we are a team of multiple individuals, we need to collaborate. Without version control, you have to manage the whole codebase one feature at a time, but what we do is we can use branching and different feature branches. Each one of us is working on their own feature branch. We collaborate, we merge our changes, and we can roll back in case we introduced some bugs. I would say the version control feature is a huge bonus or a huge plus.
We are still experimenting with testing and are not using some features yet. We are trying to introduce them gradually because we come from an SSIS background; the team used to work with Microsoft SQL Server Integration Services, and we are still adapting one feature at a time. Currently we are working with the SQL models and with the Jinja templating. We are experimenting with testing, and toward the end of this year we plan to explore the documentation and the data lineage options as well.
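The testing we are experimenting with can start very small. In dbt, a "singular test" is just a SQL file under tests/ that selects the rows violating an assumption; the test fails if any rows come back. A sketch with hypothetical names:

```sql
-- tests/assert_no_negative_amounts.sql
-- dbt runs this query during `dbt test`; any returned row is a failure.
select order_id, amount
from {{ ref('stg_orders') }}
where amount < 0
```

This low entry point is why testing can be adopted one assumption at a time rather than as a big framework migration.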
Compared to GUI-based tools like SSIS, the benefit is that we have more control over the codebase. We can build something of a system where we use macros and templating, speeding up the development cycle. We are now trying to introduce some testing, and we are also using a sort of CI/CD cycle, continuous integration and continuous deployment. I do not believe these features are commonly available together as a whole package; dbt excels in that area.
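As a hypothetical example of the macros and templating mentioned above: a Jinja macro lets you centralize a cleanup rule so every model applies it identically. Names here are illustrative, not from our codebase.

```sql
-- macros/clean_string.sql
-- Reusable cleanup logic, callable from any model.
{% macro clean_string(column_name) %}
    nullif(trim(upper({{ column_name }})), '')
{% endmacro %}

-- Usage inside a model:
-- select {{ clean_string('customer_name') }} as customer_name
-- from {{ ref('stg_customers') }}
```

Changing the rule in one macro updates every model that calls it, which is the development-cycle speedup referred to above.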
I used to have a couple of notes about performance, but lately I have discovered something called dbt Fusion, which dbt Labs claims is much faster at parsing dbt models. However, I would love to see an even more out-of-the-box solution for testing. They treat testing reasonably well so far, but I would love to see more improvement because the whole data testing field is not very mature. It is not like software testing, where you have test suites, test tools, and profilers; data testing is not yet that advanced. I would love for dbt to take the lead on that.
I have been using dbt since September 2024, so almost a year and a half.
I think one of the issues with dbt is upgrading to later versions, because we have designed some functionality that overrides dbt's default behaviors. Every upgrade is a bit of a risk for us; we do not know whether the workarounds we developed will still work in the next version. In terms of stability, however, we have had no issues.
We have not experienced scalability issues so far. We are managing things at a very large scale, and the bottlenecks we see do not come from dbt; they come from Snowflake. Once we scale up Snowflake, dbt has no issue whatsoever.
Besides the issue with the upgrades and the default behaviors for the macros that we overwrote, we did not really need to reach out to dbt support.
The team worked with SSIS before I joined, I believe until two or three years ago; they switched to dbt just before I joined, a year and a half ago. I have also worked with Apache NiFi for some time.
I am not aware of the initial setup because they set up everything before I joined, but they are using dbt Cloud. I do not think there are many difficulties or any hurdles to overcome during setup. You simply link your dbt Cloud account with the Snowflake account and that is it.
In terms of metrics, I do not have exact numbers, but I get a sense of the speed of opening and closing data requests. I am not that familiar with what our squad's Scrum Master tracks, but I believe our burndown chart, an agile metric that measures finished user stories, is the only kind of metric we have at the moment. You do get a sense of accomplishment and of the speed of delivering value.
I would say testing is the thing to focus on. dbt Fusion is something I am not completely familiar with yet, but I need to try it because I think it is a great feature, especially since we are dealing with multiple models; for our use case, 50-plus, almost 100 models, many running at the same time. If you add up all the compile time and parsing time, it becomes significant. dbt Fusion promises parsing in roughly one-tenth of the time, I believe.
I would say you really need to take care of your model and your data model because dbt gives you some freedom. If you do not really know what you are going to do, you can really mess things up. So you need to take care of the model, design your layers, define the responsibilities of each layer, define the criteria of each data layer, define the tests, and that is it.
I would rate this product an eight out of ten.