What is our primary use case?
We used it for ETL to transform data from flat files, CSV files, and databases. We used PostgreSQL for the connections. We would either import the data into our database, if it came in from clients, or export it to files, if clients wanted files or if a vendor needed to import the files into their database.
How has it helped my organization?
The biggest benefit is that it's a low-code solution. When you hire junior ETL developers or engineers, people who may have a schooling background but no real experience with ETL or coding for ETL, this UI-based, low-code solution lets them make something happen within weeks instead of, potentially, months.
Because it's low-code, it was faster to build with; while I could technically have done everything in Python alone, that would definitely have taken longer than using Pentaho. In addition, by being able to standardize pipelines to handle the onboarding process for new clients, we significantly reduced development costs. To put that in perspective: before I led the effort to standardize things, it would typically take about a week to build a feed from start to finish, sometimes more depending on how complicated it was. With this solution, that week was reduced to an afternoon, about three hours. That was a significant difference.
Instead of paying a developer a full week's worth of work, which could be $2,500 or more, it cut it down to three hours or about $300. That's a big difference.
What is most valuable?
The fact that it's a low-code solution is valuable. It's good for more junior people who may not be as experienced with programming. In our case, we didn't have a huge data set. We had small and medium-sized data sets, so it worked fine.
The fact that it's open source is also helpful in that, if a junior engineer knows they are going to use it in a job, they can download it themselves, locally, for free, and use test data to learn it.
My role was to use it to write one feed that could facilitate multiple clients. Given that it was an open-source, free solution, it was pretty robust in what it could do. I could make lookup tables and databases and map different clients, and I could use the same feed for 30 clients or 50 clients. It got the job done for our use case.
In addition, you can install it wherever you need it. We had installed versions in the cloud and I also had local versions.
What needs improvement?
If you're working with a larger data set, I'm not sure it would be the best solution. The larger things got, the slower it was.
It was also somewhat buggy at times. When we ran a flow, it didn't go from a perceived start to end, node by node; everything kicked off at once, because Pentaho starts every step in a transformation simultaneously and streams rows between them. That meant there were times when it would get ahead of itself and a job would fail, not because the job was wrong, but because something processed before it was supposed to. There were nodes you could add to make sure that all the prerequisite nodes had finished before a given node kicked off, but that was a bit tedious.
There were also caching issues; we had to write a script to clear the cache every time we opened the program, because the cache would fill up and it wouldn't run. I don't know how hard that would be for them to fix, or whether it was fixed in version 10.
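To give an idea, here is a minimal Python sketch of that kind of startup script. The install path is hypothetical, and the cache location assumes the Karaf cache directory that recent PDI versions use; verify both for your setup.

import shutil
import subprocess
from pathlib import Path

# Hypothetical install location; adjust to where data-integration is unpacked.
PDI_HOME = Path("/opt/pentaho/data-integration")

# PDI keeps an OSGi (Karaf) cache that can grow stale and block startup.
# This path is an assumption based on the default layout; verify it for
# your version.
cache_dir = PDI_HOME / "system" / "karaf" / "caches"

if cache_dir.exists():
    shutil.rmtree(cache_dir)  # clear the stale cache before launching

# Launch the Spoon UI with a clean cache.
subprocess.run([str(PDI_HOME / "spoon.sh")], check=True)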
Also, the UI is a bit outdated, but I'm more of a fan of function over how something looks.
One other thing that would have helped was better documentation and support on the internet: how to do things, how to set it up. There are some sites on how to install it, and Pentaho does have a help repository, but it wasn't always the most useful.
For how long have I used the solution?
I used Hitachi Lumada Data Integration (Pentaho) for three years.
What do I think about the stability of the solution?
In terms of the stability of the solution, as I noted, I wouldn't use it for large data sets. But for small to midsize companies that are looking for a low-code solution that isn't going to break the budget, it's a great tool for them to use.
It worked and it was stable enough, once we figured out the little quirks and how to get around them. It mostly handled our production workflows without issue.
What do I think about the scalability of the solution?
I think it could scale, but only up to a point. I didn't test it on larger data sets, but people I've talked to who have worked with larger data sets wouldn't recommend it for that, although that is hearsay.
In my former company, there were about five people in the data engineering department who were using the solution in their roles as ETL data integration specialists.
In that company, it's their go-to solution and I think it will work for everything they need. When I was there, I tried opening pathways to different tools, but there were so many feeds already on it, it worked for what they needed, and it's low-code and open source, so I think they'll stick with it. As they gain more clients, they'll increase their usage of it.
How was the initial setup?
The initial setup wasn't that complicated. You have to set the job environment variables, and that was probably the most complicated part, especially if you're not familiar with it; a sketch of that step follows below. Otherwise, it was just a matter of downloading the version needed, installing it, and learning how to use the different components. Overall, it was pretty easy and straightforward.
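For illustration, here is a minimal Python sketch of seeding those variables through kettle.properties, which is where PDI picks up job environment variables. The variable names and values below are hypothetical examples, not the ones we actually used.

import os
from pathlib import Path

# PDI reads variables from .kettle/kettle.properties under KETTLE_HOME,
# falling back to the user's home directory when KETTLE_HOME is not set.
kettle_dir = Path(os.environ.get("KETTLE_HOME", Path.home())) / ".kettle"
kettle_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical variable names; jobs reference them as ${DB_HOST}, etc.
properties = {
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "INPUT_DIR": "/data/incoming",
    "OUTPUT_DIR": "/data/outgoing",
}

# Note: this overwrites any existing file; a real script would merge
# with the variables already defined there.
with open(kettle_dir / "kettle.properties", "w") as f:
    for name, value in properties.items():
        f.write(f"{name}={value}\n")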
The first time we deployed it, not knowing what we were doing, it took a couple of days, but that was mainly troubleshooting and figuring out what we were doing wrong because we hadn't used it before. After that, it would take maybe 30 minutes or an hour.
In terms of maintenance for Pentaho, one developer is typically assigned per feed. It depends on the workflow of the company and how many feeds are needed. In our case, five people were involved.
What was our ROI?
It saved us a lot of money. Given that it's open source, the three years that I used it, and the fact that the company had been using it for several years prior, a lot of money was definitely saved by using Pentaho versus something else.
What's my experience with pricing, setup cost, and licensing?
If a company is looking for an ETL solution to integrate with its tech stack but doesn't want to spend a bunch of money, Pentaho is a good solution. SSIS cores were $10,000 apiece; although I don't know what they cost nowadays, they're expensive.
Pentaho is a nice option without having to pay an arm and a leg. We even had a complicated data set and Pentaho was able to handle pretty much every type of scenario, if we thought about it creatively enough. I would recommend it for a company in that position.
Which other solutions did I evaluate?
While the capabilities of Pentaho are good enough for light work, I've started using Alteryx Designer, and it is so much more robust in everything that you can do in real time. I've also used SSIS.
When you run something in Pentaho, you can click on each node to see its output, but it's hard to really change anything. For example, if I queried data from a database and put it into a "select" step, and I wanted to reorganize within the select based on something like the first initial of someone's name, it provided that option. But when I did, it would sometimes throw an error and I'd have to run the feed again to see the result.
The nodes, or the components, in Pentaho can probably do about 70 percent of what you can do in Alteryx. Don't get me wrong, Pentaho worked for what we needed it for, with just a few quirks. But as a data engineer, I'm always interested in and excited to work with new technologies that may offer different benefits. In this case, one of the benefits is that each node in Alteryx has many more capabilities in real time. I can look at the data that's coming into the node and the data that's going out. There was a way to do that in Pentaho, if you right-clicked and looked, but it would tell you the fields that were coming in and out and not necessarily the data. It's nice to be able to troubleshoot, on the spot, node-by-node, if you're having an issue. You can do that easily with Alteryx.
In addition to being able to look at data coming in and out of the node, you can also sort it easily and filter it within each data node in Alteryx, and that is something you can't do in Pentaho.
Another cool thing with Alteryx, although it's a very small difference, is that you don't have to save the workflow before you run it. Pentaho forces you to do that. Of course, it's always good to save.
What other advice do I have?
A good thing about Pentaho is that it's not that hard to learn, from an ETL perspective. Things are laid out pretty intuitively in the panel: your input (flat file, CSV, or database) and then the transformation nodes.
It was a good baseline and a good open-source tool to use to learn ETL. It's good to have exposure to multiple tools because every company has different needs and, depending on their needs, it would be a different recommendation.
The lessons I learned using it: Make sure you clear the cache when you open the program. If there are any critical points in your flow that depend on previous nodes, make sure you put blocking steps in. And make sure you set up the job environment variables correctly so that Pentaho runs properly.
It worked for what we did there but, personally, I wouldn't use it in my current role. At the new company I'm working for, we use large financial data sets and I'm not sure it could handle them. I know there's an Enterprise version, but I didn't use it.
The solution can handle everything from ingestion through export, but you still need a batch or Python script to run it as part of an automated process. I don't know if the Lumada version offers something different but, with what I was using, you were simply building the pipeline; the pipeline had to be scheduled and run outside of the program, and we had other tools to check that the output was as expected. A sketch of that kind of wrapper script follows.
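As a rough illustration, this is the shape of the wrapper we would schedule via cron or a similar scheduler. It assumes PDI's kitchen.sh command-line runner for jobs (pan.sh is the equivalent for transformations); the paths and the parameter name are hypothetical.

import subprocess
import sys

KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"  # hypothetical install path
JOB = "/etl/jobs/client_feed.kjb"                     # hypothetical job file

result = subprocess.run(
    [
        KITCHEN,
        f"-file={JOB}",
        "-level=Basic",           # logging verbosity
        "-param:CLIENT_ID=acme",  # hypothetical named parameter for the job
    ],
    capture_output=True,
    text=True,
)

print(result.stdout)

# Kitchen exits non-zero when the job fails; pass that along so the
# scheduler can flag the failure.
if result.returncode != 0:
    print(result.stderr, file=sys.stderr)
    sys.exit(result.returncode)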
We used version 7 for a while and were reluctant to upgrade to version 9 because we had an 834 configuration, meaning a government-standardized feed that our developer had spent two years building. There was an issue whenever we tried to run those feeds on version 9, so we were reluctant to upgrade while things were working on 7. We eventually found out that it didn't take much work to fix the problem we were having with version 9 and we moved to it. With every version upgrade of anything, there are going to be pros and cons.
Depending on what someone needs it for, if it's a small project and they don't want to pay for an enterprise solution, I would recommend it and give it a nine out of 10. The finicky things were a little frustrating, but the fact that it's free, can be deployed easily, and can fulfill a lot of needs on a small scale, are pluses. If it were for a larger company that needed an enterprise solution, I wouldn't recommend it. In that case, it would be a one out of 10.
For a smaller company or one with a smaller budget, a company that doesn't have highly complex ETL needs, Pentaho is definitely a great option. If a company has the budget and has really specific needs and large data sets, I would suggest looking elsewhere.
Disclosure: I am a real user, and this review is based on my own experience and opinions.