We are using just the simple features of this product.
We're using it as a data warehouse and then for building dimensions.
The shortcoming in version 7 is that we are unable to connect to Google Cloud Storage (GCS) to write results from Pentaho. I can connect to S3 using Pentaho 8, but when I try the same for GCS, I'm unable to connect. With people moving from on-premises deployments to the cloud, be it S3, Azure, or Google, we need a plugin that lets us interact with these cloud vendors.
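As a rough illustration of the kind of workaround this gap forces, the sketch below is a small standalone Python script, assuming the google-cloud-storage client library and a service-account key file (neither is a Pentaho feature, and all names are placeholders), that uploads a result file produced by a Pentaho job to a GCS bucket.

```python
# Sketch of a workaround, not a Pentaho feature: push a result file that a
# Pentaho job has written locally up to a GCS bucket with the
# google-cloud-storage client library. Bucket name, key file, and file
# names are hypothetical placeholders.
from google.cloud import storage

def upload_result_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> None:
    # Credentials come from an explicit service-account key file here;
    # GOOGLE_APPLICATION_CREDENTIALS would also work with storage.Client().
    client = storage.Client.from_service_account_json("service-account.json")
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    print(f"Uploaded {local_path} to gs://{bucket_name}/{blob_name}")

if __name__ == "__main__":
    upload_result_to_gcs("output/results.csv", "my-analytics-bucket", "pentaho/results.csv")
```

A native connector inside Pentaho would remove the need for this extra hop after the job finishes.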
I would like to see improvements made for real-time data processing. It is something that I will be looking out for.
We have been using Pentaho Data Integration for three years.
For all of the features that we have been using, it is a stable product.
In terms of data loading and processes, the scalability is good.
We have a team of four people who are using it for analytics.
As we are using the Community Version, we have not been in contact with technical support. Instead, we rely on forums and websites when we need to resolve a problem.
In the past, I have worked with Talend, as well as SAP BO Data Services (BODS). However, that was with another company. This organization started with Pentaho and we are still using it.
It is a straightforward setup process. It took between three and four hours to complete.
We are using the Community Version, which is available free of charge.
The price of the regular version is not reasonable and it should be lower.
My advice for anybody who is researching this product is that if they want to do batch processing, then this is a good choice. The amount of data that it loads and processes is good.
Based on the features that I have used and my experience, I would rate this solution a seven out of ten.
We're using it for data warehousing. Typically, we collect data from numerous source systems, structure it, and then make it available to drive business intelligence, dashboard reporting, and things like that. That's the main use of it.
We also do a little bit of moving of data from one system to another, but the data doesn't go into the warehouse. For instance, we sync the data from one of our line of business systems into our support help desk system so that it has extra information there. So, we do a few point-to-point transfers, but mainly, it is for centralizing data for data warehousing.
We use it just as a data integration tool, and we haven't found any problems. When we have big data processing, we use Amazon Redshift: Pentaho loads the data into Redshift, and Redshift handles the big data processing. We use Tableau for our reporting platform because we've got quite a number of users who are experienced in it. So, we use Pentaho for the data collection and data modeling aspects, such as developing facts and dimensions, then export that data to Redshift as the database platform, and we use Tableau as the reporting platform on top.
I am using version 8.3, which was the latest long-term support version when I looked at it the last time. Because this is something we use in production, and it is quite core to our operations, we've been advised that we just stick with the long-term support versions of the product.
It is in the cloud on AWS. It is running on an EC2 instance in AWS Cloud.
It enables us to create low-code pipelines without custom coding effort. A lot of transformations are quite straightforward because there are a lot of built-in connectors, which is really good. It has connectors to Salesforce, which make it very easy for us to wire up a connection to Salesforce and scrape all of that data into another table. Those flows have absolutely no code in them. It also has a Python integrator, and if you want to go into a coding environment, you have your choice of writing in Java or Python.
The creation of low-code pipelines is quite important. We have around 200 external data sets that we query and pull data from on a daily basis. The low-code environment makes it easier for our support function to maintain, because they can open up a transformation and very easily see what it is doing, rather than having to trawl through reams and reams of code. ETLs written purely in code become very difficult to trace very quickly; you spend a lot of time trying to unpick them, and they never get commented as well as you'd expect. With a low-code environment, the transformation is right there and almost documents itself, so it is much easier for somebody who didn't write the original transformation to pick it up later on.
We reuse various components. For instance, we might develop a transformation that does a lookup based on the domain name to match to a consumer record, and then we can repeat that bit of code in multiple transformations.
We have a metadata-driven framework. Most of what we do is metadata-driven, which is quite important because it allows us to describe all of our data flows: for example, table one moves to table two, table two moves to table three, and so on. Because we've got metadata that explains all of those steps, it helps people investigate where the data comes from and allows us to publish reports that show, "You've got this end metric here, and this is where the data that drives that metric came from." The variable substitution that Pentaho provides to support metadata-driven frameworks is definitely a key feature.
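As a hedged, plain-Python analogy of what such a metadata-driven flow looks like (this is not the reviewer's actual framework; the table names and SQL template are hypothetical), each metadata row describes one table-to-table move and a single template is parameterised per row, much as Pentaho substitutes ${VARIABLE} placeholders inside a reusable transformation:

```python
# Plain-Python analogy of a metadata-driven flow: every load is described by a
# metadata row, and one SQL template is parameterised per row. The rows and
# template here are hypothetical.
from string import Template

# Hypothetical metadata rows: source table, target table, business key.
FLOW_METADATA = [
    {"source": "stg_orders",    "target": "dim_orders",    "key": "order_id"},
    {"source": "stg_customers", "target": "dim_customers", "key": "customer_id"},
]

SQL_TEMPLATE = Template(
    "INSERT INTO ${target} SELECT * FROM ${source} s "
    "WHERE NOT EXISTS (SELECT 1 FROM ${target} t WHERE t.${key} = s.${key})"
)

def render_statements(metadata):
    # One statement per metadata row; the same template documents every flow.
    return [SQL_TEMPLATE.substitute(row) for row in metadata]

for stmt in render_statements(FLOW_METADATA):
    print(stmt)
```

Because every load is described by a metadata row rather than bespoke logic, the same rows double as documentation of the flows.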
The ability to automate data pipeline templates affects our productivity and costs. We run a lot of processes, and if it wasn't reliable, it would take a lot more effort; we would need a much bigger team to support the 200 integrations that we run every day. Because it is a low-code environment, we don't have to have support incidents escalated to third-line support to be investigated, which affects the cost. Very often our support analysts or more junior members are able to look into what an issue is and fix it themselves without having to escalate it to a more senior developer.
The automation of data pipeline templates affects our ability to scale the onboarding of data because after we've done a few different approaches and we get new requirements, they fit into a standard approach. It gives us the ability to scale with code and reuse, which also ties in with the metadata aspect of things. A lot of our intermediate stages of processing data are purely configured in metadata, so in order to implement transformation, no custom coding is required. It is really just writing a few lines of metadata to drive the process, and that gives us quite a big efficiency.
It has certainly reduced our ETL development time. I've worked at other places where a similar-sized team managed a system with far fewer integrations. We've certainly managed to scale Pentaho not just for the number of things we do but also for the type of things we do.
We do the obvious direct database connections, but there is a whole raft of different types of integrations that we've developed over time. We have REST APIs, and we download data from Excel files that are hosted in SharePoint. We collect data from S3 buckets in Amazon, and we collect data from Google Analytics and other Google services. We've not come across anything that we've not been able to do with Pentaho. It has proved to be a very flexible way of getting data from anywhere.
Our time savings are probably quite significant. By using some of the components that we've already got written, our developers are able to, for instance, put in a transformation from a staging area to its model data area. They are probably able to put something in place in an hour or a couple of hours. If they were starting from a blank piece of paper, that would be several days worth of work.
The graphical nature of the development interface is most useful because we've got people with quite mixed skills in the team. We've got some very junior, apprentice-level people, and we've got support analysts who don't have an IT background. It allows us to have quite complicated data flows and embed logic in them. Rather than having to trawl through lines and lines of code and try to work out what it's doing, you get a visual representation, which makes it quite easy for people with mixed skills to support and maintain the product. That's one side of it.
The other side is that it is quite a modular program. I've worked with other ETL tools, and it is quite difficult to get component reuse by using them. With tools like SSIS, you can develop your packages for moving data from one place to another, but it is really difficult to reuse a lot of it, so you have to implement the same code again. Pentaho seems quite adaptable to have reusable components or sections of code that you can use in different transformations, and that has helped us quite a lot.
One of the things Pentaho offers is a virtual web services ability to expose a transformation as if it were a database connection; for instance, when you have a REST API that you want to be read by something like Tableau, which needs a JDBC connection. Pentaho was really helpful in getting that driver enabled for us to do some proof-of-concept work on that approach.
Although it is a low-code solution with a graphical interface, the error messages you get are often of the type that only a developer would be happy with. You get a big stack of red text and Java errors displayed on the screen, and less technical people can be intimidated by that wall of red error messages. Other graphical tools aimed at the power-user level provide a much more user-friendly experience in dealing with exceptions and guiding the user to where they've made a mistake.
Some of the components have a great many options. Guidance embedded in the interface about when to use certain options would be good, so that people know what setting something will do and when they should use it. It is quite light on that aspect.
I have been using this solution since the beginning of 2016. It has been about seven years.
We haven't had any problems in particular that I can think of. It is quite a workhorse. It just sits there running reliably. It has got a lot to do every day. We have occasional issues of memory if some transformations haven't been written in the best way possible, and we obviously get our own bugs that we introduce into transformations, but generally, we don't have any problems with the product.
It meets our purposes. It does have horizontal scaling capability, but it is not something we have needed to use. We have lots of small and medium-sized data sets and don't have to deal with super-large data sets. Where we do have some requirements for that, it works quite well because we can push some of that processing down onto our cloud provider. We've dealt with some of those requirements by using S3, Athena, and Redshift; you can essentially offload some of the big data processing to those platforms.
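As a minimal sketch of what pushing that processing down might look like (assuming boto3 and suitable AWS credentials; the database, query, and S3 output location are placeholders, and this is not part of Pentaho itself):

```python
# Rough sketch of pushing a heavy aggregation down to Athena instead of doing
# it in the ETL engine. Assumes boto3 and AWS credentials are available; the
# database, query, and output location are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

def run_athena_query(sql: str) -> str:
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    qid = resp["QueryExecutionId"]
    # Poll until the query finishes; results land as CSV in the S3 output location.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(5)

print(run_athena_query("SELECT customer_id, COUNT(*) FROM page_views GROUP BY customer_id"))
```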
I've contacted them a few times. In terms of Lumada's ability to quickly and effectively solve issues that we brought up, we get a very good response rate. They provide very prompt responses and are quite engaging. You don't have to wait long, and you can get into a dialogue with the support team with back and forth emails in just an hour or so. You don't have to wait a week for each response cycle, which is something I've seen with some of the other support functions.
I would rate them an eight out of 10. We've got quite a complicated framework, so it is not possible for us to send the whole thing over for them to look into, but they certainly give help in terms of tweaks to server settings and memory configurations to try and get things going. We run a codebase that is quite big and quite complicated, so sometimes it is difficult to produce something that you can send over to show what the errors are. They won't log in and look at your actual environment; it has to be based on the log files, so it is a bit abstract. If something is occurring just on a very specific transformation that you've got, it might be difficult for them to drill into it to see why it is causing a problem on our system.
I have a little bit of experience with AWS Glue. Its advantage is that it is tied natively into the AWS PySpark processing. Its disadvantage is that it writes some really difficult-to-maintain lines of code for all of its transformations, which might work fine if you have just a dozen or so transformations, but if you have a lot of transformations going on, it can be quite difficult to maintain.
We've also got quite a lot of experience working with SSIS, and I much prefer Pentaho. SSIS ties you rigidly to the data flow structure that exists at design time, whereas Pentaho is very flexible. If, for instance, you wanted to move 15 columns to another table, in SSIS you'd have to configure that flow with your 15 columns, and if a 16th column appeared, it would break. With Pentaho, without amending your ETL, you can just amend your end data set to accept the 16th column, and it will flow through. This, and the fact that the transformation isn't tied down at design time, makes it much more flexible than SSIS, as the illustration below suggests.
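A small, hedged illustration of the difference being described, in plain Python rather than either tool's actual engine: when rows are written by matching column names against the target's column list, a new incoming column doesn't break the flow, and widening the target list is all that's needed to let it through. File and column names are hypothetical.

```python
# Illustration only: rows are mapped to the target by column name at run time,
# so an extra incoming column is ignored rather than breaking the flow. To let
# the new column through, you widen TARGET_COLUMNS (the "end data set") without
# touching the flow logic. File and column names are hypothetical.
import csv

TARGET_COLUMNS = ["order_id", "customer_id", "amount"]  # defined by the target table

def load_by_name(source_file: str, target_file: str) -> None:
    with open(source_file, newline="") as src, open(target_file, "w", newline="") as dst:
        reader = csv.DictReader(src)             # source columns discovered at run time
        writer = csv.DictWriter(dst, fieldnames=TARGET_COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)                 # unknown extra columns don't break the load

load_by_name("orders_export.csv", "orders_load.csv")
```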
In terms of component reuse, other ETL tools are not nearly as good at being able to just pick up a transformation or a sub-transformation and drop it into your pipelines. You do tend to keep rewriting things again and again to get the same functionality.
I was here during the initial setup, but I wasn't involved in it. We used an external company. They do our upgrades, etc. The reason for that is that we tend to stick with just the long-term support versions of the product. Apart from service packs, we don't do upgrades very often. We never get a deep experience of that, so it is more efficient for us to bring in this external company that we work with to do that.
It is always difficult to quantify a return on investment for data warehousing and business intelligence projects. It is a cost center rather than a profit center, but if you take the starting point as being that this is something that needs to be done, you could pick other tools to do it, and in the long run you would not necessarily find that they are much cheaper. If you went for more of a coded approach, it might be cheaper in terms of licensing, but then you might have higher costs of maintaining that.
It does seem a bit expensive compared to serverless product offerings. Tools such as SQL Server Integration Services are "almost" free with a database engine. It is comparable to products like Alteryx, which is also very expensive.
It would be great if we could use our enterprise license and distribute it to analysts and people around the business to use in place of Tableau Prep, etc., but its UI is probably a bit too confusing for that level of user. So, it doesn't allow us to get the tool distributed to non-technical users across the organization as widely as we would like.
I would advise taking advantage of using metadata to drive your transformations. You should take advantage of the very nice and easy way in which variable substitution works in a lot of components. If you use a metadata-driven framework in Pentaho, it will allow you to self-document your process flows. At some point, that always becomes a critical aspect of a project. Often it doesn't crop up until a year or so later, but somebody always comes asking for proof or documentation of exactly what is happening: how something is getting here and how something is driving a metric. So, if you start off from the beginning with a metadata framework that self-documents that, you'll be 90% of the way to answering those questions when you need to.
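As a minimal sketch of that self-documentation payoff (the flow edges below are hypothetical, and this is only an illustration of the idea, not Pentaho functionality): the same "this table feeds that table" metadata that drives the loads can be walked backwards to answer exactly the lineage questions described above.

```python
# Minimal lineage sketch: given the metadata rows that already drive the loads
# ("source feeds target"), walk backwards from a reporting table to list
# everything upstream of it. The edges here are hypothetical.
FLOWS = [
    ("salesforce_extract", "stg_accounts"),
    ("stg_accounts", "dim_accounts"),
    ("ga_extract", "stg_web_sessions"),
    ("stg_web_sessions", "fact_sessions"),
    ("dim_accounts", "fact_sessions"),
]

def upstream_of(target: str, flows=FLOWS) -> set:
    """Return every table that ultimately feeds `target`."""
    parents = {src for src, dst in flows if dst == target}
    result = set(parents)
    for p in parents:
        result |= upstream_of(p, flows)
    return result

print(sorted(upstream_of("fact_sessions")))
# ['dim_accounts', 'ga_extract', 'salesforce_extract', 'stg_accounts', 'stg_web_sessions']
```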
We are satisfied with our decision to purchase Hitachi's products, services, or solutions. In the low-code space, they're probably reasonably priced. With the serverless architectures out there, there is some competition, and you can do things differently using serverless architecture, which would have an overall lower cost of running. However, the fact that we have so many transformations that we run, and those transformations can be maintained by a team of people who aren't Python developers or Java developers, and our apprentices can use this tool quite easily, is an advantage of it.
I'm not too familiar with the overall roadmap for Hitachi Vantara. We're just using the Pentaho data integration products. We don't use the metadata injection aspects of Pentaho, mainly because we haven't had a need for them, but we know they're there.
I would rate it a seven out of 10. Its UI is a bit techy and more confusing than some of the other graphical ETL tools, and that's where improvements could be made.
We mainly use Lumada to load our operational systems into our data warehouse, but we also use it for monthly reporting out of the data warehouse, so it's to and from. We use some of Lumada's other features within the business to move data around. It's become quite the Swiss army knife.
We're primarily doing batch-type reports that go out. Not many people want to sift through the data themselves and join it to other things; there are a few, but again, I usually wind up doing it. The self-serve feature is not as big a seller to me because of our user base. Most of the people looking at it are salespeople.
Lumada has allowed us to interact with our employees more effectively and compensate them properly. One of the cool aspects is that we use it to generate commissions for our salespeople and bonuses for our warehouse people. It allows us to get information out to them in a timely fashion. We can also see where they're at and how they're doing.
The process that Lumada replaced was arcane. The sentiment among our employees, particularly the warehouse personnel, was that it was punitive. They would say, "I didn't get a bonus this month because the warehouse manager didn't like me." Now we can show them the numbers and say, "You didn't get a bonus because you were slacking off compared to everybody else." It's allowed us to be very transparent in how we're doing these tasks. Previously, that was all kept close to the vest. I want people to trust the numbers, and these tools allow me to do that because I can instantly show that the information is correct.
That is a huge win for us. When we first rolled it out, I spent a third of my time justifying the numbers. Now, I rarely have to do that. It's all there, and they can see it, so they trust what the information is. If something is wrong, it's not a case of "Why is this being computed wrong?" It's more like: "What didn't report?"
We have 200 stores that communicate to our central hub each night. If one of them doesn't send any data, somebody notices now. That wasn't the case in the past. They're saying, "Was there something wrong with the store?" instead of, "There's something wrong with the data."
With Lumada's single end-to-end data management, we no longer need some of the other tools that we developed in-house. Before that, everything was in-house. We had a build-versus-buy mentality. It simplified many aspects that we were already doing and made that process quicker. It has made a world of difference.
This is primarily anecdotal, but there were times where I'd get an IM from one of the managers saying, "I'm looking at this in the sales meeting and calling out what somebody is saying. I want to make sure that this is what I'm seeing." I made a couple of people mad. Let's say they're no longer working for us, and we'll leave it at that. If you're not making somebody mad, you're not doing BI right. You're not asking the right questions.
Having a single platform for data management experience is crucial for me. It lets me know when something goes wrong from a data standpoint. I know when a load fails due to bad data and don't need to hunt for it. I've got a status board, so I can say, "Everything looks good this morning." I don't have to dig into it, and that has made my job easier.
What's more, I don't waste time arguing about why the numbers on this report don't match the ones on another because it's all coming from the same place. Before, they were coming from various places, and they wouldn't match for whatever reason. Maybe there's some piece of code in one report that isn't being accounted for in the other. Now, they're all coming from the same place. So everything is on the same level.
Some of the scheduling features of Lumada drive me buggy. The one issue that always drives me up the wall is when Daylight Saving Time changes; it doesn't handle that elegantly. Every time it changes, I have to do something. It's not a big deal, but it's annoying. That's the one issue, but I see the limitation, and it might not be easily solvable.
I started working with Lumada long before it was acquired by Hitachi. It's been about 11 years now. I'm the primary person in the company who works with it. A few people know the solution tangentially. Aside from very basic elements, most tasks related to Lumada usually fall in my lap.
Lumada's stability and performance are pretty good. The limitations I run into are usually with the database that I'm trying to write to rather than read from. The only time I have a real issue is when an incredibly complex query takes 20 minutes to start returning data. It's sitting there going, "All right. Give me something to do." But then again, I've got it running on a machine that's got 64 gigs of memory.
Scaling out our processes hasn't been a big deal. We're a relatively small shop with only a couple of production databases. We're more of a regional enterprise, and I haven't had any issues with performance yet. It's always been some other product or solution that has gotten in the way. Lumada can handle anything we throw at it. Every night I run reports on our part ledger. That includes 200 million records, and Lumada can chew through it in about an hour and a half.
I know we can extend processing into the Spark realm if we need to. We've thought about that but never really needed it. It's something we keep in our back pocket. Someone suggested trying it out, but it never really got off the ground because other more pressing needs came up. From what I've seen, it'll scale out to whatever I need it to do. Any limitations are in the backend rather than the software. I've done some metrics on it. It's the database that I have to wait on more than the software. It's not doing a whole lot CPU-wise. My limitations are elsewhere, usually.
Right now, we have about 100 users working with Lumada. About 100 people log in to the system, but probably 200 people get reports from it. Only about 50 use the analysis tools, including the top sales managers and all of the buying group. There are also some analysts from various groups who use it constantly.
I'd give Lumada support a nine out of 10. It has been exceptional historically, but there was a rough patch about a year and a half ago shortly after Hitachi took over. They were in a transition period, but it has been very responsive since. I usually don't need help. When I do, I get a response the same day, and somebody's working on it. I'm not too worried about things going wrong, like an outage. I've never had that happen.
Sometimes when we do upgrades, and I'm in my test environment, I'll contact them and say, "I ran into this weird issue, and it's not doing what it should. What do you make of it?" They'll tell me, "You got to do this, that, and the other thing." They've been good about it.
Before Lumada, we had a variety of homegrown solutions. Most of it was centered on our warehouse management system because that was our primary focus. There were also reports within the point of sale system, and the two never crossed paths. Now they're integrated. There was also an analysis tool they had before I came on board. I can't remember the name of it. The company had something, but it didn't do what they thought it would do, and the project fizzled.
Part of the problem was that they didn't have somebody in-house who understood business intelligence until they brought me on. They were very operationally focused before that. The management was like, "We need more insight into what we're doing and how we're doing it." That was phase two of the big data warehouse push. The management here is relatively conservative in that regard, so they're somewhat slow to say, "Hey. We need to do something along these lines." But when they decide to go, get out of the way because here we come.
I used a different tool at my previous job called Informatica. Lumada has less of a learning curve for deployment. Lumada was similar enough to Informatica that it's like, "Okay. This makes sense," but there were a few differences. Once I figured out the difference, it made a lot of sense to me. The entire chain of steps Lumada allows you to do is intuitive.
Informatica was a lot more tedious to use. You had to hook every column up from its source to its target. With Lumada, it's the name that matters and its position. It made aspects a whole lot easier and less tedious. Every so often, it bites me in the butt. If I get a column out of order, it'll let me know I did something wrong. But it's much less error-prone because I don't have to hook every column up from its source to its target anymore. With Informatica, there were times where I spent 20 minutes just sitting there trying not to drool on myself. It was terrible.
Setting up Lumada was pretty straightforward. We just rolled it out and went from proof of concept to live in about a year. I was relatively new to the organization at the time and was still getting a feel for it — knowing where data was and what all these things mean. My experience at a shoe company didn't exactly translate to an auto parts business. I went to classes down in Orlando to learn the product, then we went from there and just tried it. We had a few faux pas here and there, but we knew.
Lumada has also significantly reduced our ETL development time. It depends on the project, but if someone comes to me with a new data source, I can typically integrate it within a week, whereas it used to take a month. It's a 4-to-1 reduction. It's allowed our IT department to stay lean. I worked at another company with 70 IT people, 50 of whom were programmers. My current workplace has 12 people, and six are programmers. The others are UI-type developers, and there are about six database people, including me. We save the equivalent of a full-time employee, so that's anywhere from $50,000 to $75,000 a year.
I think Lumada's price is fair compared to some of the others, like BusinessObjects, which was the other solution I used at my previous job. BusinessObjects' price was more reasonable before SAP acquired it; they jacked the price up significantly. Oracle's OBIEE tool was also prohibitively expensive. We felt the value was much greater than the cost, and the value for the money was much better than if we had gone with other solutions.
We didn't consider other options besides Lumada because we are members of an auto parts trade association, and they were using the Pentaho tool before it was Hitachi to do some ETL tasks. They recommended it, so we started using it. I evaluated a couple of other ones, but they cost more than we were willing to spend to try out this type of solution. Once we figured out what it could do for us, then it's like, "Okay. Now, we can do some real work here."
I rate Lumada nine out of 10. The aspect I like about Lumada is its flexibility. I can make it do pretty much whatever I want. It's not perfect, but I haven't run into a tool that is yet. I haven't used every aspect of it, but there's very little that I can't make it do. I haven't run into a scenario where it couldn't handle a challenge we put in front of it. It's been a solid performer for us. I rarely have a problem that is due to Lumada. The issues I have with my loads are never because of the software.
If you plan to implement Lumada, I recommend going to the classes. Don't be afraid to ask dumb questions of support because many of them used to be consultants. They've all been there, done that. One of the guys I talk to regularly lives about 80 miles to the north of me. I have a rapport with him. They're willing to go above and beyond to make you successful.
We use Pentaho for small ETL integration jobs and cross-storage analytics. It's nothing too major. We have it deployed on-premise, and we are still on the free version of the product.
In our case, processing takes place on the virtual machine where we installed Pentaho. We can ingest data from different on-premises and cloud locations, but we still don't carry out the data processing phase in any environment other than the one where the VM is running.
At the start of my team's journey at the company, it was difficult to do cross-platform storage analytics, meaning ingesting data from different sources into a single storage layer and building out KPIs and other analytics.
Pentaho was a good start because we can create different connections and import data. We can then do some global queries on that data from various sources. We've been able to replace some of our other data tools like Talend for our managing data warehouse workflow. Later, we adopted some other cloud technologies, so we don't primarily use Pentaho for those use cases anymore.
Pentaho is flexible with a drag-and-drop interface that makes it easier to use than some other ETL products. For example, the full stack we are using in AWS does not have drag-and-drop functionality. Pentaho was a good option at the start of this journey.
We can schedule job execution in the BA Server, which is the front-end product we're using right now. That scheduling interface is nice.
It's difficult to use custom code. Implementing a pipeline with pre-built blocks is straightforward, but it's harder to insert custom code inside the pre-built blocks. The web interface is rusty, and the biggest problem with Pentaho is debugging and troubleshooting. It isn't easy to build the pipeline incrementally. At least in our case, it's hard to find a way to execute step by step in the debugging mode.
Repository management is also a shortcoming, but I'm not sure if that's just a limitation of the free version. I'm not sure if Pentaho can use an external repository. It's a flat-file repository inside a virtual machine. Back in the day, we would want to deploy this repository on a database.
Pentaho's data management covers ingestion and insights but I'm not sure if it's end-to-end management—at least not in the free version we are using—because some of the intermediate steps are missing, like data cataloging and data governance features. This is the weak spot of our Pentaho version.
We implemented Hitachi Pentaho some time ago. We have been using it for around five or six years. I was using the product at the time, but now I am the head of the data engineering team, so I don't use it anymore but I know Pentaho's strengths and weaknesses.
Pentaho is relatively stable, but I average about one failed job every month.
I rate Pentaho six out of 10 for scalability. The scalability depends on how you deploy it. In our case, the on-premise virtual machine is relatively small and doesn't have a lot of resources. That is why Pentaho does not handle big datasets well in our case.
I'm also unsure if we can deploy Pentaho in the cloud. So when you're not dealing with the cloud, scalability is always limited. We cannot indefinitely pump resources into a virtual machine.
Currently, we have five or six active workflows running each night. Some of them are ingesting data from ADU. Others take data from AWS Redshift or on-premise Oracle. In terms of people, three other people on the data engineering team and I are actively using Pentaho.
We used Talend, which is a Java-based solution and is made for people with proficiency in Java. The entire analytics ecosystem is transitioning to more flexible runtimes, including Python and other languages. Java was not ideal for our data analytics journey.
Right now, we are using NiFi, a tool in the cloud ecosystem that has a similar drag-and-drop interface, but it's embedded in the ADU framework. We're also using another drag-and-drop tool on AWS, but not AWS Glue Studio.
We've seen a 50 percent reduction in our ETL development time using the free version of Pentaho. That saves about 1,000 euros per week, so at least 50,000 euros annually.
I rate Pentaho eight out of 10. It's a perfect pick for data teams that are getting started and more business-oriented data teams. It's good for a data analyst who isn't so tech-savvy. It is flexible and easy to use.
We are a service delivery enterprise, and we have different use cases. We deliver solutions to other enterprises, such as banks. One of the use cases is for real-time analytics of the data we work with. We take CDC data from Oracle Database, and in real-time, we generate a product offer for all the products of a client. All this is in real-time. The client could be at the ATM or maybe at an agency, and they can access the product offer.
We also use Pentaho within our organization to integrate all the documents and Excel spreadsheets from our consultants and to have a dashboard of the hours spent on different projects.
In terms of version, currently, Pentaho Data Integration is on version 9, but we are using version 8.2. We have all the versions, but we work with the most stable one.
In terms of deployment, we have two different types of deployments. We have on-prem and private cloud deployments.
I work with a lot of data. We have about 50 terabytes of information, and working with Pentaho Data Integration along with other databases is very fast.
Previously, I had three people to collect all the data and integrate all Excel spreadsheets. To give me a dashboard with the information that I need, it took them a day or two. Now, I can do this work in about 15 minutes.
It enables us to create pipelines with minimal manual coding or custom coding efforts, which is one of its best features. Pentaho is one of the few tools with which you can do anything you can imagine. Our business is changing all the time, and it is best for our business if I can use less time to develop new pipelines.
It provides the ability to develop and deploy data pipeline templates once and reuse them. I use them at least once a day. It makes my daily life easier when it comes to data pipelines.
Previously, I have used other tools such as Integration Services from Microsoft, Data Services for SAP, and Informatica. Pentaho reduces the ETL implementation time by 5% to 50%.
Pentaho from Hitachi is a suite of different tools. Pentaho Data Integration is a part of the suite, and I love the drag-and-drop functionality. It is the best.
Its drag-and-drop interface lets me and my team implement all the solutions that we need in our company very quickly. It's a very good tool for that.
Their client support is very bad. It should be improved. There is also not much information on Hitachi forums or Hitachi web pages. It is very complicated.
In terms of the flexibility to deploy in any environment, such as on-premise or in the cloud, we can do the cloud deployment only through virtual machines. We might also be able to work on different environments through Docker or Kubernetes, but we don't have an Azure app or an AWS app for easy deployment to the cloud. We can only do it through virtual machines, which is a problem, but we can manage it. We also work with Databricks because it works with Spark. We can work with clustered servers, and we can easily do the deployment in the cloud. With a right-click, we can deploy Databricks through the app on AWS or Azure cloud.
I have been using Pentaho Data Integration for 12 years. The first version that I tested and used was 3.2 in 2010.
Their technical support is not good. I would rate them 2 out of 10 because they don't have good technical skills to solve problems.
It is very quick and simple. It takes about five minutes.
I have a good knowledge of this solution, and I would highly recommend it to a friend or colleague.
It provides a single, end-to-end data management experience from ingestion to insights, but we have to create different pipelines to generate the metadata management. It's a little bit laborious to work with Pentaho, but we can do that.
I've heard a lot of people say it's complicated to use, but Pentaho is one of the few tools where you can do anything you can imagine. It is very good and quite simple, but you need to have the right knowledge and the right people to handle the tool. The skills needed to create a business intelligence solution or a data integration solution with Pentaho are problem-solving logic and maybe database knowledge. You can develop new steps, and you can develop new functionality in Pentaho Lumada, but you must have the knowledge of advanced Java programming. Our experience, in general, is very good.
Overall, I am satisfied with our decision to purchase Hitachi's product services and solutions. My satisfaction level is at an eight out of ten.
I am not much aware of the roadmap of Hitachi Vantara. I don't read much about that.
I would rate this solution an eight out of ten.
My work primarily revolves around data migration and data integration for different products. I have used these tools in different companies, but for most of our use cases, we use it to integrate all the data that needs to flow into our product. We also handle outbound data from our product when we need to send it to various integration points. We use this product extensively to build ETLs for those use cases.
We are developing ETLs for the inbound data into the product as well as outbound to various integration points. Also, we have a number of core ETLs written on this platform to enhance our product.
We have two different modes that we offer: one is on-premises and the other is on the cloud. On the cloud, we have an EC2 instance on AWS; the tool is installed on that EC2 instance and we call it the ETL server. We also have another server for the application where the product is installed.
We use version 8.3 in the production environment, but in the dev environment, we use version 9 and onwards.
We have been able to reduce the effort required to build sophisticated ETLs. Also, we now are in the migration phase from an on-prem product to a cloud-native application.
We use Lumada’s ability to develop and deploy data pipeline templates once and reuse them. This is very important. When the entire pipeline is automated, we do not have any issues in respect to deployment of code or with code working in one environment but not working in another environment. We have saved a lot of time and effort from that perspective because it is easy to build ETL pipelines.
The metadata injection feature is the most valuable because we have used it extensively to build frameworks that dynamically generate code based on different configurations. If you want to make a change, you do not need to touch the actual code; you just make some configuration changes and the framework dynamically generates the code as per your configuration.
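Pentaho's metadata injection works by injecting field and step metadata into a template transformation at run time. As a rough analogy of the underlying idea in plain Python (the configuration and field names are hypothetical, and this is not the Pentaho step itself), the logic stays fixed while per-source configuration supplies the mapping details, so onboarding a change means editing configuration rather than code:

```python
# Rough analogy of configuration-driven transformation: the mapping logic is a
# fixed "template", and a per-source configuration supplies the field-level
# metadata, so a new feed means editing configuration, not code.
# Config values are hypothetical.
CONFIG = {
    "legacy_orders": {
        "rename": {"ord_no": "order_id", "cust": "customer_id"},
        "defaults": {"currency": "USD"},
    },
    "web_orders": {
        "rename": {"id": "order_id", "buyer_id": "customer_id"},
        "defaults": {"currency": "EUR"},
    },
}

def transform(record: dict, source: str) -> dict:
    cfg = CONFIG[source]
    out = {cfg["rename"].get(k, k): v for k, v in record.items()}
    for field, value in cfg["defaults"].items():
        out.setdefault(field, value)
    return out

print(transform({"ord_no": 17, "cust": "C-9"}, "legacy_orders"))
# {'order_id': 17, 'customer_id': 'C-9', 'currency': 'USD'}
```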
We have a UI where we can create our ETL pipelines as needed, which is a key advantage for us. This is very important because it reduces the time to develop for a given project. When you need to build the whole thing using code, you need to do multiple rounds of testing. Therefore, it helps us to save some effort on the QA side.
Hitachi Vantara's roadmap has a pretty good list of features that they have been releasing with every new version. For instance, in version 9, they have included metadata injection for some of the steps. The most important elements of this roadmap to our organization’s strategy are the data-driven approach that this product is taking and the fact that we have a very low-code platform. Combining these two is what gives us the flexibility to utilize this software to enhance our product.
It could be better integrated with programming languages, like Python and R. Right now, if I want to run Python code in one of my ETLs, it is a bit difficult to do. It would be great if we had some modules where we could code directly in Python. We don't really have a way to run Python code natively.
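In the absence of a native Python step in this setup, one common style of workaround is to call a small standalone script from the job and exchange files with it. The sketch below shows what such a script might look like; the file names and scoring rule are hypothetical, and this is a workaround pattern, not a product feature.

```python
# Sketch of a standalone script an ETL job might call as a workaround when
# there is no native Python step: read a file the pipeline has produced, apply
# some Python-only logic, and write a file the pipeline picks back up.
# File names and the scoring rule are hypothetical.
import csv
import sys

def score(row: dict) -> float:
    # Placeholder for logic that is easier in Python than in ETL steps.
    return round(float(row["amount"]) * 0.1, 2)

def main(in_path: str, out_path: str) -> None:
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["score"])
        writer.writeheader()
        for row in reader:
            row["score"] = score(row)
            writer.writerow(row)

if __name__ == "__main__":
    # e.g. python score_orders.py staged_orders.csv scored_orders.csv
    main(sys.argv[1], sys.argv[2])
```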
I have been working with this tool for five to six years.
They are making it a lot more stable. Earlier, stability used to be an issue when it was not with Hitachi. Now, we don't see those kinds of issues or bugs within the platform because it has become far more stable. Also, we see a lot of new big data features, such as connecting to the cloud.
Lumada is flexible to deploy in any environment, whether on-premises or the cloud, which is very important. When we are processing data in batches on certain days, e.g., at the end of the week or month, we might have more data and need more processing power or RAM. However, most times, there might be very minimal usage of that CPU power. In that way, the solution has helped us to dynamically scale up, then scale down when we see that we have more data that we need to process.
The scalability is another key advantage of this product versus some of the others in the market since we can tweak and modify a number of parameters. We are really impressed with the scalability.
We have close to 80 people who are using this product actively. Their roles go all the way from junior developers to support engineers. We also have people who have very little coding knowledge and are more into the management side of things utilizing this tool.
I haven't been part of any technical support discussions with Hitachi.
We are very satisfied with our decision to purchase Hitachi's product. Previously, we were using another ETL service that had a number of limitations. It was not a modern ETL service at all. For anything, we had to rely on another third-party software. Then, with Hitachi Lumada, we don't have to do that. In that way, we are really satisfied with the orchestration or cloud-native steps that they offer. We are really happy on those fronts.
We were using something called Actian Services, which had fewer features and ended up costing more than the Enterprise edition of Pentaho.
We could not do a number of things on Actian. For instance, we were unable to call other APIs or connect to an S3 bucket. It was not a very modern solution. Whereas, with Pentaho, we could do all these things as well as have great marketplaces where we could find various modules and third-party plugins. Those features were simply not there in the other tool.
The initial setup was pretty straightforward.
We did not have any issues configuring it, even in my local machine. For the enterprise edition, we have a separate infrastructure team doing that. However, for at least the community edition, the deployment is pretty straightforward.
We have seen at least 30% savings in terms of effort. That has helped us to price our service and products more aggressively in the market, helping us to win more clients.
It has reduced our ETL development time. Per project, it has reduced by around 30% to 35%.
We can price more aggressively. We were actually able to win projects because we had great reusability of ETLs. A code that was used for one client can be reused with very minimal changes. We didn't have any upfront cost for kick-starting projects using the Community edition. It is only the Enterprise edition that has a cost.
For most development tasks, the Community edition should be sufficient. It depends on the type of support that you require for your production environment.
We did evaluate SSIS since our database is based on Microsoft SQL Server. SSIS comes with any purchase of a SQL Server license. However, even with SSIS, there were some limitations. For example, if you want to build a package and reuse it, SSIS doesn't provide the same kinds of abilities that Pentaho does. The amount of reusability drops when we try to build the same thing using SSIS, whereas in Pentaho we could literally reuse the same code by using some of its features.
SSIS comes with SQL Server and is easier to maintain, given that far more people have knowledge of SSIS. However, if I want to do PGP encryption or make an API connection, it is difficult, and creating a reusable package is not easy. Those would be the cons for SSIS.
The query performance depends on the database. It is more likely to be good if you have a good database server with all the indexes and bells and whistles of a database. However, from a data integration tool perspective, I am not seeing any issues with respect to query performance.
We do not build visualization features that much with Hitachi. For reporting purposes, we have been using one of the tools from the suite and preparing the data accordingly.
We use this for all the projects that we are currently running. Going forward, we will be sticking only to using this ETL tool.
We haven't had any roadblocks using Lumada Data Integration.
On a scale of one to 10, I would recommend Hitachi Vantara to a friend or colleague as a nine.
If you need to build ETLs quickly in a low-code environment and don't want to spend a lot of time on the development side of things, then even if it is a little difficult to find resources, train them in this product. It is always worth the effort because it ends up saving a lot of time and resources on the development side of projects.
Overall, I would rate the product as a nine out of 10.
Our primary use case is to populate a data warehouse and data marts, but we also use it for all kinds of data integration scenarios and file movement. It is almost like middleware between different enterprise solutions. We take files from our legacy app system, do some work on them, and then call SAP BAPIs, for example.
It is deployed on-premises. It gives you the flexibility to deploy it in any environment, whether on-premises or in the cloud, but this flexibility is not that important to us. We could deploy it on the cloud by spinning up a new server in AWS or Azure, but as a manufacturing facility, it is not important to us. Our customer preference is primarily to deploy things on-premises.
We usually stay one version behind the latest one. We're a manufacturing facility. So, we're very sensitive to any bugs or issues. We don't do automatic upgrades. They're a fairly manual process.
We've had it for a long time. So, we've realized a lot of the improvements that anybody would realize from almost any data integration product.
The speed of developing solutions has been the best improvement. It has reduced the development time and improved the speed of getting solutions deployed. The reduction in ETL development time varies by the size and complexity of the project, but we probably spend days or weeks less than if we were using a different tool.
It is tremendously flexible in terms of adding custom code by using a variety of different languages if you want to, but we had relatively few scenarios where we needed it. We do very little custom coding. Because of the tool we're using, it is not critical. We have developed thousands of transformations and jobs in the tool.
It makes it pretty simple to do some fairly complicated things. Both I and some of our other BI developers have made stabs at using, for example, SQL Server Integration Services, and we found them a little bit frustrating compared to Data Integration. So, its ease of use is right up there.
Its performance is a pretty close second. It is a pretty highly performant system. Its query performance on large data sets is very good.
Its basic functionality doesn't need a whole lot of change. There could be some improvement in the consistency of the behavior of different transformation steps. The software did start as open-source and a lot of the fundamental, everyday transformation steps that you use when building ETL jobs were developed by different people. It is not a seamless paradigm. A table input step has a different way of thinking than a data merge step.
We have been using this solution for more than 10 years.
Its stability is very good.
Its scalability is very good. We've been running it for a long time, and we've got dozens, if not hundreds, of jobs running a day.
We probably have 200 or 300 people using it across all areas of the business. We have people in production control, finance, and what we call materials management. We have people in manufacturing, procurement, and of course, IT. It is very widely and extensively used. We're increasing its usage all the time.
They are very good at quickly and effectively solving the issues we have brought up. Their support is well structured. They're very responsive.
Because we're very experienced in it, when we come to them with a problem, it is usually something very obscure and not necessarily easy to solve. We've had cases where when we were troubleshooting issues, they applied just a remarkable amount of time and effort to troubleshoot them.
Support seems to have very good access to development and product management at the second tier. So, it is pretty good. I would give their technical support an eight out of ten.
We didn't have another data integration product before Pentaho.
I installed it. It was straightforward. It took about a day and a half to get the production environment up and running. That was probably because I was e-learning as I was going. With a services engagement, I bet you would have everything up in a day.
We used Pentaho services for two days. Our experience was very good. We worked with Andy Grohe. I don't know if he is still there or not, but he was excellent.
We have absolutely seen an ROI, but I don't have the metrics. There are analytic cases that we just weren't able to do before. Due to the relatively low cost compared to some of the other solutions out there, it has been a no-brainer.
We did a two or three-year deal the last time we did it. As compared to other solutions, at least so far in our experience, it has been very affordable. The licensing is by component. So, you need to make sure you only license the components that you really intend to use.
I am not sure if we have relicensed after the Hitachi acquisition, but previously, multi-year renewals resulted in a good discount. I'm not sure if this is still the case.
We've had the full suite for a lot of years, and there is just the initial cost. I am not aware of any additional costs.
If you haven't used it before, it is worth engaging services with Pentaho for initial implementation. They'll just point out a number of small foibles related to perhaps case sensitivity. They'll just save you a lot of runs through the documentation to identify different configuration points that might be relevant to you.
I would highly recommend the Data Integration product, particularly for anyone with a Java background. Most of our BI developers at this point do not have a Java background, which isn't really that important. Particularly, if you're a Java business and you're looking for extensibility, the whole solution is built in Java, which just makes certain aspects of it a little more intuitive at first.
On the data integration side, it is really a good tool. A lot of investment dollars go into big data and new tech, and often, those are not very compelling for us. We're in an environment where we have medium data, not big data.
It provides a single end-to-end data management experience from ingestion to insights, but at this point, that's not critical to us. We mostly do the data integration work in Pentaho, and then we do the visualization in another tool. The single data management experience hasn't enabled us to discontinue other data management, analysis, or delivery tools, simply because we didn't really have any to begin with.
We take an existing job or transformation and use that as a test. It is certainly easy enough to copy one object to another. I am not aware of a specific templating capability, but we are not really missing anything there. It is very easy for us to clone a job or transformation just by doing a Save As, and we do that extensively.
Vantara's roadmap is a little fuzzy for me. There has been quite a bit of turnover in the customer-facing roles over the last five years. We understand that there is a roadmap to move to a pure web-based solution, but it hasn't been well communicated to us.
In terms of our decision to purchase Hitachi's products, services, or solutions, our satisfaction level is, on balance, average.
I would rate this solution a seven out of ten.