BI developer at an insurance company with 1,001-5,000 employees
Offers features for data integration and migration
Pros and Cons
- "The product is user-friendly and intuitive"
- "The solution offers features for data integration and migration. Pentaho Data Integration and Analytics allows the integration of multiple data sources into one. The product is user-friendly and intuitive to use for almost any business."
- "Should provide additional control for the data warehouse"
What is our primary use case?
I have used the solution to gather data from multiple sources, including APIs, databases such as Oracle, and web servers. A number of data providers can supply datasets for export in JSON format from cloud services or APIs.
What is most valuable?
The solution offers features for data integration and migration. Pentaho Data Integration and Analytics allows the integration of multiple data sources into one. The product is user-friendly and intuitive to use for almost any business.
What needs improvement?
The solution should provide additional control over the data warehouse and reduce its size, as our organization's clients have expressed concerns about it. The vendor could focus on reducing capacity and compensate by enhancing product efficiency.
For how long have I used the solution?
I have been using Pentaho Data Integration and Analytics for a year.
How are customer service and support?
I have never encountered any issues with Pentaho Data Integration and Analytics.
What's my experience with pricing, setup cost, and licensing?
I believe the solution's pricing is more affordable than that of its competitors.
Which other solutions did I evaluate?
I have worked with IBM DataStage along with Pentaho Data Integration and Analytics. I found the IBM DataStage interface outdated in comparison to the Pentaho tool. IBM DataStage requires the user to drag and drop the services as well as the pipelines, similar to the process in SSIS platforms. Pentaho Data Integration and Analytics is also easier to comprehend on first use than IBM DataStage.
What other advice do I have?
The solution's ETL capabilities make data integration tasks easier and are used to export data from a source to a destination. At my company, I am using IBM DataStage and the overall IBM tech stack for compatibility among the integrations, pipelines, and user levels.
I would absolutely recommend Pentaho Data Integration and Analytics to others. I would rate the solution a seven out of ten.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Senior Data Engineer at a financial services firm with 201-500 employees
Low-code makes development faster than with Python, but there were caching issues
Pros and Cons
- "The fact that it's a low-code solution is valuable. It's good for more junior people who may not be as experienced with programming."
- "If you're working with a larger data set, I'm not so sure it would be the best solution. The larger things got the slower it was."
What is our primary use case?
We used it for ETL to transform data from flat files, CSV files, and databases. We used PostgreSQL for the connections, and then we would either import the data into our database if it was coming in from clients, or export it to files if clients wanted files or if a vendor needed to import the files into their database.
How has it helped my organization?
The biggest benefit is that it's a low-code solution. When you hire junior ETL developers or engineers, who may have a schooling background but no real experience with ETL or coding for ETL, it's a UI-based, low-code solution in which they can make something happen within weeks instead of, potentially, months.
Because it's low-code, while I could technically have done everything in Python alone, that would definitely have taken longer than using Pentaho. In addition, by being able to standardize pipelines to handle the onboarding process for new clients, development costs were significantly reduced. To put it in perspective, prior to my leading the effort to standardize things, it would typically take about a week to build a feed from start to finish, and sometimes more depending on how complicated it was. With this solution, instead of taking a week, it was reduced to an afternoon, or about three hours. That was a significant difference.
Instead of paying a developer a full week's worth of work, which could be $2,500 or more, it cut it down to three hours or about $300. That's a big difference.
What is most valuable?
The fact that it's a low-code solution is valuable. It's good for more junior people who may not be as experienced with programming. In our case, we didn't have a huge data set. We had small and medium-sized data sets, so it worked fine.
The fact that it's open source is also helpful in that, if a junior engineer knows they are going to use it in a job, they can download it themselves, locally, for free, and use test data to learn it.
My role was to use it to write one feed that could facilitate multiple clients. Given that it was an open-source, free solution, it was pretty robust in what it could do. I could make lookup tables and databases and map different clients, and I could use the same feed for 30 clients or 50 clients. It got the job done for our use case.
In addition, you can install it wherever you need it. We had installed versions in the cloud and I also had local versions.
What needs improvement?
If you're working with a larger data set, I'm not so sure it would be the best solution. The larger things got, the slower it was.
It was kind of buggy sometimes. And when we ran the flow, it didn't go from a perceived start to end, node by node. Everything kicked off at once. That meant there were times when it would get ahead of itself and a job would fail. That was not because the job was wrong, but because Pentaho decided to go at everything at once, and something would process before it was supposed to. There were nodes you could add to make sure that, before this node kicks off, all these others have processed, but it was a bit tedious.
There were also caching issues, and we had to write code to clear the cache every time we opened the program, because the cache would fill up and it wouldn't run. I don't know how hard that would be for them to fix, or if it was fixed in version 10.
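To give a sense of the workaround, here is a minimal sketch of the kind of cache-clearing script we relied on. It assumes the cache lives in the default ~/.kettle directory and that stale files match db.cache* (file names vary by PDI version); the Spoon install path is hypothetical:

```python
import glob
import os
import subprocess

# Assumption: Kettle keeps its database cache under ~/.kettle
# (file names such as db.cache* vary by PDI version).
KETTLE_DIR = os.path.expanduser("~/.kettle")

def clear_db_cache():
    """Delete cached database metadata so Spoon starts clean."""
    for path in glob.glob(os.path.join(KETTLE_DIR, "db.cache*")):
        os.remove(path)
        print(f"Removed stale cache file: {path}")

if __name__ == "__main__":
    clear_db_cache()
    # Launch Spoon afterwards; adjust the path to your install.
    subprocess.run(["/opt/pentaho/data-integration/spoon.sh"], check=False)
```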
Also, the UI is a bit outdated, but I'm more of a fan of function over how something looks.
One other thing that would have helped with Pentaho is better documentation and support on the internet: how to do things and how to set it up. There are some sites on how to install it, and Pentaho does have a help repository, but it wasn't always the most useful.
For how long have I used the solution?
I used Hitachi Lumada Data Integration (Pentaho) for three years.
What do I think about the stability of the solution?
In terms of the stability of the solution, as I noted, I wouldn't use it for large data sets. But for small to midsize companies that are looking for a low-code solution that isn't going to break the budget, it's a great tool for them to use.
It worked and it was stable enough, once we figured out the little quirks and how to get around them. It mostly handled our production workflows without issue.
What do I think about the scalability of the solution?
I think it could scale, but only up to a point. I didn't test it on larger datasets. But after talking to people who have worked on larger datasets, they wouldn't recommend using it, but that is hearsay.
In my former company, there were about five people in the data engineering department who were using the solution in their roles as ETL/data integration specialists.
In that company, it's their go-to solution and I think it will work for everything that they need. When I was there, I tried opening pathways to different things, but there were so many feeds already on it, and it worked for what they need, and it's low-code and open source, so I think they'll stick with it. As they gain more clients they'll increase their usage of it.
How was the initial setup?
The initial setup wasn't that complicated. You have to set the job environment variables and that was probably the most complicated part, and would be especially so if you're not familiar with it. Otherwise, it was just a matter of downloading the version needed, installing it, and learning how to use the different components. Overall, it was pretty easy and straightforward.
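For context, PDI reads those variables from a kettle.properties file in the .kettle directory. The sketch below shows one way to generate that file; the variable names are purely illustrative examples, not required settings:

```python
import os

# Assumption: PDI picks up variables from ~/.kettle/kettle.properties
# (or $KETTLE_HOME/.kettle/kettle.properties). The keys below are
# hypothetical examples, not required settings.
kettle_dir = os.path.expanduser("~/.kettle")
os.makedirs(kettle_dir, exist_ok=True)

properties = {
    "DB_HOST": "db.example.internal",  # hypothetical connection settings
    "DB_PORT": "5432",
    "INPUT_DIR": "/data/incoming",     # referenced in jobs as ${INPUT_DIR}
    "OUTPUT_DIR": "/data/outgoing",
}

with open(os.path.join(kettle_dir, "kettle.properties"), "w") as f:
    for key, value in properties.items():
        f.write(f"{key}={value}\n")
```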
The first time we deployed it, not knowing what we were doing, it took a couple of days, but that was mainly troubleshooting and figuring out what we were doing wrong because we hadn't used it before. After that, it would take maybe 30 minutes or an hour.
In terms of maintenance for Pentaho, one developer per feed is what is typically assigned. It will depend on the workflow of the company and how many feeds are needed. In our case there were five people involved.
What was our ROI?
It saved us a lot of money. Given that it's open source, the amount of time I used it over those three years, and the fact that they were using it for several years prior, a lot of money was definitely saved by using Pentaho versus something else.
What's my experience with pricing, setup cost, and licensing?
If a company is looking for an ETL solution and wants to integrate it with their tech stack but doesn't want to spend a bunch of money, Pentaho is a good solution. SSIS cores were $10,000 apiece. Although I don't know what they cost nowadays, they're expensive.
Pentaho is a nice option without having to pay an arm and a leg. We even had a complicated data set and Pentaho was able to handle pretty much every type of scenario, if we thought about it creatively enough. I would recommend it for a company in that position.
Which other solutions did I evaluate?
While the capabilities of Pentaho are good enough for light work, I've started using Alteryx Designer, and it is so much more robust in everything that you can do in real time. I've also used SSIS.
When you run something in Pentaho, you can click on it to see the output of each one, but it's hard to really change anything. For example, if I were to query data from a database and put it into a "select," if I wanted to reorganize within the select based on something like the first initial of someone's name, it provided that option. But when I would do it, sometimes it would throw an error and I'd have to run the feed again to see it.
The nodes, or the components, in Pentaho can probably do about 70 percent of what you can do in Alteryx. Don't get me wrong, Pentaho worked for what we needed it for, with just a few quirks. But as a data engineer, I'm always interested in and excited to work with new technologies that may offer different benefits. In this case, one of the benefits is that each node in Alteryx has many more capabilities in real time. I can look at the data that's coming into the node and the data that's going out. There was a way to do that in Pentaho, if you right-clicked and looked, but it would tell you the fields that were coming in and out and not necessarily the data. It's nice to be able to troubleshoot, on the spot, node-by-node, if you're having an issue. You can do that easily with Alteryx.
In addition to being able to look at data coming in and out of the node, you can also sort it easily and filter it within each data node in Alteryx, and that is something you can't do in Pentaho.
Another cool thing with Alteryx, although it's a very small difference, is that you don't have to save the workflow before you run it. Pentaho forces you to do that. Of course, it's always good to save.
What other advice do I have?
A good thing about Pentaho is that it's not that hard to learn, from an ETL perspective. The way Pentaho lays things out in the panel is pretty intuitive: your input (flat file, CSV, or database) and then the transformation nodes.
It was a good baseline and a good open-source tool to use to learn ETL. It's good to have exposure to multiple tools because every company has different needs and, depending on their needs, it would be a different recommendation.
The lessons I learned using it: Make sure you clear the cache when you open the program. Also, if there are any critical points in your flow that are dependent upon previous nodes, make sure that you put blocking steps in. Make sure you also set up the job environment variables correctly, so that Pentaho runs.
It worked for what we did but, personally, I wouldn't use it. In the new company I'm working for, we are using large financial data sets and I'm not so sure it could handle that. I know there's an Enterprise version, but I didn't use that.
The solution can handle ingestion through to export, but you still have to have a batch or Python script to run it with an automation process. I don't know if the Lumada version has something different, but with what I was using, you were simply building the pipeline, but the pipeline outside of the program had to be scheduled and run, and we had other tools to check that the output was as expected.
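As an illustration of that outside scheduling, here is a minimal wrapper of the sort we used, suitable for cron or Windows Task Scheduler. The install path and job file are hypothetical; it relies on kitchen.sh returning a nonzero exit code when a job fails:

```python
import subprocess
import sys

# Assumption: PDI lives under /opt/pentaho, and the job file path
# is hypothetical. kitchen.sh exits nonzero when the job fails.
KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"
JOB_FILE = "/etl/jobs/nightly_load.kjb"

def run_job() -> int:
    result = subprocess.run(
        [KITCHEN, f"-file={JOB_FILE}", "-level=Basic"],
        capture_output=True, text=True,
    )
    # Surface the PDI log so the scheduler captures it.
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_job())
```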
We used version 7 for a while and we were reluctant to upgrade to version 9 because we had an 834 configuration, meaning a government standardized feed that our developer spent two years building. There was an issue whenever we tried to run those feeds on version 9, so we were reluctant to upgrade because things were working on 7. We ended up finding out that it didn't take much work for us to fix the problem that we were having with version 9 and, eventually, we moved to it. With every version upgrade of anything, there are going to be pros and cons.
Depending on what someone needs it for, if it's a small project and they don't want to pay for an enterprise solution, I would recommend it and give it a nine out of 10. The finicky things were a little frustrating, but the fact that it's free, can be deployed easily, and can fulfill a lot of needs on a small scale is a plus. If it were for a larger company that needed an enterprise solution, I wouldn't recommend it. In that case, it would be one out of 10.
For a smaller company or one with a smaller budget, a company that doesn't have highly complex ETL needs, Pentaho is definitely a great option. If a company has the budget and has really specific needs and large data sets, I would suggest looking elsewhere.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Associate Partner at a tech services company with 51-200 employees
Efficient data integration with cost savings, but may be less efficient with large data volumes
Pros and Cons
- "It is easy to use, install, and start working with."
- "Larger data jobs take more time to execute."
What is our primary use case?
I have a team who has experience with integration. We are service providers and partners. Generally, clients buy the product directly from the company.
How has it helped my organization?
It is easy to use, install, and start working with. This is one of its advantages compared to competing products. The relationship between price and functionality is excellent, resulting in time and money savings of between twenty-five and thirty percent.
What is most valuable?
One of the advantages is that it is easy to use, install, and start working with. For certain volumes of data, the solution is very efficient.
What needs improvement?
Pentaho may be less efficient for large volumes of data compared to other solutions like Talend or Informatica. Larger data jobs take more time to execute.
Pentaho is more appropriate for jobs with smaller volumes of data.
For how long have I used the solution?
I have used the solution for more than ten years.
What do I think about the stability of the solution?
The solution is stable. Generally, one person can manage and maintain it.
What do I think about the scalability of the solution?
Sometimes, for large volumes of data, a different solution might be more appropriate. Pentaho is suited for smaller volumes of data, while Talend is better for larger volumes.
How are customer service and support?
Based on my experience, the solution has been reliable.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
We did a comparison between Talend and Pentaho last year.
How was the initial setup?
The initial setup is straightforward. It is easy to install and start working with.
What about the implementation team?
A team with experience in integration manages the implementation.
What was our ROI?
The relationship between price and functionality is excellent. It results in time and money savings of between twenty-five and thirty percent.
What's my experience with pricing, setup cost, and licensing?
Pentaho is cheaper than other solutions. The relationship between price and functionality means it provides good value for money.
Which other solutions did I evaluate?
We evaluated Talend and Pentaho.
What other advice do I have?
I'd rate the solution seven out of ten.
Which deployment model are you using for this solution?
On-premises
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Other
Disclosure: My company has a business relationship with this vendor other than being a customer. MSP
Senior Product Manager at a retailer with 10,001+ employees
Loads data into the required tables and can be plug-and-played easily
What is our primary use case?
The use cases involve loading the data into the required tables based on the transformations. We do a couple of transformations, and based on the business requirement, we load the data into the required tables.
What is most valuable?
It's a very lightweight tool. It can be plug-and-played easily and read data from multiple sources. It's a very good tool for small to large companies. People or customers can learn very easily to do the transformations for loading and migrating data. It's a fantastic tool in the open-source community.
When compared to other commercial ETL tools, this is a free tool where you can download and do multiple things that the commercial tools are doing. It's a pretty good tool when compared to other commercial tools. It's available in community and enterprise editions. It's very easy to use.
What needs improvement?
It is difficult to process huge amounts of data. We need to test it end-to-end to determine how much data it can process. With the Enterprise Edition, we can process the data.
For how long have I used the solution?
I have been using Pentaho Data Integration and Analytics for 11-12 years.
What do I think about the stability of the solution?
We process a small amount of data, but it's pretty good.
What do I think about the scalability of the solution?
It's scalable across any machine.
How are customer service and support?
Support is satisfactory. A few of my colleagues also work with Hitachi to provide solutions whenever a ticket or Jira issue is raised.
How would you rate customer service and support?
Positive
How was the initial setup?
Installation is very simple. Whether you choose the community or the enterprise edition, it's extremely simple, and you can install it very easily.
One person is enough for the installation.
What's my experience with pricing, setup cost, and licensing?
The product is quite cheap.
What other advice do I have?
It can quickly implement slowly changing dimensions and efficiently read flat files, loading them into tables quickly. Additionally, running several copies of a step ("change number of copies to start") enables parallel partitioning. In the Enterprise Edition, you can restart your jobs from where they left off, a valuable feature for ensuring continuity. Detailed metadata integration is also very straightforward, which is an advantage. It is lightweight and can work on various systems.
Any technical guy can do everything end to end.
Overall, I rate the solution a ten out of ten.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Solution Integration Consultant II at a tech vendor with 201-500 employees
Reduces the effort required to build sophisticated ETLs
Pros and Cons
- "We use Lumada’s ability to develop and deploy data pipeline templates once and reuse them. This is very important. When the entire pipeline is automated, we do not have any issues in respect to deployment of code or with code working in one environment but not working in another environment. We have saved a lot of time and effort from that perspective because it is easy to build ETL pipelines."
- "It could be better integrated with programming languages, like Python and R. Right now, if I want to run a Python code on one of my ETLs, it is a bit difficult to do. It would be great if we have some modules where we could code directly in a Python language. We don't really have a way to run Python code natively."
What is our primary use case?
My work primarily revolves around data migration and data integration for different products. I have used them in different companies, but for most of our use cases, we use it to integrate all the data that needs to flow into our product. We can also have outbound data from our product when we need to send it to various integration points. We use this product extensively to build ETLs for those use cases.
We are developing ETLs for the inbound data into the product as well as outbound to various integration points. Also, we have a number of core ETLs written on this platform to enhance our product.
We have two different modes that we offer: one is on-premises and the other is on the cloud. On the cloud, we have an EC2 instance on AWS; we have installed the product on that EC2 instance, and we call it the ETL server. We also have another server for the application where the product is installed.
We use version 8.3 in the production environment, but in the dev environment, we use version 9 and onwards.
How has it helped my organization?
We have been able to reduce the effort required to build sophisticated ETLs. Also, we now are in the migration phase from an on-prem product to a cloud-native application.
We use Lumada’s ability to develop and deploy data pipeline templates once and reuse them. This is very important. When the entire pipeline is automated, we do not have any issues in respect to deployment of code or with code working in one environment but not working in another environment. We have saved a lot of time and effort from that perspective because it is easy to build ETL pipelines.
What is most valuable?
The metadata injection feature is the most valuable because we have used it extensively to build frameworks, where we have used it to dynamically generate code based on different configurations. If you want to make a change at all, you do not need to touch the actual code. You just need to make some configuration changes and the framework will dynamically generate code for that as per your configuration.
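The Metadata Injection step itself is configured inside PDI, but the config-driven idea can be sketched at the invocation level. The example below, with hypothetical client settings and paths, reuses one transformation for many clients by passing -param values that must match parameters declared in the transformation:

```python
import subprocess

# Hypothetical per-client configuration; in a real framework this
# would live in a lookup table or config file, not in code.
CLIENTS = {
    "client_a": {"SOURCE_DIR": "/feeds/client_a", "DELIMITER": ","},
    "client_b": {"SOURCE_DIR": "/feeds/client_b", "DELIMITER": "|"},
}

PAN = "/opt/pentaho/data-integration/pan.sh"  # assumed install path
TEMPLATE = "/etl/templates/generic_feed.ktr"  # one reusable transformation

for client, params in CLIENTS.items():
    # pan.sh accepts -param:NAME=value; the names here must match the
    # parameters declared in the transformation itself.
    args = [PAN, f"-file={TEMPLATE}"] + [
        f"-param:{name}={value}" for name, value in params.items()
    ]
    subprocess.run(args, check=True)
```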
We have a UI where we can create our ETL pipelines as needed, which is a key advantage for us. This is very important because it reduces the time to develop for a given project. When you need to build the whole thing using code, you need to do multiple rounds of testing. Therefore, it helps us to save some effort on the QA side.
Hitachi Vantara's roadmap has a pretty good list of features that they have been releasing with every new version. For instance, in version 9, they have included metadata injection for some of the steps. The most important elements of this roadmap to our organization’s strategy are the data-driven approach that this product is taking and the fact that we have a very low-code platform. Combining these two is what gives us the flexibility to utilize this software to enhance our product.
What needs improvement?
It could be better integrated with programming languages, like Python and R. Right now, if I want to run a Python code on one of my ETLs, it is a bit difficult to do. It would be great if we have some modules where we could code directly in a Python language. We don't really have a way to run Python code natively.
For how long have I used the solution?
I have been working with this tool for five to six years.
What do I think about the stability of the solution?
They are making it a lot more stable. Earlier, before it was with Hitachi, stability used to be an issue. Now, we don't see those kinds of issues or bugs within the platform because it has become far more stable. We also see a lot of new big data features, such as connecting to the cloud.
What do I think about the scalability of the solution?
Lumada is flexible to deploy in any environment, whether on-premises or the cloud, which is very important. When we are processing data in batches on certain days, e.g., at the end of the week or month, we might have more data and need more processing power or RAM. However, most times, there might be very minimal usage of that CPU power. In that way, the solution has helped us to dynamically scale up, then scale down when we see that we have more data that we need to process.
The scalability is another key advantage of this product versus some of the others in the market since we can tweak and modify a number of parameters. We are really impressed with the scalability.
We have close to 80 people who are using this product actively. Their roles go all the way from junior developers to support engineers. We also have people who have very little coding knowledge and are more into the management side of things utilizing this tool.
How are customer service and support?
I haven't been part of any technical support discussions with Hitachi.
Which solution did I use previously and why did I switch?
We are very satisfied with our decision to purchase Hitachi's product. Previously, we were using another ETL service that had a number of limitations. It was not a modern ETL service at all. For anything, we had to rely on another third-party software. Then, with Hitachi Lumada, we don't have to do that. In that way, we are really satisfied with the orchestration or cloud-native steps that they offer. We are really happy on those fronts.
We were using something called Actian Services, which had fewer features and ended up costing more than the enterprise edition of Pentaho.
We could not do a number of things on Actian. For instance, we were unable to call other APIs or connect to an S3 bucket. It was not a very modern solution. Whereas, with Pentaho, we could do all these things as well as have great marketplaces where we could find various modules and third-party plugins. Those features were simply not there in the other tool.
How was the initial setup?
The initial setup was pretty straightforward.
What about the implementation team?
We did not have any issues configuring it, even in my local machine. For the enterprise edition, we have a separate infrastructure team doing that. However, for at least the community edition, the deployment is pretty straightforward.
What was our ROI?
We have seen at least 30% savings in terms of effort. That has helped us to price our service and products more aggressively in the market, helping us to win more clients.
It has reduced our ETL development time. Per project, it has reduced by around 30% to 35%.
We can price more aggressively. We were actually able to win projects because we had great reusability of ETLs. A code that was used for one client can be reused with very minimal changes. We didn't have any upfront cost for kick-starting projects using the Community edition. It is only the Enterprise edition that has a cost.
What's my experience with pricing, setup cost, and licensing?
For most development tasks, the Community edition should be sufficient. It depends on the type of support that you require for your production environment.
Which other solutions did I evaluate?
We did evaluate SSIS since our database is based on Microsoft SQL server. SSIS comes with any purchase of an SQL Server license. However, even with SSIS, there were some limitations. For example, if you want to build a package and reuse it, SSIS doesn't provide the same kinds of abilities that Pentaho does. The amount of reusability reduces when we try to build the same thing using SSIS. Whereas, in Pentaho, we could literally reuse the same code by using some of its features.
SSIS comes with the SQL Server and is easier to maintain, given that there are far more people who would have knowledge of SSIS. However, if I want to do PGP encryption or make an API connection, it is difficult. Creating a reusable package is not that easy, which would be the con for SSIS.
What other advice do I have?
The query performance depends on the database. It is more likely to be good if you have a good database server with all the indexes and bells and whistles of a database. However, from a data integration tool perspective, I am not seeing any issues with respect to query performance.
We do not build visualization features that much with Hitachi. For reporting purposes, we have been using one of the tools from the product suite, then preparing the data accordingly.
We use this for all the projects that we are currently running. Going forward, we will be sticking only to using this ETL tool.
We haven't had any roadblocks using Lumada Data Integration.
On a scale of one to 10, I would recommend Hitachi Vantara to a friend or colleague as a nine.
If you need to build ETLs quickly in a low-code environment, where you don't want to spend a lot of time on the development side of things but it is a little difficult to find resources, it is worth training people in this product. It ends up saving a lot of time and resources on the development side of projects.
Overall, I would rate the product as a nine out of 10.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Head of Data Engineering at a tech consulting company with 201-500 employees
The drag-and-drop interface makes it easier to use than some competing products
Pros and Cons
- "We can schedule job execution in the BA Server, which is the front-end product we're using right now. That scheduling interface is nice."
- "The web interface is rusty, and the biggest problem with Pentaho is debugging and troubleshooting. It isn't easy to build the pipeline incrementally. At least in our case, it's hard to find a way to execute step by step in the debugging mode."
What is our primary use case?
We use Pentaho for small ETL integration jobs and cross-storage analytics. It's nothing too major. We have it deployed on-premise, and we are still on the free version of the product.
In our case, processing takes place on the virtual machine where we installed Pentaho. We can ingest data from different on-premises and cloud locations, but we still don't carry out the data processing phase in any environment other than the one where the VM is running.
How has it helped my organization?
At the start of my team's journey at the company, it was difficult to do cross-platform storage analytics. That means ingesting data from different analytics sources inside a single storage machine and building out KPIs and some other analytics.
Pentaho was a good start because we can create different connections and import data. We can then run some global queries on that data from various sources. We've been able to replace some of our other data tools, like Talend, for managing our data warehouse workflow. Later, we adopted some other cloud technologies, so we don't primarily use Pentaho for those use cases anymore.
What is most valuable?
Pentaho is flexible with a drag-and-drop interface that makes it easier to use than some other ETL products. For example, the full stack we are using in AWS does not have drag-and-drop functionality. Pentaho was a good option at the start of this journey.
We can schedule job execution in the BA Server, which is the front-end product we're using right now. That scheduling interface is nice.
What needs improvement?
It's difficult to use custom code. Implementing a pipeline with pre-built blocks is straightforward, but it's harder to insert custom code inside the pre-built blocks. The web interface is rusty, and the biggest problem with Pentaho is debugging and troubleshooting. It isn't easy to build the pipeline incrementally. At least in our case, it's hard to find a way to execute step by step in the debugging mode.
Repository management is also a shortcoming, but I'm not sure if that's just a limitation of the free version. I'm not sure if Pentaho can use an external repository. It's a flat-file repository inside a virtual machine. Back in the day, we would want to deploy this repository on a database.
Pentaho's data management covers ingestion and insights but I'm not sure if it's end-to-end management—at least not in the free version we are using—because some of the intermediate steps are missing, like data cataloging and data governance features. This is the weak spot of our Pentaho version.
For how long have I used the solution?
We implemented Hitachi Pentaho some time ago. We have been using it for around five or six years. I was using the product at the time, but now I am the head of the data engineering team, so I don't use it anymore but I know Pentaho's strengths and weaknesses.
What do I think about the stability of the solution?
Pentaho is relatively stable, but I average about one failed job every month.
What do I think about the scalability of the solution?
I rate Pentaho six out of 10 for scalability. The scalability depends on how you deploy it. In our case, the on-premise virtual machine is relatively small and doesn't have a lot of resources. That is why Pentaho does not handle big datasets well in our case.
I'm also unsure if we can deploy Pentaho in the cloud. So when you're not dealing with the cloud, scalability is always limited. We cannot indefinitely pump resources into a virtual machine.
Currently, we have five or six active workflows running each night. Some of them are ingesting data from ADU. Others take data from AWS Redshift or on-premise Oracle. In terms of people, three other people on the data engineering team and I are actively using Pentaho.
Which solution did I use previously and why did I switch?
We used Talend, which is a Java-based solution and is made for people with proficiency in Java. The entire analytics ecosystem is transitioning to more flexible runtimes, including Python and other languages. Java was not ideal for our data analytics journey.
Right now, we are using NiFi, a tool in the cloud ecosystem that has a similar drag-and-drop interface, but it's embedded in the ADU framework. We're also using another drag-and-drop tool on AWS, but not AWS Glue Studio.
What was our ROI?
We've seen a 50 percent reduction in our ETL development time using the free version of Pentaho. That saves about 1,000 euros per week, so at least 50,000 euros annually.
What other advice do I have?
I rate Pentaho eight out of 10. It's a perfect pick for data teams that are getting started and more business-oriented data teams. It's good for a data analyst who isn't so tech-savvy. It is flexible and easy to use.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
COO / CTO at a tech services company with 11-50 employees
We can create pipelines with minimal manual or custom coding, and we can quickly implement what we need with its drag-and-drop interface
Pros and Cons
- "Its drag-and-drop interface lets me and my team implement all the solutions that we need in our company very quickly. It's a very good tool for that."
- "In terms of the flexibility to deploy in any environment, such as on-premise or in the cloud, we can do the cloud deployment only through virtual machines. We might also be able to work on different environments through Docker or Kubernetes, but we don't have an Azure app or an AWS app for easy deployment to the cloud. We can only do it through virtual machines, which is a problem, but we can manage it. We also work with Databricks because it works with Spark. We can work with clustered servers, and we can easily do the deployment in the cloud. With a right-click, we can deploy Databricks through the app on AWS or Azure cloud."
What is our primary use case?
We are a service delivery enterprise, and we have different use cases. We deliver solutions to other enterprises, such as banks. One of the use cases is for real-time analytics of the data we work with. We take CDC data from Oracle Database, and in real-time, we generate a product offer for all the products of a client. All this is in real-time. The client could be at the ATM or maybe at an agency, and they can access the product offer.
We also use Pentaho within our organization to integrate all the documents and Excel spreadsheets from our consultants and have a dashboard of the hours spent on different projects.
In terms of version, currently, Pentaho Data Integration is on version 9, but we are using version 8.2. We have all the versions, but we work with the most stable one.
In terms of deployment, we have two different types of deployments. We have on-prem and private cloud deployments.
How has it helped my organization?
I work with a lot of data. We have about 50 terabytes of information, and working with Pentaho Data Integration along with other databases is very fast.
Previously, I had three people to collect all the data and integrate all Excel spreadsheets. To give me a dashboard with the information that I need, it took them a day or two. Now, I can do this work in about 15 minutes.
It enables us to create pipelines with minimal manual coding or custom coding efforts, which is one of its best features. Pentaho is one of the few tools with which you can do anything you can imagine. Our business is changing all the time, and it is best for our business if I can use less time to develop new pipelines.
It provides the ability to develop and deploy data pipeline templates once and reuse them. I use them at least once a day. It makes my daily life easier when it comes to data pipelines.
Previously, I have used other tools such as Integration Services from Microsoft, Data Services for SAP, and Informatica. Pentaho reduces the ETL implementation time by 5% to 50%.
What is most valuable?
Pentaho from Hitachi is a suite of different tools. Pentaho Data Integration is a part of the suite, and I love the drag-and-drop functionality. It is the best.
Its drag-and-drop interface lets me and my team implement all the solutions that we need in our company very quickly. It's a very good tool for that.
What needs improvement?
Their client support is very bad. It should be improved. There is also not much information on Hitachi forums or Hitachi web pages. It is very complicated.
In terms of the flexibility to deploy in any environment, such as on-premise or in the cloud, we can do the cloud deployment only through virtual machines. We might also be able to work on different environments through Docker or Kubernetes, but we don't have an Azure app or an AWS app for easy deployment to the cloud. We can only do it through virtual machines, which is a problem, but we can manage it. We also work with Databricks because it works with Spark. We can work with clustered servers, and we can easily do the deployment in the cloud. With a right-click, we can deploy Databricks through the app on AWS or Azure cloud.
For how long have I used the solution?
I have been using Pentaho Data Integration for 12 years. The first version that I tested and used was 3.2 in 2010.
How are customer service and support?
Their technical support is not good. I would rate them 2 out of 10 because they don't have good technical skills to solve problems.
How would you rate customer service and support?
Negative
How was the initial setup?
It is very quick and simple. It takes about five minutes.
What other advice do I have?
I have a good knowledge of this solution, and I would highly recommend it to a friend or colleague.
It provides a single, end-to-end data management experience from ingestion to insights, but we have to create different pipelines to generate the metadata management. It's a little bit laborious to work with Pentaho, but we can do that.
I've heard a lot of people say it's complicated to use, but Pentaho is one of the few tools where you can do anything you can imagine. It is very good and quite simple, but you need to have the right knowledge and the right people to handle the tool. The skills needed to create a business intelligence solution or a data integration solution with Pentaho are problem-solving logic and maybe database knowledge. You can develop new steps, and you can develop new functionality in Pentaho Lumada, but you must have the knowledge of advanced Java programming. Our experience, in general, is very good.
Overall, I am satisfied with our decision to purchase Hitachi's product services and solutions. My satisfaction level is at an eight out of ten.
I am not much aware of the roadmap of Hitachi Vantara. I don't read much about that.
I would rate this solution an eight out of ten.
Disclosure: My company has a business relationship with this vendor other than being a customer. Partner
Data Engineer at a tech vendor with 1,001-5,000 employees
We can parallelize the extraction from various servers simultaneously, accelerating our extraction
Pros and Cons
- "The area where Lumada has helped us is in the commercial area. There are many extractions to compose reports about our sales team performance and production steps. Since we are using Lumada to gather data from each industry in each country. We can get data from Argentina, Chile, Brazil, and Colombia at the same time. We can then concentrate and consolidate it in only one place, like our data warehouse. This improves our production performance and need for information about the industry, production data, and commercial data."
- "Lumada could have more native connectors with other vendors, such as Google BigQuery, Microsoft OneDrive, Jira systems, and Facebook or Instagram. We would like to gather data from modern platforms using Lumada, which is a better approach. As a comparison, if you open Power BI to retrieve data, then you can get data from many vendors with cloud-native connectors, such as Azure, AWS, Google BigQuery, and Athena Redshift. Lumada should have more native connectors to help us and facilitate our job in gathering information from these new modern infrastructures and tools."
What is our primary use case?
My primary use case is to provide integration with my source systems, such as ERP systems and SAP systems, and web-based systems, having them primarily integrate with my data warehouse. For this process, I use ETL to treat and gather all the information from my first system, then consolidate it in my data warehouse.
How has it helped my organization?
We needed to gather data from many servers at my company. We had probably 10 or 12 equivalent databases spread around the world, e.g., Brazil, Paraguay, and Chile, with an instance in each country. These servers are Microsoft SQL Server-based. We are using Lumada to get the data from these international databases. We can parallelize the extraction from various servers at the same time because we have the same structure, schemas, and tables in each of these SQL Server-based servers. This provides good value for us, as we can extract data in parallel at the same time, which accelerates our extraction.
In one integration process, I can retrieve data from 10 or 12 servers at the same time in one transformation. In the past, using SQL Server or other manual tools, we needed to have 10 or 12 different processes, one per server. Using Lumada in parallel accelerates our extraction. The tools that Lumada provides enable us to transform the data during this process, integrating the data in our data warehouse with good performance.
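The same parallel idea can be sketched outside of Lumada for comparison. This minimal Python illustration, with hypothetical server names and query, fans the extraction out across identical SQL Server instances with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

import pyodbc  # assumes an ODBC driver for SQL Server is installed

# Hypothetical country servers sharing the same schema and tables.
SERVERS = ["sql-brazil", "sql-paraguay", "sql-chile"]
QUERY = "SELECT country, SUM(amount) FROM sales GROUP BY country"

def extract(server: str):
    conn = pyodbc.connect(
        f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};"
        "DATABASE=erp;Trusted_Connection=yes;"
    )
    try:
        # pyodbc's execute() returns the cursor, so fetchall() chains.
        return server, conn.cursor().execute(QUERY).fetchall()
    finally:
        conn.close()

# Query all servers at the same time instead of one after another,
# mirroring what the parallel transformation does in Lumada.
with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
    for server, rows in pool.map(extract, SERVERS):
        print(server, len(rows), "rows")
```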
Because Lumada uses Java virtual machines, we can deploy and operate on whatever operating system we want. We can deploy on Linux or Windows, since we had both a Linux version and a Windows version of Lumada.
It is simple to deploy my ETLs because Lumada has the Pentaho Server version. I installed the desktop version so we can deploy our transformations in the repository. We install our own Lumada on a server, then we have a web interface to schedule our ETLs. We are also able to reschedule our ETLs. We can schedule the hour that we want to run our ETL processes and transformations. We can schedule how many times we want to process the data. We can save all our transformations in a repository located in a Pentaho Server. Since we have a repository, we can save many versions of our transformation, such as 1.0, 1.1, and 1.2, in the repository. I can save four or five versions of a transformation. I can ask Lumada to run only the last version that I saved in the database.
Lumada offers a web interface to follow these transformations. We can check the logs to see whether a transformation completed successfully or whether there was a network or database issue. Using Lumada, there is a feature where we can get logs at execution time. We can also be notified by email when transformations succeed or fail. We have a log file for each process that we schedule on Pentaho Server.
The area where Lumada has helped us is in the commercial area. There are many extractions to compose reports about our sales team performance and production steps. Since we are using Lumada to gather data from each industry in each country, we can get data from Argentina, Chile, Brazil, and Colombia at the same time. We can then concentrate and consolidate it in only one place, our data warehouse. This improves our production performance and meets our need for information about the industry, production data, and commercial data.
What is most valuable?
The features that I use the most are the Microsoft Excel input, table input, S3 CSV input, and CSV input. Today, the features that are most valuable to me are the table input, then the CSV input. These are both very important. We extract data from the table system for our transactional databases, which are commonly used. We also use the CSV input to get data from AWS S3 and our data lake.
In Lumada, we can parallelize the steps. The performance to query the databases for me is good, especially for transactional databases. Because Lumada uses Java, we can adjust the amount of memory that we want to use to do transformations. So, it is accessible. It's possible to set up the amount of memory that we want to use in the Java VM, which is good. Therefore, Lumada is good, especially with transactional database extraction. It has good performance, not higher performance, but good performance as we query data, and it is possible to parallelize the query. For example, if we have three or four servers to get the data, then we can retrieve the data at the same time, in parallel, in these databases. This is good because we don't need to wait while one of the extractions finishes.
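To illustrate the memory adjustment mentioned above, the PDI launch scripts honor the PENTAHO_DI_JAVA_OPTIONS environment variable for JVM sizing. A short sketch follows, with a hypothetical install path and transformation file, and heap values to be tuned per workload:

```python
import os
import subprocess

env = os.environ.copy()
# PDI launch scripts honor PENTAHO_DI_JAVA_OPTIONS for JVM sizing;
# the heap values here are examples, tune them to the workload.
env["PENTAHO_DI_JAVA_OPTIONS"] = "-Xms1g -Xmx4g"

subprocess.run(
    ["/opt/pentaho/data-integration/pan.sh",        # assumed install path
     "-file=/etl/transformations/big_extract.ktr"],  # hypothetical file
    env=env,
    check=True,
)
```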
Using Lumada, we don't need to do many manual transformations because we have native components for many of our transformations. Thus, Lumada is a low-code alternative to gathering data with SQL, Python, or other transformation tools.
What needs improvement?
Lumada could have more native connectors with other vendors, such as Google BigQuery, Microsoft OneDrive, Jira systems, and Facebook or Instagram. We would like to gather data from modern platforms using Lumada, which is a better approach. As a comparison, if you open Power BI to retrieve data, then you can get data from many vendors with cloud-native connectors, such as Azure, AWS, Google BigQuery, and Athena Redshift. Lumada should have more native connectors to help us and facilitate our job in gathering information from these new modern infrastructures and tools.
For how long have I used the solution?
I have been using Lumada Data Integration for at least four years. I started using it in 2018.
How are customer service and support?
Because we are using the free version of Lumada, we have used only the support on the communities and forums on the Internet.
Lumada does have a paid version, where Hitachi provides specialized Lumada support.
How was the initial setup?
It is simple to deploy Lumada because we can deploy our transformation in three to five simple steps, saving our transformation in a repository.
I open the Pentaho Server web-based version, then I find the transformation that I deployed. I can schedule this transformation for the hour or recurrence at which I want it to run. It is easy because, at the end of the process, I can save my transformation and Lumada generates an XML file. We can send this XML file to any user of Lumada, who can open up this model and get the transformation that I developed. As a deployment process, it is straightforward, simple, and not complex.
What was our ROI?
Using Lumada compared to using SQL manually, ETL development time is half the time it took using a basic manual transformation.
What's my experience with pricing, setup cost, and licensing?
More types of connectors are available, but you need to pay for them.
You need to go through the paid version to have Hitachi Lumada specialized support. However, if you are using the free version, then you will have only the community support. You will depend on the releases from Hitachi to solve some problem or questions that you have, such as bug fixes. You will need to wait for the newest versions or releases to solve these types of problems.
Which other solutions did I evaluate?
I also use Talend Data Integration. For me, Lumada is straightforward and makes it simpler to build transformations with drag and drop. Comparing Talend and Lumada, I think Lumada is easier to use than Talend, and the comprehension needed is less with Lumada than with Talend. I can learn Lumada in a day and proceed with my transformations using some tutorials, since Lumada is easier to use. Talend is a more complex solution with more complex transformations.
In Talend's open version, i.e., the free version, you don't have a Talend server to deploy models to. If you want to schedule some transformation, you need to use the operating system where you have the infrastructure to run transformations and deploy them. For example, we deployed a data model in Talend, but we needed to use Windows Scheduler to schedule the packages in Talend to process the data in the free version. Whereas, in the free version of Lumada, we already had the web-based server. Therefore, we can run our transformations and deploy them on the server. We can schedule them in a web interface, which guides us through scheduling the data and checking our logs to see how many transformations run at a time. This is the biggest difference between Talend and Lumada.
What other advice do I have?
I don't use many templates. I use the solution based on a case-by-case basis.
Considering that Lumada is a free tool, I would rate it as nine out of 10 for the free version.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.