What is our primary use case?
I still use this tool on a daily basis. Comparing it to my experience with other ETL tools, the system I created using this tool was quite straightforward. It involves extracting data from MySQL, exporting it to CSV, storing it on S3, and then loading it into Redshift.
The PDI Kettle Job and Kettle Transformation are bundled by a shell script, then scheduled and orchestrated by Jenkins.
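The flow described above (query the source, write CSV, stage on S3, load into Redshift) can be sketched roughly as follows. This is only an illustration, not our actual Kettle transformation: `sqlite3` stands in for MySQL so the sketch runs anywhere, the S3 upload is left out, and the table name, bucket, and IAM role are made-up placeholders.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, query):
    """Run a query and return the result as CSV text with a header row."""
    cursor = conn.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


def build_copy_statement(table, s3_uri, iam_role):
    """Build a Redshift COPY statement for a CSV file that has a header row."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )


# Assumption: sqlite3 stands in for MySQL so no database server is needed.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

csv_text = export_query_to_csv(source, "SELECT id, amount FROM orders ORDER BY id")

# The upload of csv_text to S3 would happen here; it is omitted in this sketch.
# All names below are hypothetical placeholders.
copy_sql = build_copy_statement(
    "analytics.orders",
    "s3://example-bucket/exports/orders.csv",
    "arn:aws:iam::123456789012:role/example-redshift-role",
)
print(csv_text)
print(copy_sql)
```

In the real system each of these steps is a drag-and-drop step inside a transformation rather than hand-written code.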
We continue to use this tool primarily because many of our legacy systems still rely on it. However, our new solution is mostly based on Airflow, and we are currently in the transition phase. Airflow is a data orchestration tool that predominantly uses Python for ETL processes, scheduling, and issue monitoring, all within a unified system.
How has it helped my organization?
In my current company, this solution has a limited impact as we predominantly employ it for handling older and simpler ETL tasks.
While it serves well in setting up ETL tools on our dashboard, its functionalities can now be found in several other tools available in the market. Consequently, we are planning a complete transition to Airflow, a more versatile and scalable platform. This shift is scheduled to be implemented over the next six months, aiming to enhance our ETL capabilities and align with modern data management practices.
What is most valuable?
This solution offers drag-and-drop tools that require minimal scripting. Even if you do not come from an IT background or have no software engineering experience, you can use it. It is quite intuitive, allowing you to drag and drop many functions.
The abstraction is quite good.
If you're familiar with the product, you'll know it has transformation abstractions and job abstractions. We build smaller units of work as Kettle transformations and larger ones as Kettle jobs. Whether you're familiar with Python or have no scripting background at all, the product is useful.
For larger data, we use Spark.
The solution enables us to create pipelines with minimal manual or custom coding. Even without advanced scripting experience, it is possible to build ETL pipelines. I recently trained a management graduate who had no experience with SQL. Within three months, he became quite fluent, despite never having used ETL tools before.
The importance of handling pipeline creation with minimal coding depends on the team. If we switch to Airflow, more time is needed to teach fluency in the ETL tool. With these product abstractions, I can compress the training time to three months. With Airflow, it would take more than six months to reach the same proficiency.
We use the solution's ability to develop and deploy data pipeline templates and reuse them.
The old system, created by someone prior to me in my organization, is still in use. It was developed a long time ago and is also used for some ad hoc reporting.
The ability to develop and deploy data pipeline templates once and reuse them is crucial to us. There are requests to create pipelines, which I then deploy on our server. The system needs to be robust enough to handle scheduling without failure.
We appreciate the automation. It's hard to imagine how data teams would work if everything were done on an ad hoc basis. Automation is essential. In my organization, 95% of our data distributions are automated, and only 5% are ad hoc. Without this solution, we would have to query data manually, process it in spreadsheets, and then distribute it within the organization. Robust automation is key.
We can easily deploy the solution on the cloud, specifically on AWS. I haven't tried it on another server. We deploy it on our AWS EC2, but we develop it on local computers, including both Windows and MacBooks.
I have personally used it on both. Developing on Windows is easier to navigate. On MacBooks, the display becomes problematic when enabling dark mode.
The solution has reduced our ETL development time compared to scripting. However, this largely depends on your experience.
What needs improvement?
Five years ago, when I had less experience with scripting, I would have definitely used this product over Airflow, as the abstraction is quite intuitive and easier for me to work with. Back then, I would have chosen this product over other tools that use pure scripting, as it would have significantly reduced the time required to develop ETL tools. However, this is no longer the case, as I now have more familiarity with scripting.
When I first joined my organization, I was still using Windows. Developing the ETL system on Windows is quite straightforward. However, when I switched to a MacBook, it became quite a hassle. To open the application, we had to first open the terminal, navigate to the solution's directory, and then run the executable file. Additionally, the display becomes quite problematic when dark mode is enabled on a MacBook.
Therefore, developing on a MacBook is quite a hassle, whereas developing on Windows is not much different from using other ETL tools on the market, like SQL Server Integration Services, Informatica, etc.
For how long have I used the solution?
I have been consistently using this tool since I joined my current company, which was approximately one year ago.
What do I think about the stability of the solution?
The performance is good. I have not tested the product at its bleeding edge. We only perform simple jobs. In terms of data, we extract it from MySQL and export it to CSV. There are only millions of data points, not billions. So far, it has met our expectations and is quite good for a smaller number of data points.
What do I think about the scalability of the solution?
I'm not sure that the product could keep up with significant data growth. It can be useful for millions of data points, but I haven't explored its capability with billions of data points. I think there are better solutions available on the market. This applies to other drag-and-drop ETL tools as well, like SQL Server Integration Services, Informatica, etc.
How are customer service and support?
We don't really use technical support. The version we are currently using is no longer supported by the vendor, and we have not yet updated to a newer version.
Which solution did I use previously and why did I switch?
We're moving to Airflow. The switch was mostly due to debugging problems. If you're familiar with SQL Server Integration Services, the ETL tool from Microsoft, you'll know it has quite intuitive debugging functions. You can easily identify which transformation has failed or where an error has occurred. However, in our current solution, my colleagues have reported that it is difficult to pinpoint the source of errors directly.
Airflow is highly customizable and not as rigid as our current product. We can deploy simple ETL tools as well as machine learning systems on Airflow. Airflow primarily uses Python, which our team is quite familiar with. Currently, only two out of 27 people on our team handle this solution, so not enough people know how to use it.
How was the initial setup?
There are no separations between the deployment and other teams. Each of our teams acts as individual contributors. We handle the entire implementation process, from face-to-face business meetings, setting timelines, developing the tools, and defining the requirements, to production deployment.
The initial setup is straightforward. Currently, the use of version control in our organization is quite loose. We are not using any version control software. Deployment is as simple as putting the Kettle transformation file onto our EC2 server and overwriting the old file.
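As a rough illustration of that overwrite-style deployment, the sketch below copies a new transformation file over the deployed one. The timestamped backup is an addition of mine for the example, not something we currently do, and the file names are hypothetical. Getting the file onto the EC2 server in the first place is not shown.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def deploy_with_backup(new_file, target_file):
    """Overwrite target_file with new_file, keeping a timestamped copy of the old one.

    Returns the backup path, or None if there was no previous file.
    """
    new_file, target_file = Path(new_file), Path(target_file)
    backup = None
    if target_file.exists():
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        backup = target_file.with_name(f"{target_file.name}.{stamp}.bak")
        shutil.copy2(target_file, backup)
    shutil.copy2(new_file, target_file)
    return backup


# Demo in a temporary directory so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "deployed").mkdir()
    new = tmp / "load_orders.ktr"               # hypothetical new version
    old = tmp / "deployed" / "load_orders.ktr"  # hypothetical deployed version
    new.write_text("<transformation>v2</transformation>")
    old.write_text("<transformation>v1</transformation>")

    backup_path = deploy_with_backup(new, old)
    deployed_text = old.read_text()
    backup_text = backup_path.read_text()
```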
What's my experience with pricing, setup cost, and licensing?
I'm not really sure about the pricing of the product. I'm not involved in procurement or commissioning.
What other advice do I have?
We deploy it on our AWS EC2 server, although development happens on our local machines. We bundle the jobs in shell scripts, and Jenkins runs those scripts.
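Our wrappers are shell scripts, but the idea can be sketched in Python: build the command line for Kitchen, the PDI command-line runner for Kettle jobs, and let a non-zero exit code fail the scheduled run. The install path, job file, and `RUN_DATE` parameter below are hypothetical placeholders, not our real setup.

```python
import subprocess


def build_kitchen_command(kitchen_path, job_file, params=None, log_level="Basic"):
    """Build the argument list for running a Kettle job with PDI's Kitchen CLI."""
    command = [kitchen_path, f"-file={job_file}", f"-level={log_level}"]
    # Named parameters are passed as -param:NAME=VALUE, sorted for a stable order.
    for name, value in sorted((params or {}).items()):
        command.append(f"-param:{name}={value}")
    return command


def run_job(command):
    """Run the job; check=True raises CalledProcessError on a non-zero exit
    code, so the calling scheduler (Jenkins here) can mark the run as failed."""
    subprocess.run(command, check=True)


command = build_kitchen_command(
    "/opt/pdi/kitchen.sh",                  # hypothetical install path
    "/opt/etl/jobs/mysql_to_redshift.kjb",  # hypothetical job file
    params={"RUN_DATE": "2024-01-31"},      # hypothetical named parameter
)
print(" ".join(command))
```

`run_job` is defined but not called here, since it needs a PDI installation on the machine.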
I'd rate the solution seven out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: I am a real user, and this review is based on my own experience and opinions.