What is our primary use case?
There are a few use cases we have for Apache Airflow, one being government projects where we perform data operations on a monthly basis. For example, we'll collect data from various agencies, harmonize the data, and then produce a dashboard. In general, it's a BI use case, but focusing on social economy.
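To give a feel for what "harmonize" means in practice, here is a minimal sketch in plain Python. The agencies, field names, and figures are all made up for illustration and are not from our actual projects; the idea is only that each source gets mapped to one shared schema before anything is aggregated for a dashboard.

```python
from collections import defaultdict

# Hypothetical raw records: two agencies reporting the same measure
# with different field names, casing, and units.
agency_a = [{"region": "North", "month": "2021-01", "households": 1200}]
agency_b = [{"Region_Name": "north ", "period": "2021-01", "households_thousands": 0.8}]


def harmonize_a(row):
    """Map an agency A record to the shared schema."""
    return {
        "region": row["region"].strip().title(),
        "month": row["month"],
        "households": int(row["households"]),
    }


def harmonize_b(row):
    """Map an agency B record to the shared schema, converting thousands to units."""
    return {
        "region": row["Region_Name"].strip().title(),
        "month": row["period"],
        "households": int(round(row["households_thousands"] * 1000)),
    }


def monthly_totals(rows):
    """Sum households per (region, month) across all harmonized records."""
    totals = defaultdict(int)
    for r in rows:
        totals[(r["region"], r["month"])] += r["households"]
    return dict(totals)


rows = [harmonize_a(r) for r in agency_a] + [harmonize_b(r) for r in agency_b]
totals = monthly_totals(rows)
```

In a real monthly run, each of these functions would be one step of the scheduled pipeline, and `totals` would feed the dashboard.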
We concentrate mainly on BI, and because my team members have strong technical backgrounds, we often fall back on open source tools like Airflow and our own coded solutions.
For a single project, typically three of us work on Airflow at a time: two data engineers and a system administrator. Our infrastructure model is hybrid, based both in the cloud and on-premises.
What is most valuable?
The best part of Airflow is its direct support for Python, especially because Python is so important for data science, engineering, and design. This makes the programmatic aspect of our work easy for us, and it means we can automate a lot.
It's such a natural fit because our engineers also work mainly in Python, and I think we also quite like that we don't have to learn different kinds of UIs. Airflow is based on standard software packages, so we don't have to learn anything new in the way of opinionated UIs from different vendors.
What needs improvement?
We're currently using version 1.10, but I understand that there are a lot of improvements in version 2. In the earlier version that we're using, we sometimes have problems with maintenance complexity. Actually using Airflow is okay, but maintaining it has been difficult.
When something fails, it's not that easy to troubleshoot what went wrong. Sometimes the UI becomes really slow and there's no easy way to diagnose the problem. For the most part, we have had to learn through trial and error how to operate it properly.
The UI is also not that attractive, and I feel that the user experience isn't that nice. Version 2 is supposedly better, but since I haven't tried it, I can still suggest improvements to the visual UI. We want to do the ETL as code, but having a nice visual UI to facilitate this process would be great, because it would mean we could also rely on non-technical staff rather than just the three solid technical staff we have here. If there were better features for the UI, like drag-and-drop, then we could expand its use to more of our team.
For how long have I used the solution?
I've been using Apache Airflow for about two and a half years.
What do I think about the stability of the solution?
I think how Apache Airflow works is great. We like the paradigm of ETL as code, which means you define your pipeline as code. People talk a lot about infrastructure as code these days, and the practice of ETL as code really fits into that philosophy.
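As a rough illustration of what "pipeline as code" means, here is a small sketch using only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow's API; the task names and the `run_pipeline` helper are hypothetical. It only shows the core idea that tasks and the dependencies between them are declared in code, and an execution order is derived from them.

```python
from graphlib import TopologicalSorter


# Each task is an ordinary Python function that reads and writes a shared context.
def collect(ctx):
    ctx["raw"] = [3, 1, 2]  # stand-in for data pulled from a source


def harmonize(ctx):
    ctx["clean"] = sorted(ctx["raw"])  # stand-in for cleaning and normalizing


def publish(ctx):
    ctx["report"] = {"count": len(ctx["clean"]), "max": ctx["clean"][-1]}


TASKS = {"collect": collect, "harmonize": harmonize, "publish": publish}

# Each key depends on the tasks in its set, so this encodes
# collect -> harmonize -> publish.
DEPENDENCIES = {"harmonize": {"collect"}, "publish": {"harmonize"}}


def run_pipeline():
    """Run every task once, in an order that respects the declared dependencies."""
    ctx = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx


order, ctx = run_pipeline()
```

Because the whole definition is just code, it can be versioned, reviewed, and tested like any other software, which is the part of this approach we value most.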
What do I think about the scalability of the solution?
We can scale it well, and it runs in the cloud, too. It's compatible with cloud-native technologies like Kubernetes, so it has no issues regarding elasticity.
How are customer service and support?
We contacted an Airflow developer for assistance once and it was a good experience.
Which solution did I use previously and why did I switch?
We like to explore different tools, mixing and matching them to our needs, but we have never really found another tool that suits us as well as Airflow. We tried looking into Talend and Alteryx, but we didn't find them suitable to our style or approach.
How was the initial setup?
As a first-time user, I found it complex and somewhat difficult to set up, as there are many components to put together. You've got your data portion, your scheduler portion, your web server portion, etc., and all of these parts have to be set up at the start.
It gets easier with the next project. You really need to acquire a feel for what you're doing, and once you get over that, it's not too bad.
What about the implementation team?
We implemented Airflow ourselves, with our two in-house data engineers and our system administrator. It took around three months to get it deployed initially, from concept into production. After that, the goal is just to operate it and keep it running.
What's my experience with pricing, setup cost, and licensing?
Although Airflow is open source software, commercial support is also available from Astronomer. We personally don't use the commercial support, but it's always an option if you don't mind the extra cost.
What other advice do I have?
I can recommend Apache Airflow, especially if there are serious data engineers on your team. If, on the other hand, you're looking to enable business users, then it's not suitable.
I would rate Apache Airflow an eight out of ten.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Other
*Disclosure: I am a real user, and this review is based on my own experience and opinions.