Running frequent jobs, such as every minute or five minutes, is not appropriate for Airflow. It is not suitable for real-time ETL tasks. Stream jobs are not its strength.
There is a minor issue with the manual work in Airflow, as everyday activities are managed manually. There is no dashboard for us to check all the Directed Acyclic Graphs (DAGs); a dashboard would help us analyze the work better.
The UI is a little bit outdated according to modern standards. The UI can be enhanced to support some modern standards. Maybe small things such as dark mode and some proper aesthetics can be implemented.
Apache Airflow improved workflow efficiency, but we had to find solutions for large workflows. For instance, a monthly workflow with 1200 jobs had to be split into three to four pieces as it struggled with large job numbers. Loading a workflow with 500 jobs could take 10 minutes, which wasn't acceptable. The most important feature Apache Airflow lacks is support for external configuration files. All classical schedulers like Control-M or Automic allow you to load workflow definitions from YAML, XML, or JSON files, but the tool requires you to write Python programs. Airflow only supports external configuration for variables, not for workflows. To address this, I created a YAML configuration file that I converted into Python programs, but this functionality is missing from Apache Airflow itself. All of its competitors have this feature. In Control-M, Automic, and IBM's scheduler, you can load workflows from XML, JSON, or YAML files.
I have not come across any challenges associated with the product. The scripts that we use in our company refer to the package dependencies in Python, but those are lost when Apache Airflow starts running for a particular test. The in-built package dependencies in Python have some issues in Apache Airflow, making it an area that needs improvement.
The solution lacks certain features. We cannot run real-time jobs in the solution. It supports only batch jobs. If we are using ETL pipelines, it can either be a batch job or a real-time job. Real-time jobs run continuously. They are not scheduled. Apache Airflow is for scheduled jobs, not real-time jobs. It would be a good improvement if the solution could run real-time jobs. Many connectors are available in the product, but some are still missing. We have to build a custom connector if it is not available. The solution must have more in-built connectors for improved functionality.
The automation capabilities could be improved; a visual workflow designer and a graphical tool to reduce coding would be very helpful. But for now, it's sufficient for our simple workflows.
Enhancements become necessary when scaling it up from a few thousand workflows to a more extensive scale of five thousand or ten thousand workflows. At this point, resource management and threading, become critical aspects. This involves optimizing the utilization of resources and threading within the Kubernetes VM ecosystem.
I have some issues with the solution's communication. The solution uses the same database or data set. Sometimes, we consume the same data and send it to a different place when doing a different DAG. When using the UI, I want to see that we use the same data set more than once.
Airflow is a pipeline for transferring code by clients, but for experimental model experiments, Apache Airflow does not have any solution. There is a need for more features on experimental evolution steps.
One improvement could be the inclusion of a plugin with a drag-and-drop feature. This graphical feature would be beneficial when dealing with connectivity and integration services like connecting to BigQuery or other systems. As a first-time user, although the documentation is available, it would be more user-friendly to have a drag-and-drop interface within the portal. Users could simply drag and drop components to create a pseudo-code, making it more flexible and intuitive. Therefore, I suggest having a drag-and-drop feature for a more user-friendly experience and better code management. Moreover, for admins, there should be improved logging capabilities because Apache Airflow does have logging, but it's limited to some database data. It would be better if everything goes into the server where it's hosted. Probably on the interface level. If something goes well for the developers.
There is an area for improvement in onboarding new people. They should make it simple for newcomers. Else, we have to put a senior engineer to operate it.
Lead of Monitoring Tech at a educational organization with 1,001-5,000 employees
Real User
Top 20
2023-06-28T06:37:50Z
Jun 28, 2023
Adding more automated components in Apache Airflow for basic things like exporting the data would be helpful. Apache Airflow is not that easy to use, but we have gotten used to it.
We have faced scenarios where Apache Airflow becomes non-responsive, leading to job failures. To resolve such situations, we had to manually reboot Apache Airflow since it doesn't provide an option to restart within the application. This necessitated modifying some configurations to initiate a restart of all Apache Airflow components. Although Apache Airflow is generally dependable, it may occasionally encounter glitches that can disrupt production flows and batches.
CEO - Founder / Principal Data Scientist / Principal AI Architect at Kanayma LLC
Real User
2022-12-01T16:03:48Z
Dec 1, 2022
The solution can be improved by creating a tool that allows us to do these kinds of things graphically instead of just writing scripts. Hence, the graphical user interface can be improved.
Apache Airflow could be improved by integrating some versioning principles. Currently, we have to swap some tags in our flow. It would be interesting if we can check the product and version all of the product at the same time comparing what scripts have changed from last year to this year, or last month to this month. For example, we have a flow for one project, to version it we need to check it one by one to identify which tags changed and which scripts changed. All of these need to be done manually.
Senior Software Engineer at a pharma/biotech company with 1,001-5,000 employees
Real User
2021-03-26T23:33:18Z
Mar 26, 2021
I am using a Celery Executor and I find that it crashes and I can't see any logs. I can only assume that it's a memory issue and have to blindly restart until eventually, it starts up again. One of the use cases is triggered by input rather than a batch process. For example, we receive a batch of data, it goes through tasks one, two, and three, and a new batch comes in, each subsequent task should be operating on just that data from the prior task. I am used to working on it as the output gets written to a table and then the next task selects all from that upstream table. It could be coded where you are only writing the data for that portion of the task. It could handle state machines and state changes as opposed to the batch proxy. I would like to see it more friendly for other use cases.
We're currently using version 1.10, but I understand that there's a lot of improvements in version 2. In the earlier version that we're using, we sometimes have problems with maintenance complexity. Actually using Airflow is okay, but maintaining it has been difficult. When something fails, it's not that easy to troubleshoot what went wrong. Sometimes the UI becomes really slow and there's no easy way to diagnose the problem. For the most part, we have had to learn through trial and error how to operate it properly. The UI is also not that attractive, and I feel that the user experience isn't that nice. Version 2 is supposedly better, but without having tried it, I could suggest more improvements in the visual UI. We want to do the ETL as code, but having a nice visual UI to facilitate this process would be great. Because that means we can also rely on non-technical staff, rather than just the three solid technical staff we have here. If there were better features for the UI, like drag-and-drop, then we could expand its use to more of our team.
Associate Director - Technologies at a tech services company with 51-200 employees
Real User
2020-12-23T23:10:23Z
Dec 23, 2020
Technical support is an area that needs improvement. The contact numbers should be readily available so that we can call to get support as required. In the future, I would like to see a single-click installation.
Senior Solutions Architect/ Software Architect at a comms service provider with 51-200 employees
Real User
2020-12-22T20:05:08Z
Dec 22, 2020
The graphics in the past have not been ideal. We have several areas where we feel they could improve in terms of being a little bit more flexible. One is implementation. Even though we customized it, there were some specific things we had to do with the image by itself. The management integration was challenging as well. It requires a lot of work on our end. We were creating our own way to integrate things specifically with specific tools. There's not really an ease of management out-of-the-box option for integration. We needed to become a little bit creative to solve that ourselves. The scalability of the solution itself is not as we expected. Being on the cloud, it should be easy to scale, however, it's not. There is no SDC versioning. There's no virtual control for pipelines. We have to build several pipelines for several flows, yet there's not a virtual control to generate them. There's no Python SDK. We need to generate our own scripts and upload them and put them there. However, there's not a realistic case that we can get connected to them. On top of that, the API sets that are provided are very limited. They are not as rich as others. You cannot do much with them.
One specific feature that is missing from Airflow is that the steps of your workflow are not pipelined, meaning the stageless steps of any workflow. Not every workflow can be implemented within Airflow. For example, Step 1 of my workflow will have output which I definitely want to automatically be provided as an input to my Step 2. At the workflow level, we want to have common state management where, across steps, we'll be able to reach the state information. Right now, we're using an external state repository to maintain the state. If Airflow could come up with some kind of implementation, where not every step of the pipeline is an independent step, that would be helpful. I would like it if a part of the output of your previous steps could be Apache input for your next step. That kind of pipeline is missing. When we consider other products like jBPM, Camunda, or Cadence, they have the concept of pipelining. I would also like to see support for more platforms, in terms of programming BPMs. Cadence supports Golang and Java. Legacy components can be from any platform, so if they could provide more client support for Java client library and Golang, that would be helpful. I want it to program in Java.
There are some drawbacks to this solution. The code does not cover all tasks in the data warehouse automation process. Currently , in production, we have a large installation with a complex workflow that includes hundreds of tasks. Most of them are dispatched by existing engine, but not all. For example, sometimes we need to create cycles in our workflow but we are not able to, because Airflow supports only Direct Acyclic Graphs ( DAGs ) We need to develop our workflow description and notations because out of the box, Apache Airflow does not provide some features that are needed. It is our understanding that it is limited by design. We will wait for the latest 2.0 version, as it is awaited to be much more mature than the 1.8-1.10 version. We believe that it will be better. There should be some improvement made to the Doc Management features from within the UI. They should think about Outlook integration, which should be out of the box, and the object model should be expanded to support cyclic graphs inside the workflow.
Apache Airflow is an open-source workflow management system (WMS) that is primarily used to programmatically author, orchestrate, schedule, and monitor data pipelines as well as workflows. The solution makes it possible for you to manage your data pipelines by authoring workflows as directed acyclic graphs (DAGs) of tasks. By using Apache Airflow, you can orchestrate data pipelines over object stores and data warehouses, run workflows that are not data-related, and can also create and manage...
Running frequent jobs, such as every minute or five minutes, is not appropriate for Airflow. It is not suitable for real-time ETL tasks. Stream jobs are not its strength.
There is a minor issue with the manual work in Airflow, as everyday activities are managed manually. There is no dashboard for us to check all the Directed Acyclic Graphs (DAGs); a dashboard would help us analyze the work better.
The UI is a little bit outdated according to modern standards. The UI can be enhanced to support some modern standards. Maybe small things such as dark mode and some proper aesthetics can be implemented.
Apache Airflow improved workflow efficiency, but we had to find solutions for large workflows. For instance, a monthly workflow with 1200 jobs had to be split into three to four pieces as it struggled with large job numbers. Loading a workflow with 500 jobs could take 10 minutes, which wasn't acceptable. The most important feature Apache Airflow lacks is support for external configuration files. All classical schedulers like Control-M or Automic allow you to load workflow definitions from YAML, XML, or JSON files, but the tool requires you to write Python programs. Airflow only supports external configuration for variables, not for workflows. To address this, I created a YAML configuration file that I converted into Python programs, but this functionality is missing from Apache Airflow itself. All of its competitors have this feature. In Control-M, Automic, and IBM's scheduler, you can load workflows from XML, JSON, or YAML files.
I have not come across any challenges associated with the product. The scripts that we use in our company refer to the package dependencies in Python, but those are lost when Apache Airflow starts running for a particular test. The in-built package dependencies in Python have some issues in Apache Airflow, making it an area that needs improvement.
Airflow should support the dynamic drag creation.
The solution lacks certain features. We cannot run real-time jobs in the solution. It supports only batch jobs. If we are using ETL pipelines, it can either be a batch job or a real-time job. Real-time jobs run continuously. They are not scheduled. Apache Airflow is for scheduled jobs, not real-time jobs. It would be a good improvement if the solution could run real-time jobs. Many connectors are available in the product, but some are still missing. We have to build a custom connector if it is not available. The solution must have more in-built connectors for improved functionality.
Apache Airflow should have better integration with cloud platforms.
The automation capabilities could be improved; a visual workflow designer and a graphical tool to reduce coding would be very helpful. But for now, it's sufficient for our simple workflows.
Enhancements become necessary when scaling it up from a few thousand workflows to a more extensive scale of five thousand or ten thousand workflows. At this point, resource management and threading, become critical aspects. This involves optimizing the utilization of resources and threading within the Kubernetes VM ecosystem.
I have some issues with the solution's communication. The solution uses the same database or data set. Sometimes, we consume the same data and send it to a different place when doing a different DAG. When using the UI, I want to see that we use the same data set more than once.
Airflow is a pipeline for transferring code by clients, but for experimental model experiments, Apache Airflow does not have any solution. There is a need for more features on experimental evolution steps.
One improvement could be the inclusion of a plugin with a drag-and-drop feature. This graphical feature would be beneficial when dealing with connectivity and integration services like connecting to BigQuery or other systems. As a first-time user, although the documentation is available, it would be more user-friendly to have a drag-and-drop interface within the portal. Users could simply drag and drop components to create a pseudo-code, making it more flexible and intuitive. Therefore, I suggest having a drag-and-drop feature for a more user-friendly experience and better code management. Moreover, for admins, there should be improved logging capabilities because Apache Airflow does have logging, but it's limited to some database data. It would be better if everything goes into the server where it's hosted. Probably on the interface level. If something goes well for the developers.
There is an area for improvement in onboarding new people. They should make it simple for newcomers. Else, we have to put a senior engineer to operate it.
Adding more automated components in Apache Airflow for basic things like exporting the data would be helpful. Apache Airflow is not that easy to use, but we have gotten used to it.
We have faced scenarios where Apache Airflow becomes non-responsive, leading to job failures. To resolve such situations, we had to manually reboot Apache Airflow since it doesn't provide an option to restart within the application. This necessitated modifying some configurations to initiate a restart of all Apache Airflow components. Although Apache Airflow is generally dependable, it may occasionally encounter glitches that can disrupt production flows and batches.
The following should be improved: * Dashboards * Security * Airflow web UI * Telemetry for logging, monitoring, and alerting purposes * Documentation
The solution can be improved by creating a tool that allows us to do these kinds of things graphically instead of just writing scripts. Hence, the graphical user interface can be improved.
Apache Airflow could be improved with the addition of more frameworks.
The solution could be improved by simplifying the integration process and providing access to its support team to guide integration.
Apache Airflow could be improved by integrating some versioning principles. Currently, we have to swap some tags in our flow. It would be interesting if we can check the product and version all of the product at the same time comparing what scripts have changed from last year to this year, or last month to this month. For example, we have a flow for one project, to version it we need to check it one by one to identify which tags changed and which scripts changed. All of these need to be done manually.
I am using a Celery Executor and I find that it crashes and I can't see any logs. I can only assume that it's a memory issue and have to blindly restart until eventually, it starts up again. One of the use cases is triggered by input rather than a batch process. For example, we receive a batch of data, it goes through tasks one, two, and three, and a new batch comes in, each subsequent task should be operating on just that data from the prior task. I am used to working on it as the output gets written to a table and then the next task selects all from that upstream table. It could be coded where you are only writing the data for that portion of the task. It could handle state machines and state changes as opposed to the batch proxy. I would like to see it more friendly for other use cases.
We're currently using version 1.10, but I understand that there's a lot of improvements in version 2. In the earlier version that we're using, we sometimes have problems with maintenance complexity. Actually using Airflow is okay, but maintaining it has been difficult. When something fails, it's not that easy to troubleshoot what went wrong. Sometimes the UI becomes really slow and there's no easy way to diagnose the problem. For the most part, we have had to learn through trial and error how to operate it properly. The UI is also not that attractive, and I feel that the user experience isn't that nice. Version 2 is supposedly better, but without having tried it, I could suggest more improvements in the visual UI. We want to do the ETL as code, but having a nice visual UI to facilitate this process would be great. Because that means we can also rely on non-technical staff, rather than just the three solid technical staff we have here. If there were better features for the UI, like drag-and-drop, then we could expand its use to more of our team.
The dashboard is connected into the BPM flow that could be improved.
Technical support is an area that needs improvement. The contact numbers should be readily available so that we can call to get support as required. In the future, I would like to see a single-click installation.
The graphics in the past have not been ideal. We have several areas where we feel they could improve in terms of being a little bit more flexible. One is implementation. Even though we customized it, there were some specific things we had to do with the image by itself. The management integration was challenging as well. It requires a lot of work on our end. We were creating our own way to integrate things specifically with specific tools. There's not really an ease of management out-of-the-box option for integration. We needed to become a little bit creative to solve that ourselves. The scalability of the solution itself is not as we expected. Being on the cloud, it should be easy to scale, however, it's not. There is no SDC versioning. There's no virtual control for pipelines. We have to build several pipelines for several flows, yet there's not a virtual control to generate them. There's no Python SDK. We need to generate our own scripts and upload them and put them there. However, there's not a realistic case that we can get connected to them. On top of that, the API sets that are provided are very limited. They are not as rich as others. You cannot do much with them.
One specific feature that is missing from Airflow is that the steps of your workflow are not pipelined, meaning the stageless steps of any workflow. Not every workflow can be implemented within Airflow. For example, Step 1 of my workflow will have output which I definitely want to automatically be provided as an input to my Step 2. At the workflow level, we want to have common state management where, across steps, we'll be able to reach the state information. Right now, we're using an external state repository to maintain the state. If Airflow could come up with some kind of implementation, where not every step of the pipeline is an independent step, that would be helpful. I would like it if a part of the output of your previous steps could be Apache input for your next step. That kind of pipeline is missing. When we consider other products like jBPM, Camunda, or Cadence, they have the concept of pipelining. I would also like to see support for more platforms, in terms of programming BPMs. Cadence supports Golang and Java. Legacy components can be from any platform, so if they could provide more client support for Java client library and Golang, that would be helpful. I want it to program in Java.
There are some drawbacks to this solution. The code does not cover all tasks in the data warehouse automation process. Currently , in production, we have a large installation with a complex workflow that includes hundreds of tasks. Most of them are dispatched by existing engine, but not all. For example, sometimes we need to create cycles in our workflow but we are not able to, because Airflow supports only Direct Acyclic Graphs ( DAGs ) We need to develop our workflow description and notations because out of the box, Apache Airflow does not provide some features that are needed. It is our understanding that it is limited by design. We will wait for the latest 2.0 version, as it is awaited to be much more mature than the 1.8-1.10 version. We believe that it will be better. There should be some improvement made to the Doc Management features from within the UI. They should think about Outlook integration, which should be out of the box, and the object model should be expanded to support cyclic graphs inside the workflow.