Partner / Head of Data & Analytics at Intelligence Software Consulting
Real User
Top 5
May 31, 2024
There are still libraries missing, and it could offer more capabilities for machine learning. It could also have a user-friendly interface for pipeline configuration, deployment, and monitoring.
One of the ways to interact with Flink is through PyFlink, a Python API specifically designed for writing Flink code, so you don't have to use Java or Scala directly. It provides a simpler and more accessible way to write Flink jobs compared to the Java or Scala APIs. However, PyFlink does not expose everything the Java API does, so there are some limitations to what you can do with it, and this is an area for improvement. Still, it is a good choice for users who are not familiar with Java or Scala.
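As a rough illustration of what the reviewer is describing, here is a minimal PyFlink sketch (assuming PyFlink is installed via `pip install apache-flink`; the collection data and job name are invented for the example):

```python
from pyflink.datastream import StreamExecutionEnvironment

# Create the streaming environment (PyFlink's equivalent of the Java API entry point).
env = StreamExecutionEnvironment.get_execution_environment()

# A toy in-memory source; in a real pipeline this would be Kafka, a file, etc.
ds = env.from_collection([1, 2, 3, 4, 5])

# Plain Python lambdas instead of Java/Scala MapFunction classes.
ds.map(lambda x: x * 2) \
  .filter(lambda x: x > 4) \
  .print()

env.execute("pyflink_sketch")
```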
The issue we had with Flink was that when you had to apply the schema to the input data stream, it had to be done directly in code; the XLS format in which the schema was stored had to be replicated in the Python code. If the schema changes, you have to redeploy the Flink job because the tasks and jobs are already running. That's one disadvantage. Another was a restriction with Amazon's CloudFormation templates, which don't allow direct deployment into a private subnet. You have to deploy into the public subnet and then, from the Amazon console, specify a different private subnet, which requires a lot of settings. In general, the integration with Amazon products was not good and was very time-consuming. I'd like to think that has changed.
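For illustration, here is a hedged sketch of the pattern the reviewer describes, with the schema hard-coded in a Flink SQL DDL statement (the Kafka topic, broker address, and column names are invented, and the Kafka connector jar is assumed to be on the classpath); any schema change means editing and redeploying this code:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# The schema is baked into the job code: adding or renaming a column
# requires changing this DDL and redeploying the running job.
t_env.execute_sql("""
    CREATE TABLE input_events (
        user_id STRING,
        amount  DOUBLE,
        ts      TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json'
    )
""")
```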
Partner / Head of Data & Analytics at Intelligence Software Consulting
Real User
Top 5
Mar 3, 2021
One way to improve Flink would be to enhance integration between the different ecosystems. For example, there could be more integration with other big data vendors and platforms, similar in scope to how Apache Flink works with Cloudera. Apache Flink is part of the same ecosystem as Cloudera, and for batch processing it's actually very useful, but for real-time processing there could be more development of the big data capabilities across the various ecosystems out there. I am also looking for more possibilities in terms of what can be implemented in plain containers rather than in Kubernetes; I think our architecture would work really well with more options available to us in this sense. Finally, it's a challenge to find people with the appropriate skills for using Flink. There are a lot of people who know what should be done better in big data systems, but there are still very few people with Flink capabilities.
Head of Data Science at an energy/utilities company with 10,001+ employees
Real User
Feb 2, 2021
I am using the Python API and I have found the solution to be underdeveloped compared to others. There needs to be better integration with notebooks to allow for more practical development. Additionally, there are no managed services; for example, on Azure, you have to set everything up yourself. In a future release, they could make the error descriptions clearer.
Software Development Engineer III at a tech services company with 5,001-10,000 employees
Real User
Nov 8, 2020
Flink has become a lot more stable, but the machine learning library is still not very flexible. Some models cannot simply be plugged in and used. To use some of the libraries and models, I need a Python library, because there might be pre-processing or post-processing requirements, or even just parsing and using the models. The lack of full Python support is something they could work on in the future.
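As a sketch of the kind of Python pre-processing the reviewer means, here is a hypothetical PyFlink scalar UDF (the function name, column names, and normalization constants are all invented); this works in PyFlink's Table API, though, as the reviewer notes, Python coverage is narrower than Java's:

```python
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

# Hypothetical pre-processing step before handing a feature to a model:
# scale a raw value into the range the model was trained on.
@udf(result_type=DataTypes.DOUBLE())
def normalize(raw: float) -> float:
    return (raw - 50.0) / 10.0  # invented mean/stddev for the example

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("normalize", normalize)

# The UDF can then be used in SQL like any built-in function:
# SELECT normalize(amount) AS feature FROM input_events
```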
Principal Software Engineer at a tech services company with 1,001-5,000 employees
Real User
Oct 21, 2020
In terms of improvement, there should be better reporting. You can integrate Flink with reporting solutions, but Flink doesn't offer reporting itself; they're more about the processing side, and low-latency reporting is out of their scope. As far as low latency is concerned, you can integrate with other backend solutions as well; they have that flexibility. The APIs are good enough, and its in-memory processing is so fast that you can work with the data very quickly.
Sr. Software Engineer at a tech services company with 10,001+ employees
Real User
Oct 19, 2020
In Flink, maintaining the infrastructure is not easy. You have to design the architecture well. If you want to scale to larger volumes of streaming data, you need good machines. You need a resilient architecture so that if something fails, you can recover with minimum downtime. You should have good storage systems to store and retrieve intermediate Flink state (in the case of stateful applications). Basically, you face all the problems that come with a distributed system, so you have to have all that infrastructure in place for it to perform well. The best way is to look at the use cases you want to support 5-10 years ahead and design the architecture around Flink accordingly.
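As a hedged sketch of the operational knobs involved, here is a PyFlink configuration fragment covering checkpointing, RocksDB state storage, and restart behavior (the S3 path is invented, and the exact configuration key names vary between Flink versions; these are the pre-1.18 names):

```python
from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment

conf = Configuration()
# Keep intermediate state in RocksDB rather than on the JVM heap.
conf.set_string("state.backend", "rocksdb")
# Durable checkpoint storage so a failed job can be recovered (path is invented).
conf.set_string("state.checkpoints.dir", "s3://my-bucket/flink-checkpoints")
# Restart a failed job a few times with a delay between attempts.
conf.set_string("restart-strategy", "fixed-delay")
conf.set_string("restart-strategy.fixed-delay.attempts", "3")
conf.set_string("restart-strategy.fixed-delay.delay", "10 s")

env = StreamExecutionEnvironment.get_execution_environment(conf)
env.enable_checkpointing(60_000)  # checkpoint every 60 seconds
```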
Lead Software Engineer at a tech services company with 5,001-10,000 employees
Real User
Oct 13, 2020
The TimeWindow feature could be improved. The timing and windowing semantics changed a bit in 1.11, where they introduced the new watermark API; a watermark basically associates the data in the stream with a timestamp. The documentation can be consulted, but they have updated the rest of the documentation and not the testing documentation, so we had to experiment manually to understand a few concepts. Integrating Apache Flink with metric services or failure-handling tools also needs some updating, or else in-depth knowledge of those tools is expected before integrating. Consider a use case where you want analytics on how much data you have processed and how much failed: Prometheus is one of the common metric tools supported by Flink out of the box, along with other metric services, and that documentation is straightforward. But there is a learning curve with the metric services, which can consume a lot of time if you are not well versed in those tools. Flink provides basic documentation for failure handling, such as restart on task failure, fixed-delay restart, etc.
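As a minimal sketch of assigning watermarks with the post-1.11 API (the five-second out-of-orderness bound and the record layout are invented for the example):

```python
from pyflink.common import Duration
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment


class EventTimeAssigner(TimestampAssigner):
    # Records are assumed to be (event_time_millis, payload) tuples.
    def extract_timestamp(self, value, record_timestamp) -> int:
        return value[0]


env = StreamExecutionEnvironment.get_execution_environment()
ds = env.from_collection([(1000, "a"), (2000, "b"), (1500, "c")])

# Watermarks tell Flink how far event time has progressed, here allowing
# up to 5 seconds of out-of-order data (the bound is invented for the example).
watermarked = ds.assign_timestamps_and_watermarks(
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(5))
    .with_timestamp_assigner(EventTimeAssigner())
)
```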
Sr Software Engineer at a tech vendor with 10,001+ employees
Real User
Oct 13, 2020
We have a machine learning team that works with Python, but Apache Flink does not have full support for the language. We needed to use Java to implement some of our job posting pipelines.
Software Architect at a tech vendor with 501-1,000 employees
Real User
Oct 7, 2020
The state maintains checkpoints, and they use RocksDB or S3. They are good, but sometimes performance is affected when you use RocksDB for checkpointing. With Apache Storm, we could write Python bolts/applications inside the Storm code, since it supports Python as a programming language, but with Flink, the Python support is not that great. When we do machine learning or data science work, we want to integrate the data science or machine learning pipeline with our real-time pipeline, and most of that work is in Python. It was very easy with Storm: Storm supports Python natively, so integration was easy. But Flink is mostly Java, and integrating Python with Java is difficult, so there's no direct integration. We needed to find an alternative way: we created an API layer in between, so the Java and Python layers communicated through an API. We called the data science or ML models through that API, which runs in Python, while Flink runs in Java. We would like to see another way to run this; what exists currently is not that great, and this is an area where we would like to see improvement.
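As a rough sketch of the bridge pattern the reviewer describes, and not their actual implementation, here is a minimal Python-side model endpoint using Flask (the route, port, and scoring logic are all invented); the Java Flink job would call it over HTTP from within an operator:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/score", methods=["POST"])
def score():
    # Hypothetical scoring endpoint: the Flink (Java) pipeline POSTs a
    # feature vector here and gets the model output back as JSON.
    features = request.get_json()["features"]
    prediction = sum(features) / len(features)  # stand-in for a real ML model
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```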
Apache Flink is an open-source batch and stream data processing engine. It can be used for batch, micro-batch, and real-time processing. Flink offers a programming model that combines the benefits of batch processing and streaming analytics by providing a unified programming interface for both, allowing users to write programs that seamlessly switch between the two modes. It can also be used for interactive queries.
Flink can be used as an alternative to MapReduce for executing...
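To illustrate the unified interface, here is a minimal PyFlink Table API sketch (the table contents are invented); switching the environment settings from streaming to batch mode is the only change needed to run the same query in either mode:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

# Swap in_streaming_mode() for in_batch_mode() and the same query
# runs as a batch job instead of a streaming job.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

table = t_env.from_elements(
    [("alice", 3), ("bob", 5), ("alice", 2)],
    ["name", "clicks"],
)

# Identical Table API code in both modes.
result = table.group_by(col("name")).select(col("name"), col("clicks").sum)
result.execute().print()
```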
Apache Flink should improve its data capability and data migration.
Apache Flink's documentation should be available in more languages.
The solution could be more user-friendly. The debugging system could be more suitable in the new release.
There is a learning curve. It takes time to learn. The initial setup is complex, it could be simplified.