Setting up pipelines is challenging, especially with version control and testing requirements. While the initial setup is easy, it doesn't accommodate more complex development needs. You might feel hesitant about changing pipelines that are already running and processing business-critical data due to limited versioning and testing capabilities.
I see scope for improvement in the drag-and-drop feature of AWS Glue. Beginners need additional support as it currently lacks some features required for complex transformations, often necessitating custom Python coding.
It is comparatively difficult to learn the tool and remember its syntax. Sometimes, I face issues integrating the solution with third-party services or services that are not part of Glue. Such integrations take a lot of time, and not much content is available on the internet about them.
AVP at a manufacturing company with 10,001+ employees
Real User
Top 5
2024-06-21T06:35:50Z
Jun 21, 2024
The drawbacks associated with the product stem from the fact that, based on the data volume, it can become very costly. There is a huge cost if the source system is not properly designed. If the changes are frequent and not valid, then, initially, you will use huge amounts of data in the ETL. The biggest challenges are associated with AWS Glue's costs, and it takes one-third of my entire pipeline cost.
Since AWS Glue is not like an enterprise ETL tool, we need to put quite a lot of effort into customization. The solution has a visual editor, but most ETL transformations cannot be implemented or constructed using that. We always have to do a script. The solution's visual ETL tool is of no use for actual implementation.
There are output limitations and configuration of its three parts. There was a lot of trial and error that we had to go through. It is not clear how the partition discovery would have been affected by more data coming in. We've made some expensive mistakes that could have been avoided if tutorials or easy documentation with FAQs had been available. There is documentation, but it doesn't cover everything. There are three specific partition changes, and AWS Glue is tightly tied to Athena. We don't have much flexibility in managing Athena. AWS Glue could integrate with an AI model or a more advanced version that processes chat-based inputs rather than configuration. This would align it more closely with the functionalities of chat-based interfaces, making it easier to adopt.
Owner at a tech services company with 51-200 employees
Real User
Top 5
2023-09-01T19:46:13Z
Sep 1, 2023
One area that could be improved is the ETL view. The drag-and-drop interface is not as user-friendly as some other ETL tools. Additionally, AWS Glue can sometimes be slow, especially when processing large datasets. Also, I couldn't directly use bucketed data. With Elastic Glue, you had to convert your data frames into the correct format before connecting them using the drag-and-drop interface. That's something I didn't like because the conversion process wasn't straightforward. In future releases, I would like to see a feature that could trigger a Glue pipeline using an API or something similar.
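On the request above for triggering a Glue pipeline through an API: the Glue service API includes a StartJobRun operation, which boto3 exposes as `start_job_run`. The following is a minimal sketch, not a definitive implementation. The helper name `trigger_glue_job`, the job name, and the `--run_date` argument are hypothetical and chosen only for illustration; the client is passed in so the helper can be exercised without AWS credentials.

```python
from typing import Any, Dict, Optional


def trigger_glue_job(
    glue_client: Any,
    job_name: str,
    arguments: Optional[Dict[str, str]] = None,
) -> str:
    """Start one run of an existing Glue job and return its run ID.

    `glue_client` is expected to behave like boto3.client("glue").
    """
    kwargs: Dict[str, Any] = {"JobName": job_name}
    if arguments:
        # Job arguments are a string-to-string mapping; by convention
        # the keys are prefixed with "--".
        kwargs["Arguments"] = arguments
    response = glue_client.start_job_run(**kwargs)
    return response["JobRunId"]


# Hypothetical usage (needs boto3, AWS credentials, and an existing job):
#
#   import boto3
#   client = boto3.client("glue")
#   run_id = trigger_glue_job(client, "nightly-orders-etl",
#                             {"--run_date": "2024-06-21"})
```

Passing the client in, rather than creating it inside the function, keeps the sketch testable with a stub and leaves region and credential handling to the caller.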
AWS Glue Studio has undergone a lot of enhancements in the last couple of months. The solution would benefit from a more user-friendly interface with features like drag and drop for building transformations. It would also be a good improvement if the product itself supported different kinds of transformations so that the pipelines we want to create could be built easily; right now, we have to write code to do so in our company. Only people who can code, either in Java or Python, can use the product freely. Those who don't know Java or Python might find using AWS Glue difficult. AWS has pricing for spot instances that reduces the cost substantially, but that is not available for AWS Glue. Spot instance pricing is offered for products like EC2, and if the same were introduced for AWS Glue, the pricing could drop substantially.
Senior Software Developer at a computer software company with 10,001+ employees
Real User
Top 10
2023-07-31T17:41:50Z
Jul 31, 2023
In terms of performance, if they can further optimize the execution time for serverless jobs, it would be a welcome improvement. Faster code execution would be beneficial. If AWS could enhance the serverless execution capabilities, like increasing CPU, RAM, and processing speed, that would be great.
While working on AWS Glue, I could not find any training material for it. Although it's not a problem with the product, the solution could include better documentation.
We face performance issues when using AWS Glue for data transformation and integration. It takes almost three to four hours to execute single transformations, which is a lot. We want to improve the performance to meet customer requirements. Mainly, I am focused on improving the performance aspect because the customer is keen on this improvement.
Consultant Data junior at a computer software company with 51-200 employees
Consultant
Top 20
2023-03-09T22:01:42Z
Mar 9, 2023
The product has only a few built-in transformations; additional custom-building transformations could be improved in the next release. For additional features, I would like documentation on the equivalent of legacy ETL tools and their equivalent in AWS to make it easier for users to migrate their ETL processing to the cloud. It would save time and help users find the best transformation or solution to satisfy their new business needs.
AWS Glue had some issues, which required optimization, particularly in terms of the number of workers you deploy, and that's where costing comes in. Cost-wise, AWS Glue is expensive, so that's an area for improvement. My company did some modifications, which turned out to be successful, so overall, the solution works fine. Even though there is a backup, you need to know what's happening. You need to understand why there's a failure. AWS Glue doesn't provide the information, so my company uses its logs. The development team also doesn't have specific answers because the team is still playing around with the process, which means the company is still trying to figure out other areas for improvement in AWS Glue. The process for setting up the solution was also complex, which is another area for improvement. AWS should provide help during migration and assist its users. Otherwise, it's a nightmare.
Manager at a construction company with 51-200 employees
Real User
Top 20
2023-01-19T18:04:06Z
Jan 19, 2023
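Several reviewers above tie cost to the number of workers deployed and to job runtime. As a rough illustration only, Glue Spark jobs are billed by DPU-hours, so a back-of-the-envelope estimate multiplies workers, DPUs per worker, runtime, and the hourly DPU rate. The function name and the rate of 0.44 USD per DPU-hour used below are assumptions for the example; actual rates vary by region and over time, and the sketch ignores minimum billing durations.

```python
def estimate_glue_job_cost(
    workers: int,
    dpu_per_worker: float,
    runtime_hours: float,
    price_per_dpu_hour: float,
) -> float:
    """Return an approximate cost for one job run.

    cost = workers * DPUs per worker * hours * price per DPU-hour
    """
    if min(workers, dpu_per_worker, runtime_hours, price_per_dpu_hour) < 0:
        raise ValueError("all inputs must be non-negative")
    dpu_hours = workers * dpu_per_worker * runtime_hours
    return dpu_hours * price_per_dpu_hour


# Assumed rate, for illustration only; check current regional pricing.
ASSUMED_PRICE = 0.44

# A 3.5-hour run on 10 one-DPU workers ...
slow_run = estimate_glue_job_cost(10, 1.0, 3.5, ASSUMED_PRICE)
# ... versus 20 workers, assuming the runtime halves as a result.
fast_run = estimate_glue_job_cost(20, 1.0, 1.75, ASSUMED_PRICE)
print(round(slow_run, 2), round(fast_run, 2))
```

Under this simple model, doubling the workers costs the same only if the runtime actually halves; if the job does not scale that well, the extra workers increase the bill.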
In general, I would like to see documentation on the limitations on which libraries you can actually pull in when you are running Python. The additional Python Jupyter Notebook has been nice. But generally speaking, you cannot import every library. You can import pandas now and you can use photos, but you cannot import a lot of the other sorts of statistical libraries. That is an issue currently. I would like to see a more robust interface on the no-code side. It would be nice to be able to split cells.
CEO - Founder / Principal Data Scientist / Principal AI Architect at Kanayma LLC
Real User
2022-11-25T20:48:52Z
Nov 25, 2022
The mapping area and the use of the data catalog from Glue could be better. I would say those two are the main things we'd like to see improvements on. The solution needs support for big data. As I understand it, Glue is based on Lambdas and Lambdas have some limitations as far as running them continuously. Sometimes they get dropped, and they have to be reinitialized.
The interface for AWS Glue could improve; it does not show a lot of detail. You can write the code in PySpark or in Scala, which is a big advantage, but it is only easy to use for a developer. It will be difficult for new users to enter the cloud environment. Business users who want to run their own graphs will not be able to use features such as running Spark code inside AWS Glue, which will be too complex for them.
The monitoring is not that good. We'd like job progress to be clearer. Right now, the way we can view it is not that good. The issue is that it is mostly Python or Scala code based. The UX is lacking. There is a bit of a learning curve, particularly during the setup process. More connectors should be included.
Data Engineer at a tech services company with 201-500 employees
MSP
2022-07-01T09:23:35Z
Jul 1, 2022
There are a couple of issues with AWS Glue. First, the AWS Console randomly logs off, which disrupts coding. Second, if there's a cluster-related configuration, we have to set up worker nodes, which is quite a headache when processing a large amount of data. In the next release, AWS Glue should include more transformations in AWS Glue Studio.
Data Engineer | Developer at Sakshath Technologies
Real User
2022-06-21T13:28:38Z
Jun 21, 2022
The technical support for this solution could be improved. In future, we would like to connect more services like Athena or Kinesis to help control more loads of data.
Sr. Data Engineer at a tech services company with 5,001-10,000 employees
MSP
2022-06-16T15:42:50Z
Jun 16, 2022
It would be better if it were more user-friendly. The interesting thing we found is that it was a little strange at the beginning. The way Glue works is not very straightforward. After trying different things, for example, we used just the console to create jobs. Then we realized that things were not working as expected. After researching and learning more, we realized that even though the console creates the script for the ETL processes, you need to modify or write your own script in Spark to do everything you want it to do. For example, we are pulling data from our source database and our application database, which is in Aurora. From there, we are doing the ETL to transform the data and write the results into Redshift. But what was surprising is that it's almost like whatever you want to do, you can do it with Glue because you have the option to put together your own script. Even though there are many functionalities and many connections, you have the opportunity to write your own queries to do whatever transformations you need to do. It's a little deceiving that some options are supposed to work in a certain way when you set them up in the console, but then they are not exactly working the right way or not as expected. It would be better if they provided more examples and more documentation on options.
.NET Full-Stack developer at a tech services company with 201-500 employees
Real User
2021-10-21T11:50:32Z
Oct 21, 2021
When there is a need to configure connections to different database sources in respect of the target, it would be good if it were easier to deal with roles. I am referring to the need to configure connections in a different target process, something which would require a certain time outlay for configuring VPC and checking that everything is okay, in respect of the creation of required roles. It would save time were this process to be made easier and more user friendly. The technical support depends on the type of question, whether there is a need to understand additional inter-related information on multiple levels. Overall, I consider the technical support to be fine, although the response time could be faster in certain cases.
The crucial problem with AWS Glue is that it only works with AWS. It is not an agnostic tool like Pentaho. In PowerCenter, we can install the forms from Google and other vendors, but in the case of AWS Glue, we can only use AWS.
Team Lead at a financial services firm with 5,001-10,000 employees
Real User
2020-10-14T06:36:55Z
Oct 14, 2020
Currently, it supports only two languages in the background: Python and Scala. From our customization point of view, it would be helpful if it can also support Java in the background.
Senior Software Engineer at a consumer goods company with 10,001+ employees
Real User
2020-09-03T07:49:46Z
Sep 3, 2020
The start-up time is really high right now. For instance, when you start up a new job, you have to wait for five or eight minutes before it starts. If the start-up time is reduced to one or two minutes, it will be great. It will be better to have a direct linkage to Redshift in AWS. If we can use data catalogs from Redshift, it will be so easy to create some data catalogs. Currently, we can only use data catalogs from S3.
AWS Glue is a serverless cloud data integration tool that facilitates the discovery, preparation, movement, and integration of data from multiple sources for machine learning (ML), analytics, and application development. The solution includes additional productivity and data ops tooling for running jobs, implementing business workflows, and authoring.
AWS Glue allows users to connect to more than 70 diverse data sources and manage data in a centralized data catalog. The solution facilitates...
AWS Glue should be more reliable and faster in processing. Enhancing the speed of data processing would be beneficial.
The solution’s technical support could be improved.
I have encountered challenges with multi-region support.
The product is expensive for data streaming compared to EMR. This area needs improvement.
The solution’s stability could be improved.
The solution could be cheaper. The price of the solution is an area that needs improvement.
I would like to see stable libraries; at the moment, they are not there.
The price of the solution could improve.
There should be more connectors for different databases.
There is a learning curve to this tool.