Senior Data Platform Manager at a manufacturing company with 10,001+ employees
Real User
Top 5
Apr 10, 2024
We often faced problems, especially with SAP ERP. We struggled because many columns weren't integers or primary keys, which StreamSets couldn't handle, so we had to restructure our data tables, which was painful. Pipeline failures were also common, and data drift wasn't addressed, which made things worse. Licensing was another issue we encountered.
Director Data Engineering, Governance, Operation and Analytics Platform at a financial services firm with 10,001+ employees
Real User
Top 20
Jul 21, 2023
StreamSets should provide a mechanism for performing data quality assessment while data is being moved from a source to a target: the ability to validate the data against various data rules and then, when a data quality check fails, send alerts or information that helps people understand the validation issues.
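As a rough illustration of the rule-based validation this reviewer describes (illustrative Python, not an actual StreamSets feature; the rules and alert hook are hypothetical):

    # Illustrative sketch of rule-based data quality checks during movement.
    # Not a StreamSets API; the rules and the alert hook are hypothetical.

    # Each rule: (name, predicate over a record dict)
    RULES = [
        ("customer_id is present", lambda r: r.get("customer_id") is not None),
        ("amount is non-negative",
         lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0),
    ]

    def send_alert(message: str) -> None:
        # Placeholder: in practice this could post to email, Slack, or a webhook.
        print(f"ALERT: {message}")

    def validate(records):
        """Yield records that pass every rule; alert on each failure."""
        for record in records:
            failures = [name for name, check in RULES if not check(record)]
            if failures:
                send_alert(f"record {record!r} failed: {', '.join(failures)}")
            else:
                yield record

    if __name__ == "__main__":
        sample = [{"customer_id": 1, "amount": 10.0},
                  {"customer_id": None, "amount": -5}]
        print(list(validate(sample)))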
The design, or the way they have set up the protocol, is pretty good. One thing I would like added is the ability to manually enter data. The way the solution currently works, we don't have the option to manually change the data at any point in time. Sometimes we need to manually manipulate the data to make it more accurate in case our prior bifurcation filters are not good. It does not have that feature today; none of the solutions provides it, but it is the feature we are looking for. Being able to bifurcate the data or manually manipulate it at any point in time would be a game changer and would allow us to do everything we want with our data. Its initial setup could also be a bit easier.
When using Transformer for Snowflake, it's a bit complex to understand the transformation logic; you need someone with technical skills to handle it and to transform the data. On the plus side, Transformer for Snowflake is a serverless engine embedded within the platform, so there is no need for maintenance. Having a serverless engine makes it easy for any enterprise not to think or worry about the cost of maintaining the software. The Data Collector in StreamSets, however, has to be set up properly. For example, a simple database configuration with MySQL requires the MySQL Connector to be installed first.
Sometimes, it is not clear at first how to set up nodes. A site with an explanation of how each node works would be very helpful. Also, it doesn't provide a very good user experience.
If the connection drops and the pipeline is restarted, it sometimes does not reconnect; that has room for improvement. The documentation is also inadequate: technical support does not regularly update the documentation or the knowledge base, which leads to discrepancies between the software and the documentation and makes it difficult to understand.
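For context, the reconnect behavior this reviewer is missing usually amounts to a retry-with-backoff loop, sketched here in illustrative Python (open_connection is a hypothetical placeholder for whatever source the pipeline reads from):

    # Illustrative retry-with-backoff reconnect loop; `open_connection` is a
    # hypothetical placeholder, not a StreamSets API.
    import time

    def connect_with_backoff(open_connection, attempts=5, base_delay=1.0):
        """Try to (re)connect, doubling the wait after each failure."""
        for attempt in range(attempts):
            try:
                return open_connection()
            except ConnectionError as exc:
                wait = base_delay * (2 ** attempt)
                print(f"connect failed ({exc}); retrying in {wait:.0f}s")
                time.sleep(wait)
        raise ConnectionError(f"gave up after {attempts} attempts")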
Chief Software Engineer at Appnomu Business Services
Real User
Top 10
Mar 24, 2023
There should be a concept of creating double variables; it's still missing. The loading mechanism also needs to be simplified; currently, it takes some time to get familiar with and understand it. Visualization and monitoring need to be improved and refined. For example, it is difficult to monitor a job and see what happened over the past seven days when a transfer occurred. The licensing model also has room for improvement; the solution is currently expensive.
The user interface requires some corrections to the menu settings, menu items, and report generation. Report generation also takes some time.
The software is very good overall. Areas for improvement are the error logging and the version history. I would like to see better, more detailed error logging information. Apart from that, I don't think much improvement is required, because the software and features are very good.
In terms of the product, I don't think there is much room for improvement because it is very good. One small area of improvement that is very much needed is the knowledge base. Sometimes it is not very clear how to set up a certain process or a certain node for a person who is using the platform for the first time. Some visual explanation, or visually appealing knowledge base content, would be very good. That is something I could have done with when I started using it, because I found it very difficult.
Product Marketer at a media company with 1,001-5,000 employees
Real User
Top 5
Jan 6, 2023
In terms of features, I don't have any complaints so far. But one area for improvement could be the cloud storage server speed, as we have faced some latency issues here and there.
Senior Network Administrator at an energy/utilities company with 201-500 employees
Real User
Top 20
Dec 1, 2022
The design experience is the bane of our existence because their documentation is not the best. Even when they update their software, they don't publish good information on how to update and change your pipeline configuration to conform to current best practices. We don't pay for the added support; we use the "freeware version." The user community, as well as the documentation they provide for the standard user, is difficult at best. However, we have a couple of people in-house who are experts in data analysis, and they have figured out how to use this tool. We have to have extremely skilled people go in and write the pipelines for this software because it's so complicated. The software works great for us, but there is an extremely steep learning curve because they don't provide a lot of information outside of their ridiculous support costs, which start at $50,000 a year. Also, the built-in data drift resilience for ETL operations requires a bunch of custom code development, which is somewhat difficult because you have to customize it a fair amount. I would also like a more user-friendly interface and better error-trap handling.
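As a rough picture of the custom drift handling this reviewer describes (illustrative Python, not StreamSets' built-in mechanism; the expected field set is hypothetical):

    # Illustrative sketch of custom data drift detection in an ETL step.
    # Not StreamSets' built-in mechanism; the expected schema is hypothetical.

    EXPECTED_FIELDS = {"order_id", "sku", "quantity", "price"}

    def classify_drift(record: dict):
        """Return (new_fields, missing_fields) for one incoming record."""
        fields = set(record)
        return fields - EXPECTED_FIELDS, EXPECTED_FIELDS - fields

    def process(records):
        for record in records:
            new, missing = classify_drift(record)
            if new or missing:
                # Route drifted records aside for review instead of failing the run.
                print(f"drift detected: new={sorted(new)} missing={sorted(missing)}")
                continue
            yield record

    if __name__ == "__main__":
        rows = [{"order_id": 1, "sku": "A", "quantity": 2, "price": 9.5},
                {"order_id": 2, "sku": "B", "quantity": 1, "price": 3.0, "color": "red"}]
        print(list(process(rows)))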
Sometimes, when we have large amounts of data that are stored very efficiently in Hadoop or Kafka, it is not very efficient to run them through StreamSets, due to its inefficiency and the resources it uses. Also, the hierarchy of names within the dropdowns and the drag-and-drop features are not familiar to users who do not have a technical or programming background. In those cases, the naming conventions are a challenge.
There are a few things that could be better. We create pipelines and jobs in StreamSets Control Hub. It is a great feature, but a way to organize the pipelines and jobs in Control Hub into a folder structure would be great; I submitted a ticket for this some time back. Certain features are also only available at certain stages. For example, HTTP Client has some great features when it is used as a processor, but those features are not available in HTTP Client as a destination. There could be some improvements on the group side as well. Currently, if I want to know which users are part of certain groups, it is not straightforward to see; you have to go to each and every user and check the groups he or she is part of. In case something goes wrong, we have to put in manual effort and go to each and every user account to check whether that user is part of a certain group.
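To give a sense of the manual effort involved, an audit script along these lines can invert the user-to-groups view into group-to-users; the REST endpoints and JSON fields below are hypothetical stand-ins, not the actual Control Hub API:

    # Hypothetical sketch: invert "user -> groups" into "group -> users".
    # The endpoint and JSON fields are illustrative stand-ins, not the
    # actual StreamSets Control Hub API.
    from collections import defaultdict
    import requests

    BASE = "https://controlhub.example.com/api"    # hypothetical
    HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical auth scheme

    def group_membership():
        members = defaultdict(list)
        users = requests.get(f"{BASE}/users", headers=HEADERS, timeout=30).json()
        for user in users:
            for group in user.get("groups", []):
                members[group].append(str(user["id"]))
        return members

    if __name__ == "__main__":
        for group, users in sorted(group_membership().items()):
            print(f"{group}: {', '.join(users)}")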
Senior Data Engineer at an energy/utilities company with 1,001-5,000 employees
Real User
Jun 9, 2022
One area for improvement is probably the GUI. It is pretty basic, and a lot of improvement is required there. In terms of security, from an architecture perspective, our organization is very strict when it comes to cybersecurity, and we have been struggling a bit because the platform has a few gaps. Those gaps are really relative to our organization's requirements; they are not gaps on StreamSets' side. Still, the solution could improve a lot by adding more features to the security model, which would help us. There are quite a few features we wanted. One is for SAP HANA: currently, we can only use a query to read data from SAP HANA. What we would like to see, as soon as possible, is the ability to read from multiple tables in SAP HANA. For example, if you have 100 tables in SQL Server or Oracle, you can just point StreamSets at the schema or the 100 tables and ingest the information. You can't do that with SAP HANA because StreamSets does not have a multi-table feature for it. A multi-table origin for SAP HANA would be helpful and is something we could use immediately.
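Until such an origin exists, a per-table query loop can approximate one. Here is a minimal sketch using SAP's hdbcli Python driver, with hypothetical connection details, assuming the SYS.TABLES catalog view is readable:

    # Minimal sketch: emulate a multi-table SAP HANA origin by issuing one
    # query per table. Connection details are hypothetical placeholders.
    from hdbcli import dbapi  # SAP HANA Python driver

    conn = dbapi.connect(address="hana.example.com", port=30015,
                         user="INGEST_USER", password="...")
    cursor = conn.cursor()

    # List the tables in one schema via the catalog.
    cursor.execute(
        "SELECT TABLE_NAME FROM SYS.TABLES WHERE SCHEMA_NAME = ?", ("SALES",))
    tables = [row[0] for row in cursor.fetchall()]

    for table in tables:
        cursor.execute(f'SELECT * FROM "SALES"."{table}"')
        for row in cursor.fetchall():
            pass  # hand each row to the ingestion step
    conn.close()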
The logging mechanism could be improved. When I build a pipeline, create a job out of it, and run it, it generates constant logs, so the logging mechanism could be simplified. Right now, it is a bit difficult to understand and filter the logs, and it takes some time. If I am just starting with StreamSets, everything is fine; however, if I want to dig into problems my pipeline ran into, it initially takes some time to get familiar with it and understand it. I also feel the visualization part could be simplified or enhanced a bit, so I can easily see what happened with my job seven days earlier and how many records it transmitted.
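As an illustration of the kind of log filtering this reviewer wants built in, a short script can pull recent errors out of a pipeline log; the "timestamp level message" line format assumed here is hypothetical and will vary by setup:

    # Sketch: filter a pipeline log down to errors from the last seven days.
    # Assumes lines like "2024-04-10 16:56:24 ERROR ...", which will vary;
    # adjust the parsing to the real log format.
    from datetime import datetime, timedelta

    CUTOFF = datetime.now() - timedelta(days=7)

    def recent_errors(path):
        with open(path) as log:
            for line in log:
                parts = line.split(maxsplit=3)
                if len(parts) < 4:
                    continue
                try:
                    stamp = datetime.strptime(f"{parts[0]} {parts[1]}",
                                              "%Y-%m-%d %H:%M:%S")
                except ValueError:
                    continue
                if parts[2] == "ERROR" and stamp >= CUTOFF:
                    yield line.rstrip()

    if __name__ == "__main__":
        for line in recent_errors("pipeline.log"):
            print(line)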
Data Engineer at an energy/utilities company with 10,001+ employees
Real User
Nov 19, 2020
We've seen a couple of cases where it appears to have a memory leak or a similar problem. Memory usage grows for a while, and then we have to restart the container, maybe once a month when it gets high.
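A watchdog along the following lines can automate that kind of restart. This is an illustrative sketch: the container name and threshold are hypothetical, and it relies only on the standard docker stats and docker restart commands:

    # Sketch: restart a container when memory usage crosses a threshold.
    # Container name and threshold are hypothetical placeholders.
    import subprocess

    CONTAINER = "streamsets-dc"  # hypothetical name
    THRESHOLD = 85.0             # percent of the container's memory limit

    def mem_percent(container: str) -> float:
        out = subprocess.check_output(
            ["docker", "stats", "--no-stream", "--format", "{{.MemPerc}}", container],
            text=True)
        return float(out.strip().rstrip("%"))

    if __name__ == "__main__":
        usage = mem_percent(CONTAINER)
        if usage > THRESHOLD:
            subprocess.run(["docker", "restart", CONTAINER], check=True)
            print(f"restarted {CONTAINER} at {usage:.1f}% memory")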
Senior Technical Manager at a financial services firm with 501-1,000 employees
Real User
Aug 8, 2018
I would like to see it integrate with platforms other than Java. We're going to have a lot of applications using .NET and other languages and frameworks, and while StreamSets is very helpful for the old Java platform, it's hard to integrate with other platforms and frameworks. StreamSets also works great for batch processing, but we are looking for something more real-time; we need sub-millisecond latency.