Senior Data Platform Manager at a manufacturing company with 10,001+ employees
Real User
Top 5
Apr 10, 2024
We often faced problems, especially with SAP ERP. We struggled because many columns weren't integers or primary keys, which StreamSets couldn't handle, so we had to restructure our data tables, which was painful. Pipeline failures were also common, and data drift wasn't addressed, which made things worse. Licensing was another issue we encountered.
Director Data Engineering, Governance, Operation and Analytics Platform at a financial services firm with 10,001+ employees
Real User
Top 20
Jul 21, 2023
StreamSets should provide a mechanism for performing data quality assessment while data is being moved from a source to a target: the ability to validate the data against various data rules and then, when a data quality check fails, send alerts or information that helps people understand the validation issues.
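As a rough illustration of the rule-based validation this reviewer describes (illustrative Python, not an actual StreamSets feature; the rules and alert hook are hypothetical):

    # Illustrative sketch of rule-based data quality checks during movement.
    # Not a StreamSets API; the rules and the alert hook are hypothetical.

    # Each rule: (name, predicate over a record dict)
    RULES = [
        ("customer_id is present", lambda r: r.get("customer_id") is not None),
        ("amount is non-negative",
         lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0),
    ]

    def send_alert(message: str) -> None:
        # Placeholder: in practice this could post to email, Slack, or a webhook.
        print(f"ALERT: {message}")

    def validate(records):
        """Yield records that pass every rule; alert on each failure."""
        for record in records:
            failures = [name for name, check in RULES if not check(record)]
            if failures:
                send_alert(f"record {record!r} failed: {', '.join(failures)}")
            else:
                yield record

    if __name__ == "__main__":
        sample = [{"customer_id": 1, "amount": 10.0},
                  {"customer_id": None, "amount": -5}]
        print(list(validate(sample)))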
The design, or the way they have set up the protocol, is pretty good. One thing I would like added is the ability to manually enter data. The way the solution currently works, we don't have the option to manually change the data at any point in time. Sometimes we need to manually manipulate the data to make it more accurate in case our prior bifurcation filters are not good. It does not have that feature today; none of the solutions provides it, but it is the feature we are looking for. Being able to bifurcate the data or manually manipulate it at any point in time would be a game changer and would allow us to do everything we want with our data. Its initial setup could also be a bit easier.
When using Transformer for Snowflake, it's a bit complex to understand the transformation logic; you need someone with technical skills to handle it and to transform the data. On the plus side, Transformer for Snowflake is a serverless engine embedded within the platform, so there is no need for maintenance. Having a serverless engine makes it easy for any enterprise not to think or worry about the cost of maintaining the software. The Data Collector in StreamSets, however, has to be set up properly. For example, a simple database configuration with MySQL requires the MySQL Connector to be installed first.
Sometimes, it is not clear at first how to set up nodes. A site with an explanation of how each node works would be very helpful. Also, it doesn't provide a very good user experience.
If the connection drops and the pipeline is restarted, it sometimes does not reconnect; that has room for improvement. The documentation is also inadequate: technical support does not regularly update the documentation or the knowledge base, which leads to discrepancies between the software and the documentation and makes it difficult to understand.
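For context, the reconnect behavior this reviewer is missing usually amounts to a retry-with-backoff loop, sketched here in illustrative Python (open_connection is a hypothetical placeholder for whatever source the pipeline reads from):

    # Illustrative retry-with-backoff reconnect loop; `open_connection` is a
    # hypothetical placeholder, not a StreamSets API.
    import time

    def connect_with_backoff(open_connection, attempts=5, base_delay=1.0):
        """Try to (re)connect, doubling the wait after each failure."""
        for attempt in range(attempts):
            try:
                return open_connection()
            except ConnectionError as exc:
                wait = base_delay * (2 ** attempt)
                print(f"connect failed ({exc}); retrying in {wait:.0f}s")
                time.sleep(wait)
        raise ConnectionError(f"gave up after {attempts} attempts")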
Chief Software Engineer at Appnomu Business Services
Real User
Top 10
Mar 24, 2023
There should be a concept of creating double variables; it's still missing. The loading mechanism also needs to be simplified; currently, it takes some time to get familiar with and understand it. Visualization and monitoring need to be improved and refined. For example, it is difficult to monitor a job and see what happened over the past seven days when a transfer occurred. The licensing model also has room for improvement; the solution is currently expensive.
The user interface requires some corrections to the menu settings, menu items, and report generation. Report generation also takes some time.
The software is very good overall. Areas for improvement are the error logging and the version history. I would like to see better, more detailed error logging information. Apart from that, I don't think much improvement is required, because the software and features are very good.
In terms of the product, I don't think there is much room for improvement because it is very good. One small area of improvement that is very much needed is the knowledge base. Sometimes it is not very clear how to set up a certain process or a certain node for a person who is using the platform for the first time. Some visual explanation, or visually appealing knowledge base content, would be very good. That is something I could have done with when I started using it, because I found it very difficult.
Product Marketer at a media company with 1,001-5,000 employees
Real User
Top 5
Jan 6, 2023
In terms of features, I don't have any complaints so far. But one area for improvement could be the cloud storage server speed, as we have faced some latency issues here and there.
Senior Network Administrator at an energy/utilities company with 201-500 employees
Real User
Top 20
Dec 1, 2022
The design experience is the bane of our existence because their documentation is not the best. Even when they update their software, they don't publish good information on how to update and change your pipeline configuration to conform to current best practices. We don't pay for the added support; we use the "freeware version." The user community, as well as the documentation they provide for the standard user, is difficult at best. However, we have a couple of people in-house who are experts in data analysis, and they have figured out how to use this tool. We have to have extremely skilled people go in and write the pipelines for this software because it's so complicated. The software works great for us, but there is an extremely steep learning curve because they don't provide a lot of information outside of their ridiculous support costs, which start at $50,000 a year. Also, the built-in data drift resilience for ETL operations requires a bunch of custom code development, which is somewhat difficult because you have to customize it a fair amount. I would also like a more user-friendly interface and better error-trap handling.
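As a rough picture of the custom drift handling this reviewer describes (illustrative Python, not StreamSets' built-in mechanism; the expected field set is hypothetical):

    # Illustrative sketch of custom data drift detection in an ETL step.
    # Not StreamSets' built-in mechanism; the expected schema is hypothetical.

    EXPECTED_FIELDS = {"order_id", "sku", "quantity", "price"}

    def classify_drift(record: dict):
        """Return (new_fields, missing_fields) for one incoming record."""
        fields = set(record)
        return fields - EXPECTED_FIELDS, EXPECTED_FIELDS - fields

    def process(records):
        for record in records:
            new, missing = classify_drift(record)
            if new or missing:
                # Route drifted records aside for review instead of failing the run.
                print(f"drift detected: new={sorted(new)} missing={sorted(missing)}")
                continue
            yield record

    if __name__ == "__main__":
        rows = [{"order_id": 1, "sku": "A", "quantity": 2, "price": 9.5},
                {"order_id": 2, "sku": "B", "quantity": 1, "price": 3.0, "color": "red"}]
        print(list(process(rows)))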
Sometimes, when we have large amounts of data that are stored very efficiently in Hadoop or Kafka, it is not very efficient to run them through StreamSets, due to its inefficiency and the resources it uses. Also, the hierarchy of names within the dropdowns and the drag-and-drop features are not familiar to users who do not have a technical or programming background. In those cases, the naming conventions are a challenge.
There are a few things that could be better. We create pipelines and jobs in StreamSets Control Hub. It is a great feature, but a way to organize the pipelines and jobs in Control Hub into a folder structure would be great; I submitted a ticket for this some time back. Certain features are also only available at certain stages. For example, HTTP Client has some great features when it is used as a processor, but those features are not available in HTTP Client as a destination. There could be some improvements on the group side as well. Currently, if I want to know which users are part of certain groups, it is not straightforward to see; you have to go to each and every user and check the groups he or she is part of. In case something goes wrong, we have to put in manual effort and go to each and every user account to check whether that user is part of a certain group.
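To give a sense of the manual effort involved, an audit script along these lines can invert the user-to-groups view into group-to-users; the REST endpoints and JSON fields below are hypothetical stand-ins, not the actual Control Hub API:

    # Hypothetical sketch: invert "user -> groups" into "group -> users".
    # The endpoint and JSON fields are illustrative stand-ins, not the
    # actual StreamSets Control Hub API.
    from collections import defaultdict
    import requests

    BASE = "https://controlhub.example.com/api"    # hypothetical
    HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical auth scheme

    def group_membership():
        members = defaultdict(list)
        users = requests.get(f"{BASE}/users", headers=HEADERS, timeout=30).json()
        for user in users:
            for group in user.get("groups", []):
                members[group].append(str(user["id"]))
        return members

    if __name__ == "__main__":
        for group, users in sorted(group_membership().items()):
            print(f"{group}: {', '.join(users)}")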
Senior Data Engineer at an energy/utilities company with 1,001-5,000 employees
Real User
Jun 9, 2022
One area for improvement is probably the GUI. It is pretty basic, and a lot of improvement is required there. In terms of security, from an architecture perspective, our organization is very strict when it comes to cybersecurity, and we have been struggling a bit because the platform has a few gaps. Those gaps are really relative to our organization's requirements; they are not gaps on StreamSets' side. Still, the solution could improve a lot by adding more features to the security model, which would help us. There are quite a few features we wanted. One is for SAP HANA: currently, we can only use a query to read data from SAP HANA. What we would like to see, as soon as possible, is the ability to read from multiple tables in SAP HANA. For example, if you have 100 tables in SQL Server or Oracle, you can just point StreamSets at the schema or the 100 tables and ingest the information. You can't do that with SAP HANA because StreamSets does not have a multi-table feature for it. A multi-table origin for SAP HANA would be helpful and is something we could use immediately.
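Until such an origin exists, a per-table query loop can approximate one. Here is a minimal sketch using SAP's hdbcli Python driver, with hypothetical connection details, assuming the SYS.TABLES catalog view is readable:

    # Minimal sketch: emulate a multi-table SAP HANA origin by issuing one
    # query per table. Connection details are hypothetical placeholders.
    from hdbcli import dbapi  # SAP HANA Python driver

    conn = dbapi.connect(address="hana.example.com", port=30015,
                         user="INGEST_USER", password="...")
    cursor = conn.cursor()

    # List the tables in one schema via the catalog.
    cursor.execute(
        "SELECT TABLE_NAME FROM SYS.TABLES WHERE SCHEMA_NAME = ?", ("SALES",))
    tables = [row[0] for row in cursor.fetchall()]

    for table in tables:
        cursor.execute(f'SELECT * FROM "SALES"."{table}"')
        for row in cursor.fetchall():
            pass  # hand each row to the ingestion step
    conn.close()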
The logging mechanism could be improved. When I build a pipeline, create a job out of it, and run it, it generates constant logs, so the logging mechanism could be simplified. Right now, it is a bit difficult to understand and filter the logs, and it takes some time. If I am just starting with StreamSets, everything is fine; however, if I want to dig into problems my pipeline ran into, it initially takes some time to get familiar with it and understand it. I also feel the visualization part could be simplified or enhanced a bit, so I can easily see what happened with my job seven days earlier and how many records it transmitted.
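As an illustration of the kind of log filtering this reviewer wants built in, a short script can pull recent errors out of a pipeline log; the "timestamp level message" line format assumed here is hypothetical and will vary by setup:

    # Sketch: filter a pipeline log down to errors from the last seven days.
    # Assumes lines like "2024-04-10 16:56:24 ERROR ...", which will vary;
    # adjust the parsing to the real log format.
    from datetime import datetime, timedelta

    CUTOFF = datetime.now() - timedelta(days=7)

    def recent_errors(path):
        with open(path) as log:
            for line in log:
                parts = line.split(maxsplit=3)
                if len(parts) < 4:
                    continue
                try:
                    stamp = datetime.strptime(f"{parts[0]} {parts[1]}",
                                              "%Y-%m-%d %H:%M:%S")
                except ValueError:
                    continue
                if parts[2] == "ERROR" and stamp >= CUTOFF:
                    yield line.rstrip()

    if __name__ == "__main__":
        for line in recent_errors("pipeline.log"):
            print(line)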
Data Engineer at an energy/utilities company with 10,001+ employees
Real User
Nov 19, 2020
We've seen a couple of cases where it appears to have a memory leak or a similar problem. Memory usage grows for a while, and then we have to restart the container, maybe once a month when it gets high.
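A watchdog along the following lines can automate that kind of restart. This is an illustrative sketch: the container name and threshold are hypothetical, and it relies only on the standard docker stats and docker restart commands:

    # Sketch: restart a container when memory usage crosses a threshold.
    # Container name and threshold are hypothetical placeholders.
    import subprocess

    CONTAINER = "streamsets-dc"  # hypothetical name
    THRESHOLD = 85.0             # percent of the container's memory limit

    def mem_percent(container: str) -> float:
        out = subprocess.check_output(
            ["docker", "stats", "--no-stream", "--format", "{{.MemPerc}}", container],
            text=True)
        return float(out.strip().rstrip("%"))

    if __name__ == "__main__":
        usage = mem_percent(CONTAINER)
        if usage > THRESHOLD:
            subprocess.run(["docker", "restart", CONTAINER], check=True)
            print(f"restarted {CONTAINER} at {usage:.1f}% memory")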
Senior Technical Manager at a financial services firm with 501-1,000 employees
Real User
Aug 8, 2018
I would like to see it integrate with platforms other than Java. We're going to have a lot of applications using .NET and other languages and frameworks, and while StreamSets is very helpful for the old Java platform, it's hard to integrate with other platforms and frameworks. StreamSets also works great for batch processing, but we are looking for something more real-time; we need sub-millisecond latency.