Hadoop lacks OLAP capabilities. I recommend adding a Delta Lake feature to make the data compatible with ACID properties. Also, issues with importing video and audio streams could be addressed to ensure proper data validation.
Improvements in security measures would be beneficial, given the large volumes of data handled. Robust security features are essential to prevent data leaks or breaches. Additionally, integrating advanced capabilities similar to those of other solutions would enhance the platform's functionality.
When working with Kafka, I saw that the data arrived incrementally. Incremental data processing is still not very effective in Apache Hadoop. Data that is already at rest can be processed very effectively, but data that arrives every second is not handled well; if you need to check the state of some data every second, Apache Hadoop does not process it effectively. One of the features where improvements are required revolves around the licensing cost of the tool. If the tool offered a pay-per-use licensing structure, similar to how AWS operates, organizations could get a feel for Apache Hadoop before committing. Apache Hadoop should also look into the capability of processing incremental data. The setup process is another area with scope for improvement; it is not very simple, because we need to do all the server settings ourselves, including port and firewall configurations, whereas other products on the market are simpler in this respect. There are also shortcomings in the product's technical support: the resolution time frame needs to be improved, as does the overall communication from the technical support team.
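To make the incremental-processing complaint concrete, here is a minimal sketch, not the reviewer's setup, of how per-second data from Kafka is commonly handled today with Spark Structured Streaming running on top of a Hadoop/YARN cluster rather than with classic MapReduce batch jobs. The broker address and topic name are hypothetical, and the job assumes the spark-sql-kafka connector package is available.

```python
# Minimal sketch: incremental, per-second processing of a Kafka topic with
# Spark Structured Streaming. Broker address and topic name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("incremental-kafka-demo").getOrCreate()

# Read only new records as they arrive, instead of re-scanning data at rest.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Aggregate per one-second window and update results incrementally.
per_second = (events
              .groupBy(window(col("timestamp"), "1 second"))
              .agg(count("*").alias("events_per_second")))

(per_second.writeStream
 .outputMode("update")
 .format("console")
 .start()
 .awaitTermination())
```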
Tools like Apache Hadoop are knowledge-intensive in nature. Unlike many other tools currently on the market, knowledge-intensive products cannot be understood straight away. To use Apache Hadoop, a person needs intensive knowledge, which is not something everybody can become familiar with easily. It would be beneficial if navigating tools like Apache Hadoop were made more user-friendly. If the tool were easier for non-technical users to navigate, it would be easier to use and one might not have to depend on experts. The product's load optimization capabilities are an area of concern where improvements are required. The complex setup phase could also be made easier in the future.
The tool provides functionalities to deal with data skewness or a diverse set of data, and there are some configurations it provides for this. In certain cases, however, the configurations for dealing with data skewness do not make any sense, and we usually have to deal with it using a custom solution. Spark handles such cases efficiently; if Hadoop solved these issues the way Spark does, it could compete with Spark at the same level. Hive is a little slower than Spark: Spark does in-memory, parallel processing, while Hive does parallel processing but not in memory.
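As a rough illustration of the kind of "custom solution" for skew mentioned above, one common workaround is to salt a hot join key so a single executor does not receive most of the rows. The sketch below uses PySpark with made-up table and column names (orders, customers, customer_id); it is one possible approach under those assumptions, not the reviewer's code.

```python
# Hypothetical tables: a large "orders" table skewed on customer_id and a smaller
# "customers" dimension table. Salting spreads the hot key across several sub-keys.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-salting-demo").getOrCreate()
SALT_BUCKETS = 16

orders = spark.table("orders")        # large, skewed on customer_id
customers = spark.table("customers")  # smaller dimension table

# Assign each order a random salt value in [0, SALT_BUCKETS).
orders_salted = orders.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate each customer row once per salt value so the join still matches.
salts = (spark.range(SALT_BUCKETS)
         .withColumn("salt", F.col("id").cast("int"))
         .drop("id"))
customers_salted = customers.crossJoin(salts)

# The join key now includes the salt, splitting the skewed key across partitions.
joined = orders_salted.join(customers_salted, on=["customer_id", "salt"])
```

In more recent Spark versions, adaptive query execution (the spark.sql.adaptive.skewJoin.enabled setting) can handle much of this automatically, which is the kind of built-in behaviour the reviewer would like Hadoop and Hive to match.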
Data Architect at a computer software company with 51-200 employees
Real User
Dec 29, 2023
The main thing is the lack of community support. If you want to implement a new API or create a new file system, you won't find easy support. And then there's the server issue: you have to create and maintain servers on your own, which can be hectic. Sometimes the configurations in the documentation don't work, and without a strong community to turn to, you can get stuck. That's where cloud services play a vital role. In future releases, the community needs to be improved a lot. We need a better community, and the documentation should be more accurate for the setup process; sometimes we face errors even when following the documentation for server setup and configuration. We also need better support. Even if we raise a ticket, it takes a long time to get addressed, and they don't offer online support; they ask for screenshots instead of direct screen sharing or hopping on a call, which takes even more time. But it's free, so we can't complain too much.
Hadoop isn't so problematic. It deals with file storage and maintenance. It is a network of file operations. The stability of the solution needs improvement.
The solution is not easy to use. The solution should be easy to use and suitable for almost any case connected with the use of big data or working with large amounts of data.
It could be more user-friendly. Other platforms, such as Cloudera, used for big data, are more user-friendly and presented in a more straightforward way. They are also more flexible than Hadoop. Hadoop's scrollback is not easy to use, either.
Credit & Fraud Risk Analyst at a financial services firm with 10,001+ employees
Real User
Sep 29, 2022
In terms of processing speed, I believe that this software, as well as the Hadoop-linked software, can be better. When analyzing massive amounts of data, you also want it to happen quickly, so faster processing speed is definitely an area for improvement. I am not sure about the cloud's technical aspects, whether there are things in the cloud architecture that essentially make it a little slow, but speed could be one area. Second, the Hadoop-linked programs and software that are available could do much more and much better in terms of UI and UX. This is probably the only feature we can improve a little bit, because the terminal and coding screen on Hadoop is a little outdated and looks like an old C++-era console screen. If the UI and UX can be improved slightly, I believe it will go a long way toward increasing adoption and effectiveness.
We have plans to increase usage, and this is where we've realized that when we have all these clusters and we're running queries and analyzing, we face some latency issues. I think more of the solution needs to be focused on the parallel processing and retrieval of data.
IT Expert at a tech services company with 1,001-5,000 employees
Real User
Jul 21, 2022
The price could be better. I think we would use it more, but the company didn't want to pay for it. Hortonworks doesn't exist anymore, and Cloudera killed the free version of Hadoop.
What could be improved in Apache Hadoop is its user-friendliness. It's not that user-friendly, but maybe it's because I'm new to it. Sometimes it feels so tough to use, but it could be because of two aspects: one is my incompetency, for example, I don't know about all the features of Apache Hadoop, or maybe it's because of the limitations of the platform. For example, my team is maintaining the business glossary in Apache Atlas, but if you want to change any settings at the GUI level, an advanced level of coding or programming needs to be done in the back end, so it's not user-friendly.
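For context on the "back end" work mentioned above: Apache Atlas exposes a v2 REST API for managing glossary content outside the GUI. The sketch below is a hedged illustration only; the host, credentials, glossary GUID, and term fields are placeholders and should be checked against the Atlas documentation for your version.

```python
# Hypothetical example of creating a business-glossary term via the Atlas v2 REST
# API instead of the GUI. Host, credentials, and the glossary GUID are placeholders.
import requests

ATLAS_BASE = "http://atlas-host:21000/api/atlas/v2"
AUTH = ("admin", "admin")  # placeholder credentials

term = {
    "name": "CustomerLifetimeValue",
    "shortDescription": "Projected net revenue per customer",
    # The term must be anchored to an existing glossary by its GUID.
    "anchor": {"glossaryGuid": "<existing-glossary-guid>"},
}

resp = requests.post(f"{ATLAS_BASE}/glossary/term", json=term, auth=AUTH)
resp.raise_for_status()
print("Created term with GUID:", resp.json().get("guid"))
```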
R&D Head, Big Data Adjunct Professor at SK Communications Co., Ltd.
Real User
Jan 14, 2022
Apache Hadoop's real-time data processing is weak and is not enough to satisfy our customers, so we may have to pick other products. We are continuously researching other solutions and other vendors. Another weak point of this solution, technically speaking, is that it's very difficult to run and difficult to smoothly implement. Preparation and integration are important. The integration of this solution with other data-related products and solutions, and having other functions, e.g. API connectivity, are what I want to see in the next release.
Founder & CTO at a tech services company with 1-10 employees
Real User
Dec 8, 2020
I don't have any concerns because each part of Hadoop has its use cases. To date, I haven't implemented a huge product or project using Hadoop, but at the level of POCs, it's fine. The community of Hadoop is now a cluster; I think there is room for improvement in the ecosystem. From the Apache perspective, or the open-source community, they need to add more capabilities to make life easier from a configuration and deployment perspective.
Technical Lead at a government with 201-500 employees
Real User
Oct 19, 2020
For the visualization tools, we use Apache Hadoop and it is very slow. It lacks some query language. We have to use Apache Linux. Even so, the query language still has limitations with just a bit of documentation and many of the visualization tools do not have direct connectivity. They need something like BigQuery which is very fast. We need those to be available in the cloud and scalable. The solution needs to be powerful and offer better availability for gathering queries. The solution is very expensive.
Vice President - Finance & IT at a consumer goods company with 1-10 employees
Real User
Jul 14, 2020
I'm not sure if I have any ideas as to how to improve the product. Every year, the solution comes out with new features; Spark is one example. If they continue to release helpful new features, it will continue to increase the value of the solution. The solution could always improve performance; this is a consistent requirement, and whenever you run it, there is always room for improvement in terms of performance. The solution needs a better tutorial. There are only documents available currently. There are a lot of YouTube videos available, but in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning. We would prefer it if users weren't just pushed toward certification-based learning, as certifications are expensive; maybe they could arrange for the certification to be offered at a lower cost. The certification cost is currently around $2,500 or thereabouts.
We're finding vulnerabilities in running it 24/7. We're experiencing some downtime that affects the data. It would be good to have more advanced analytics tools.
Practice Lead (BI/ Data Science) at a tech services company with 11-50 employees
Real User
Dec 16, 2019
It could be because the solution is open source, and therefore not funded like bigger companies, but we find the solution runs slow. The solution isn't as mature as SQL or Oracle and therefore lacks many features. The solution could use a better user interface. It needs a more effective GUI in order to create a better user environment.
What needs improvement depends on the customer and the use case. The classical Hadoop, for example, we consider an old variant; most now work with flash data. There is a very wide range of applications for this solution, but in enterprise companies that work with classical BI systems, it would be good to include an additional presentation layer for BI solutions. There is a lack of visualization and presentation layers, so you can't just take it and implement it like a ready-made solution.
Hadoop itself is quite complex, especially if you want it running on a single machine, so getting it set up is a big mission. It seems that Hadoop is on its way out and Spark is the way to go; you can run Spark on a single machine, and it's easier to set up. In the next release, I would like to see Hive made more responsive for smaller queries, with reduced latency. I don't think this is viable, but if it is possible, I'd like lower latency on smaller queries for analysis and analytics. I would also like a smaller version that can be run on a local machine. There are installations that do that, but they are quite difficult, so a smaller version that is easy to install and explore would be an improvement.
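As a concrete illustration of the single-machine point above, Spark's local mode runs with no HDFS, YARN, or multi-server configuration at all, which is the contrast the reviewer is drawing with Hadoop's setup. This is a minimal sketch; the file path and column name are hypothetical.

```python
# Minimal sketch of Spark running entirely on one machine in local[*] mode,
# with no cluster manager or HDFS; the CSV path and column name are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")                 # use all local cores, no cluster
         .appName("single-machine-demo")
         .getOrCreate())

df = spark.read.csv("file:///tmp/sample.csv", header=True, inferSchema=True)
df.groupBy("some_column").count().show()     # simple aggregation, all in-process

spark.stop()
```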
IT Expert at a tech services company with 1,001-5,000 employees
Real User
Jul 28, 2019
We are using HDTM circuit boards, and I worry about the future of this product and its compatibility with future releases. It's a concern because, for now, we do not have a clear path to upgrade. The Hadoop product is now in version three, and we'd like to upgrade to that version, but as far as I know, it's not a simple thing. There are a lot of features in this product that are open source, so if something isn't included with the distribution, we are not limited; we can take things from the internet and integrate them. As far as I know, we are using Presto, which isn't included in HDP (Hortonworks Data Platform), and it works fine. Not everything has to be included in the release. If something is outside of HDP and it works, that is good enough for me; we have the flexibility to incorporate it ourselves.
Analytics Platform Manager at a consultancy with 10,001+ employees
Real User
Aug 14, 2018
In general, Hadoop has a lot of different component parts to the platform - things like Hive and HBase - and they're all moving somewhat independently and somewhat in parallel. I think as you look to platforms in the cloud or walled-garden concepts like Cloudera or Azure, you see that the third party can make sure all the components work together before they are used for business purposes. That reduces a layer of administration, configuration, and technical support. I would like to see more direct integration of visualization applications.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect...
The product's availability of comprehensive training materials could be improved for faster onboarding and skill development among team members.
Since it is an open-source product, there won't be much support. So, you have to have deeper knowledge. You need to improvise based on that.
Integrating Apache Hadoop with lots of different techniques within your business can be a challenge.
Hadoop's security could be better.
It would be helpful to have more information on how to best apply this solution to smaller organizations, with less data, and grow the data lake.
We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it.