Apache Hadoop Reviews and Pricing

Kenechukwu Murphy Ezeoka

IT Support Specialist at Convergys Corporation

Sep 17, 2024

Download

Enables efficient data warehousing and supports a large ecosystem of tools

Pros and Cons

"Its flexibility in handling and storing large volumes of data is particularly beneficial, as is its resilience, which ensures data redundancy and fault tolerance."

"Improvements in security measures would be beneficial, given the large volumes of data handled."

What is our primary use case?

We used the product primarily for data analysis and storage. It helps handle large data sets, performing tasks like filtering, sorting, and joining. The platform is useful for data warehousing and provides distributed coordination and synchronization functionalities.

How has it helped my organization?

The solution has effectively supported our operations primarily due to its cost efficiency. It enables us to manage large data sets without incurring excessive subscription costs, resulting in more efficient data handling and operations.

What is most valuable?

The platform's most valuable feature is its low cost and open-source nature. It runs efficiently on commodity hardware and supports a large ecosystem of tools. Its flexibility in handling and storing large volumes of data is particularly beneficial, as is its resilience, which ensures data redundancy and fault tolerance.

What needs improvement?

Improvements in security measures would be beneficial, given the large volumes of data handled. Robust security features are essential to prevent data leaks or breaches. Additionally, integrating advanced capabilities similar to those other solutions would enhance the platform's functionality.

Buyer's Guide

Apache Hadoop

May 2025

Free Report: Apache Hadoop Reviews and More

Learn what your peers think about Apache Hadoop. Get advice and tips from experienced pros sharing their opinions. Updated: May 2025.

DOWNLOAD NOW

850,671 professionals have used our research since 2012.

For how long have I used the solution?

I have worked with Apache Hadoop for about six to seven months. The duration varied based on the projects I was involved in, as we often switched to different projects with different applications.

What do I think about the stability of the solution?

Although I have encountered some performance issues, the platform has proven to be stable.

I would rate its stability as eight or nine.

What do I think about the scalability of the solution?

This platform's scalability significantly impacts data management capabilities. It allows for simultaneously handling large data volumes, which other applications might struggle with.

How are customer service and support?

I have contacted Apache tech support when encountering issues that could not be resolved internally. Their support is reliable and responsive.

How would you rate customer service and support?

Positive

How was the initial setup?

The initial setup can be complex due to the need for precise coding and configuration. Setting up the required components and ensuring all dependencies are correctly configured is crucial for a successful deployment. Depending on network capability and system specifications, the setup typically takes 30 minutes to one hour if prerequisites are met. Maintenance involves regular updates to ensure the platform runs with the latest features and security patches.

What's my experience with pricing, setup cost, and licensing?

The product is open-source, but some associated licensing fees depend on the subscription level. While it might be free for students, organizations typically need to pay for their subscriptions. The fees were reasonable for my usage, though I am not aware of recent changes to the pricing.

What other advice do I have?

The product is highly effective for processing and managing large data sets. Integrating it with other solutions like AWS can provide additional functionalities, but the cost benefits of using this platform remain significant. I have also used the solution in AI-driven projects with machine learning models, and its integration with Apache Spark has been advantageous. I recommend it to organizations needing to handle large data sets due to its cost-effectiveness and robust capabilities.

I rate it a nine out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Last updated: Sep 17, 2024

Abhik Ray

Co-Founder at Quantic

Aug 5, 2022

Download

Has good processing power and speed and is capable of handling large volumes of data and doing online analysis

Pros and Cons

"The most important feature is its ability to handle large volumes. Some of our customers have really large volumes, and it is capable of handling their data in terms of the core volume and daily incremental volume. So, its processing power and speed are most valuable."

"It requires a great deal of learning curve to understand. The overall Hadoop ecosystem has a large number of sub-products. There is ZooKeeper, and there are a whole lot of other things that are connected. In many cases, their functionalities are overlapping, and for a newcomer or our clients, it is very difficult to decide which of them to buy and which of them they don't really need. They require a consulting organization for it, which is good for organizations such as ours because that's what we do, but it is not easy for the end customers to gain so much knowledge and optimally use it."

What is our primary use case?

Its main use case is to create a data warehouse or data lake, which is a collection of data from multiple product processors used by a banking organization. They have core banking, which has savings accounts or deposits as one system, and they have a CRM or customer information system. They also have a credit card system. All of them are separate systems in most cases, but there is a linkage between the data. So, the main motivation is to consolidate all that data in one place and link it wherever required so that it acts as a single version of the truth, which is used for management reporting, regulatory reporting, and various forms of analyses.

We have done two or three projects with Hadoop, and we have taken the latest version available at that time. So far, it was deployed on-premises.

What is most valuable?

The most important feature is its ability to handle large volumes. Some of our customers have really large volumes, and it is capable of handling their data in terms of the core volume and daily incremental volume. So, its processing power and speed are most valuable.

Another feature that I like is online analysis. In some cases, data requires online analysis. We like using Hadoop for that.

What needs improvement?

It requires a great deal of learning curve to understand. The overall Hadoop ecosystem has a large number of sub-products. There is ZooKeeper, and there are a whole lot of other things that are connected. In many cases, their functionalities are overlapping, and for a newcomer or our clients, it is very difficult to decide which of them to buy and which of them they don't really need. They require a consulting organization for it, which is good for organizations such as ours because that's what we do, but it is not easy for the end customers to gain so much knowledge and optimally use it. However, when it comes to power, I have nothing to say. It is really good.

For how long have I used the solution?

We have been working with this solution for two and a half to three years.

What do I think about the stability of the solution?

The core file system and the offline data ingestion are extremely stable. In my experience, there is a bit less stability during online data ingestion. When you have incremental online data, sometimes it stops or aborts before finishing. It is rare, but it does, but the offline data injection and the basic processing are very stable.

What do I think about the scalability of the solution?

Its scalability is very good. Most of our clients have used it on-prem. So, to a large extent, it is up to them to provide hardware for large data, which they have. Its scalability is linear. As long as the hardware is given to it, there are no complaints.

About 70% of its users are from a client's IT in terms of setting it up and providing support to make sure that the pipeline is there. Business users are about 30%. They are the people who use the analytics derived from the warehouse or data lake. Collectively, there are about 120 users. The size of the data is mostly in terms of the number of records it handles, which could be 30 or 40 million.

How are customer service and support?

We have not dealt with them too many times. I would rate them a four out of five. There are no complaints.

How would you rate customer service and support?

Positive

Which solution did I use previously and why did I switch?

Some of our clients are using Teradata, and some of them are using Hadoop.

How was the initial setup?

After the hardware is available, getting the environment and software up and running has taken us a minimum of a week or 10 days. Sometimes, it has taken us longer, but usually, this is what it takes at the minimum to get everything up. It includes the downloads and also setting it up and making things work together to start using it.

For the original deployment, because there are so many components and not everyone knows everything pretty well, we have seen that we had to deploy four or five people in various areas at the initial deployment stage. However, once it is running, one or two people are required for maintenance.

What was our ROI?

Different clients derive different levels of return based on the sophistication of the analytics that they derive out of it and how they use it. I don't know how much ROI they have got, but I can say that some clients have not got a decent ROI, but some of our clients are happy with it. It is very much client-dependent.

What's my experience with pricing, setup cost, and licensing?

We don't directly pay for it. Our clients pay for it, and they usually don't complain about the price. So, it is probably acceptable.

What other advice do I have?

I would rate it a nine out of ten because of the complexity, but technically, it is okay.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Buyer's Guide

Apache Hadoop

May 2025

Free Report: Apache Hadoop Reviews and More

Learn what your peers think about Apache Hadoop. Get advice and tips from experienced pros sharing their opinions. Updated: May 2025.

DOWNLOAD NOW

850,671 professionals have used our research since 2012.

Teodor Muraru

Developer at Emag

May 26, 2024

Download

Helps to store and retrieve information

Pros and Cons

"Apache Hadoop is crucial in projects that save and retrieve data daily. Its valuable features are scalability and stability. It is easy to integrate with the existing infrastructure."

What is our primary use case?

The solution helps to store and retrieve information.

What is most valuable?

Apache Hadoop is crucial in projects that save and retrieve data daily. Its valuable features are scalability and stability. It is easy to integrate with the existing infrastructure.

For how long have I used the solution?

I have been using the tool for a few years.

What do I think about the stability of the solution?

I rate the tool's stability a nine out of ten.

How are customer service and support?

I take support from the DevOps team.

What other advice do I have?

I recommend the tool to others since it is good.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.

reviewer2324613

Data Architect at a computer software company with 51-200 employees

Jan 6, 2024

Download

Allows for customization and optimization of applications and performance using in-house resources but lacks community support

Pros and Cons

"It's open-source, so it's very cost-effective."

"The main thing is the lack of community support. If you want to implement a new API or create a new file system, you won't find easy support."

What is our primary use case?

We work on Apache Hadoop for various customers.

What is most valuable?

It's open-source, so it's very cost-effective. Apache Hadoop has its strengths. For example, in my previous organization, which was a small startup, we used it because it was cost-effective.

We only had to pay for the servers, and we could optimize applications and performance using our employees, which was especially cost-effective in India. So, human resources were the main investment, not software.

That was five years ago, though. In the last five years, I've mainly seen Redshift, Azure, and Oracle in the market.

What needs improvement?

The main thing is the lack of community support. If you want to implement a new API or create a new file system, you won't find easy support.

And then there's the server issue. You have to create and maintain servers on your own, which can be hectic. Sometimes, the configurations in the documentation don't work, and without a strong community to turn to, you can get stuck. That's where cloud services play a vital role.

In future releases, the community needs to be improved a lot. We need a better community, and the documentation should be more accurate for the setup process.

Sometimes, we face errors even when following the documentation for server setup and configuration. We need better support.

Even if we raise a ticket, it takes a long time to get addressed, and they don't offer online support. They ask for screenshots, which takes even more time. Instead of direct screensharing or hopping on a call. But it's free, so we can't complain too much.

For how long have I used the solution?

I've been working with Apache Hadoop for ten years. I started my career with Hadoop. I've worked with it at Infinia, Microsoft, and AWS, for a total of about eight years.

What do I think about the stability of the solution?

I would rate the stability a seven out of ten. There is room for improvement in performance.

What do I think about the scalability of the solution?

It can be scalable in certain cases. Typically, for startups or product-based companies with limited budgets during product development, Apache Hadoop is often the only viable option. They cannot afford the costs of other cloud-based systems, so Apache Hadoop plays a main role in those scenarios.

Which solution did I use previously and why did I switch?

For some customers, we use Oracle Autonomous Database. Now, I cannot compare Apache Hadoop with Oracle Autonomous Data Warehouse when it comes to value for money. They're not directly comparable.

How was the initial setup?

The initial setup is a hectic task. Configuring servers and nodes takes a long time. That's one of the big advantages of an Autonomous Data Warehouse. You can start implementing within half the time.

With Apache Hadoop, you have to wait for the setup, architecture, and data evaluation. But with Autonomous, those things are automated. It scales as you use more data, so you can focus on the business rather than infrastructure.

What's my experience with pricing, setup cost, and licensing?

We just use the free version.

What other advice do I have?

We can't use Apache Hadoop for everything, like storage and data errors. But we can use some tools that are native to Hadoop, like Kafka.

For the current situation, I'd rate it a seven out of ten.

However, five years ago, I would have rated it a nine out of ten. Back then, I was working with it fully. But now we're used to working with cloud systems. Creating servers is more difficult nowadays.

Disclosure: I am a real user, and this review is based on my own experience and opinions.

reviewer1976262

Credit & Fraud Risk Analyst at a financial services firm with 10,001+ employees

Oct 16, 2022

Download

Has the ability to take a large amount of data and deliver the necessary splices and summary charts

Pros and Cons

"Apache Hadoop can manage large amounts and volumes of data with relative ease, which is a feature that is beneficial."

"I mentioned it definitely, and this is probably the only feature we can improve a little bit because the terminal and coding screen on Hadoop is a little outdated, and it looks like the old C++ bio screen. If the UI and UX can be improved slightly, I believe it will go a long way toward increasing adoption and effectiveness."

What is our primary use case?

We use Apache Hadoop for analytics purposes.

What is most valuable?

The ability to take a lot of data and attempt to basically deliver the appropriate splices and summary chart is the most crucial function that I have discovered.

This stands in contrast to some of the other tools that are available, such as SQL and SAS, which are likely incapable of handling such a large volume of data. Even R, for instance, is unable to handle such data volumes.

Apache Hadoop can manage large amounts and volumes of data with relative ease, which is a feature that is beneficial.

What needs improvement?

In terms of processing speed, I believe that some of this software as well as the Hadoop-linked software can be better. While analyzing massive amounts of data, you also want it to happen quickly. Faster processing speed is definitely an area for improvement.

I am not sure about the cloud's technical aspects, whether there are things that happen in the cloud architecture that essentially make it a little slow, but speed could be one. And, second, the Hadoop-linked programs and Hadoop-linked software that are available could do much more and much better in terms of UI and UX.

I mentioned it definitely, and this is probably the only feature we can improve a little bit because the terminal and coding screen on Hadoop is a little outdated, and it looks like the old C++ bio screen.

If the UI and UX can be improved slightly, I believe it will go a long way toward increasing adoption and effectiveness.

For how long have I used the solution?

I have been using Apache Hadoop for six months.

What do I think about the stability of the solution?

It is far more stable than some of the other software that I have tried. It's also the current version of Hadoop software and is becoming increasingly more stable.

When a new version is released, the subsequent ones are always more stable and easier to use.

What do I think about the scalability of the solution?

According to what I have seen in my current enterprise, once I joined the organization, it was fairly simple to have it for an employee, and this is true for everyone who's been onboarded in my own designation. I would imagine that it is fairly scalable across an enterprise.

I am fairly certain that we have between 10 and 15,000 employees who use it.

How are customer service and support?

I have not had any direct experience with technical support.

We have an in-house technical support team that handles it.

Which solution did I use previously and why did I switch?

I have since changed careers, I no longer use any automation tools, nor does my job need me to compare the capabilities of other tools.

I am working with Risk Analytic tools. I work with data these days, therefore I use technologies like Hive, Shiny R, and other data-intensive programs.

Shiny is a plugin that you can have on R. As a result of changing my profiles, I am now working in a position that is more data-centric and less focused on process automation.

We currently have proprietary tools, a proprietary cloud software, therefore I don't really need to employ any external cloud vendors. Aside from that, I only use the third-party technologies I've already indicated, primarily Hadoop and R.

This is one of the prime, one of the cornerstone software that we use. I have never been in a position to compare the like-for-like comparison with another software.

How was the initial setup?

As it is proprietary software for the enterprise that I am currently working on, I had no trouble setting it up.

What's my experience with pricing, setup cost, and licensing?

I am not sure about the price, but in terms of usability and utility of the software as a whole, I would rate it a three and a half to four out of five.

Which other solutions did I evaluate?

When I was a digital transformation consultant for my prior employer, I downloaded and read the reviews.

It involved learning about workflow automation tools as well as process automation. I looked at a number of these platforms as part of that, but I have never actually used them.

What other advice do I have?

I would recommend this solution for data professionals who have to work hands-on with big data.

For instance, if you work with smaller or more finite data sets, that is, data sets that do not keep updating themselves, I would most likely recommend R or even Excel, where you can do a lot of analysis. However, for data professionals who work with large amounts of data, I would strongly recommend Hadoop. It's a little more technical, but it does the job.

I would rate Apache Hadoop an eight out of ten. I would like to see some improvements, but I appreciate the utility it provides.

Which deployment model are you using for this solution?

Public Cloud

Disclosure: I am a real user, and this review is based on my own experience and opinions.

Randy Chng

Senior Associate at a financial services firm with 10,001+ employees

Sep 17, 2017

Download

Relatively fast when reading data into other platforms but can't handle queries with insufficient memory

Pros and Cons

"As compared to Hive on MapReduce, Impala on MPP returns results of SQL queries in a fairly short amount of time, and is relatively fast when reading data into other platforms like R."

"The key shortcoming is its inability to handle queries when there is insufficient memory. This limitation can be bypassed by processing the data in chunks."

What is most valuable?

Impala. As compared to Hive on MapReduce, Impala on MPP returns results of SQL queries in a fairly short amount of time, and is relatively fast when reading data into other platforms like R (for further data analysis) or QlikView (for data visualisation).

How has it helped my organization?

The quick access to data enabled more frequent data backed decisions.

What needs improvement?

The key shortcoming is its inability to handle queries when there is insufficient memory. This limitation can be bypassed by processing the data in chunks.

For how long have I used the solution?

Two-plus years.

What do I think about the stability of the solution?

Typically instability is experienced due to insufficient memory, either due to a large job being triggered or multiple concurrent small requests.

What do I think about the scalability of the solution?

No. This is by default a cluster-based setup and hence scaling is just a matter of adding on new data nodes.

How are customer service and technical support?

Not applicable to Cloudera. We have a separate onsite vendor to manage the cluster.

Which solution did I use previously and why did I switch?

No. Two years ago this was a new team and hence there were no legacy systems to speak of.

How was the initial setup?

Complex. Cloudera stack itself was insufficient. Integration with other tools like R and QlikView was required and in-house programs had to be built to create an automated data pipeline.

What's my experience with pricing, setup cost, and licensing?

Not much advice as pricing and licensing is handled at an enterprise level.

However do take into consider that data storage and compute capacity scale differently and hence purchasing a "boxed" / 'all-in-one" solution (software and hardware) might not be the best idea.

Which other solutions did I evaluate?

Yes. Oracle Exadata and Teradata.

What other advice do I have?

Try open-source Hadoop first but be aware of greater implementation complexity. If open-source Hadoop is "too" complex, then consider a vendor packaged Hadoop solution like HortonWorks, Cloudera, etc.

Disclosure: I am a real user, and this review is based on my own experience and opinions.

it_user265830

Senior Hadoop Engineer with 1,001-5,000 employees

Dec 8, 2015

Download

The heart of BigData

What is most valuable?

Storage
Processing (cost efficient)

How has it helped my organization?

With the increase in data size for the business, this horizontal scalable appliance has answered every business question in terms of storage and processing. Hadoop ecosystem has not only provided a reliable distributed aggregation system but has also allowed room for analytics which has resulted in great data insights.

What needs improvement?

The Apache team is doing great job and releasing Hadoop versions much ahead of what we can think about. Every room for improvement is fixed as soon as a version is released by ASF. Currently, Apache Oozie 4.0.1 has some compatibility issues with Hadoop 2.5.2.

For how long have I used the solution?

2.5 years

What was my experience with deployment of the solution?

Not at all.

What do I think about the stability of the solution?

We did when we started initially with Hadoop 1.x, which did’t have HA, but now we don’t have any stability issue.

What do I think about the scalability of the solution?

Hadoop is known for its scalability. Yahoo stores approx. 455 PB in their Hadoop cluster.

How are customer service and technical support?

Customer Service:

It depends on the Hadoop distributor. I would rate Hortonworks 9/10.

Technical Support:

I would rate Hortonworks 9/10.

Which solution did I use previously and why did I switch?

We previously used Netezza. We switched because our business required a highly scalable appliance like Hadoop.

How was the initial setup?

It's a bit complex in terms of build around for commodities, but soon it will ease up as the product matures.

What about the implementation team?

We used a vendor team who were 9/10.

What was our ROI?

Valuable storage and processing with a lower cost than previously.

What's my experience with pricing, setup cost, and licensing?

Best in pricing and licensing depends on the flavors, but remember it is only good if you have very large data set which cannot be handled by traditional RDBMS.

Which other solutions did I evaluate?

Cloud options.

What other advice do I have?

First, understand your business requirement; second, evaluate the traditional RDBMS scalability and capability, and finally, if you have reached to the tip of an iceberg (RDBMS) then yes, you definitely need an island (Hadoop) for your business. Feasibility checks are important and efficient for any business before you can take any crucial step. I would also say “Don’t always flow with stream of a river because some time it will lead you to a waterfall, so always research and analyze before you take a ride.”

Disclosure: I am a real user, and this review is based on my own experience and opinions.

reviewer1384338

Vice President - Finance & IT at a consumer goods company with 1-10 employees

Jul 15, 2020

Download

Great micro-partitions, helpful technical support and quite stable

Pros and Cons

"The solution is easy to expand. We haven't seen any issues with it in that sense. We've added 10 servers, and we've added two nodes. We've been expanding since we started using it since we started out so small. Companies that need to scale shouldn't have a problem doing so."

"The solution needs a better tutorial. There are only documents available currently. There's a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning."

What is our primary use case?

As an example of a use case, when I was a contractor for Cisco, we were processing mobile network data and the volume was too big. RDBMS was not supporting anything. We started using the Hadoop framework to improve the process and get the results faster.

What is most valuable?

The data is stored in micro-partitions which makes the processes very fast compared to other RDBMS systems. Apache Spark is in the memory process, and it's much better than MapReduce.

Micro-partitions and the HDFS are both excellent features.

What needs improvement?

I'm not sure if I have any ideas as to how to improve the product.

Every year, the solution comes out with new features. Spark is one new feature, for example. If they could continue to release new helpful features, it will continue to increase the value of the solution.

The solution could always improve performance. This is a consistent requirement. Whenever you run it, there is always room for improvement in terms of performance.

The solution needs a better tutorial. There are only documents available currently. There's a lot of YouTube videos available. However, in terms of learning, we didn't have great success trying to learn that way. There needs to be better self-paced learning.

We would prefer it if users didn't just get pushed through to certification-based learning, as certifications are expensive. Maybe if they could arrange it so that the certification was at a lesser cost. The certification cost is currently around $2,500 or thereabout.

For how long have I used the solution?

I've been using the solution for four years.

What do I think about the stability of the solution?

We haven't had too many problems with stability. For the POC we used a small amount of data and we started with 10 nodes. We're gradually increasing in now to 40 nodes. We haven't seen any issues after the small teething period in the beginning. The configuration issues and the performance issues have subsided. Once we learned how to stack everything, it has been much better.

What do I think about the scalability of the solution?

The solution is easy to expand. We haven't seen any issues with it in that sense. We've added 10 servers, and we've added two nodes. We've been expanding since we started using it since we started out so small. Companies that need to scale shouldn't have a problem doing so.

We are supporting a multitenancy model and we get the data on supporting the users. I would say, per organization, we have eight to 10 users and probably have a total of around 40 users across the board.

How are customer service and technical support?

We started on the solution as a POC. Once we got into production, we had some minor issues. We get great support. They share advice and helped us tweak some things in terms of the configurations. We've been satisfied with the level of service we've been provided.

Which solution did I use previously and why did I switch?

We have only ever used Apache Hadoop, or a version of it. When we looked for the commercial tier, there was Cloudera and Hortonworks. We started with the Hortonworks due to the fact that at that time we felt it was cost-effective. However, Cloudera bought Hadoop and Hortonworks and now it's all basically the same solution.

How was the initial setup?

The initial setup was a little complex the first time around. We were new to the system, and we didn't have any expertise at that time. Once we get some support and insights into how to work everything properly it went more smoothly.

First, we started with a POC - proof of concept. It takes a couple of days in terms of understanding and configuring everything, etc. When we went to production, it was a couple of hours for deployment and we put into practice everything we learned from the POC.

There's definitely a learning curve. It's stable for us now.

We have a team of developers doing multiple tasks on the solution and few of them are taking care of Hadoop, so we do have a few people handling maintenance.

What about the implementation team?

As we were new to the solution, we found we needed some outside assistance to guide us. However, that was for the POC. In the end, I did it myself.

What other advice do I have?

We're just a customer. We don't have a business relationship with Hadoop.

My day-to-day job is data modeling and architecting.

Originally we used it as an open-source solution. We downloaded it, then we went for a commercial version of it.

In terms of advice, I'd tell other potential users that whether the solution is right for them depends on a few items. If the data volume is too big, it's IoT data, or the stream of data is too much, this solution can handle it and I would definitely recommend Apache Hadoop.

Recently, in the last 18 months, I've been working with the Snowflake, it's a Data Lake project, and I am really impressed with that one. I got a certification so that we started using Snowflake set for our Data Lake environment.

I'd rate the solution eight out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: I am a real user, and this review is based on my own experience and opinions.