Apache Hadoop vs Apache Spark comparison

Apache Hadoop and Apache Spark are not in the same category. Apache Hadoop is ranked #8 in Data Warehouse, with an average rating of 7.9, and holds a 5.1% mindshare in the category. Apache Spark is ranked #1 in Hadoop, with an average rating of 8.6, and holds a 17.8% mindshare. Additionally, 89% of Apache Hadoop users are willing to recommend the solution, compared to 90% of Apache Spark users.

Apache Hadoop: 40 reviews, 1,448 views, 1,191 comparison views, 89% willing to recommend

Apache Spark: 65 reviews, 1,580 views, 1,196 comparison views, 90% willing to recommend

Executive Summary

We performed a comparison between Apache Hadoop and Apache Spark based on real PeerSpot user reviews.

Find out what your peers are saying about Snowflake Computing, Oracle, Teradata and others in Data Warehouse.

To learn more, read our detailed Data Warehouse Report (Updated: March 2025).

Helped 842,592 peers since 2012

Categories and Ranking

Apache Hadoop
Average Rating: 7.8
Reviews Sentiment: 6.7
Number of Reviews: 40
Ranking in other categories: Data Warehouse (8th)

Apache Spark
Average Rating: 8.4
Reviews Sentiment: 7.7
Number of Reviews: 65
Ranking in other categories: Hadoop (1st), Compute Service (4th), Java Frameworks (2nd)

Mindshare comparison

Apache Hadoop and Apache Spark aren’t in the same category and serve different purposes. Apache Hadoop is listed under Data Warehouse and holds a mindshare of 5.1%, down 5.8% compared to last year. Apache Spark, on the other hand, is listed under Hadoop and holds a 17.8% mindshare, down 21.2% since last year.

Featured Reviews

Sushil Arya

Software developer at Fiserv

Provides ease of integration with the IT workflow of a business

When working with Kafka, I saw that the data came in an incremental order. The incremental data processing part is still not very effective in Apache Hadoop. If the data is already there, it can be processed very effectively, especially if the data is coming in every second. If you want to know the location of some data every second, then such data is not processed effectively in Apache Hadoop. I can say that one of the features where improvements are required revolves around the licensing cost of the tool. If the tool can build some licensing structures in a pay-per-use manner, organizations can get the look and feel of Apache Hadoop. Apache Hadoop can offer a licensing structure of the product that can be seen as similar to how AWS operates. Apache Hadoop can look into the capability of processing incremental data. The tool's setup process can be a scope of improvement. Also, it is not very simple because while doing the setup, we need to do all the server settings, including port listing and firewall configurations. If we look at other products on the market, then they can be made simpler. There are certain shortcomings when it comes to the product's technical support part, making it an area where improvements are required. The time frame for the resolution is an area that needs to be improved. The overall communication part of the technical support team also needs improvement.

Ilya Afanasyev

Senior Software Development Engineer at Yahoo!

Reliable, able to expand, and handle large amounts of data well

We use batch processing. It works well with our formats and file versions. There's a lot of functionality. In our pipeline each hour, we make a copy of data from MongoDB, of the changes from MongoDB to some specific file. Each time pipeline copied all of the data, it would do it each time without changes to all of the tables. Tables have a lot of data, and in the last MongoDB version, there is a possibility to read only changed data. This reduced the cost and configuration of the cluster, and we saved about $150,000. The solution is scalable. It's a stable product.

Quotes from Members

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:

Pros

Apache Hadoop

"It's open-source, so it's very cost-effective."

"The platform's quick data processing capabilities have been instrumental in supporting our AI-driven projects."

"We selected Apache Hadoop because it is not dependent on third-party vendors."

"Hadoop is extensible — it's elastic."

"It is a reliable product."

"They have integrated other tools as well, like Power BI and Oracle BI, both on Azure, for reporting. Oracle BI is difficult to integrate."

"The most valuable feature is the database."

"The scalability of Apache Hadoop is very good."

Apache Spark

"We use Spark to process data from different data sources."

"One of Apache Spark's most valuable features is that it supports in-memory processing, the execution of jobs compared to traditional tools is very fast."

"It provides a scalable machine learning library."

"Apache Spark provides a very high-quality implementation of distributed data processing."

"The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."

"With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware."

"Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data—customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark. Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more."

"I like that it can handle multiple tasks parallelly. I also like the automation feature. JavaScript also helps with the parallel streaming of the library."

Cons

Apache Hadoop

"The solution is not easy to use. The solution should be easy to use and suitable for almost any case connected with the use of big data or working with large amounts of data."

"The product's availability of comprehensive training materials could be improved for faster onboarding and skill development among team members."

"It needs better user interface (UI) functionalities."

"The main thing is the lack of community support. If you want to implement a new API or create a new file system, you won't find easy support."

"We would like to have more dynamics in merging this machine data with other internal data to make more meaning out of it."

"Since it is an open-source product, there won't be much support."

"The solution is very expensive."

"In certain cases, the configurations for dealing with data skewness do not make any sense."

Apache Spark

"At times during the deployment process, the tool goes down, making it look less robust. To take care of the issues in the deployment process, users need to do manual interventions occasionally."

"It's not easy to install."

"When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."

"The initial setup was not easy."

"In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."

"If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of D allocation."

"This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed."

"Apache Spark should add some resource management improvements to the algorithms."

Pricing and Cost Advice

Apache Hadoop

"The product is open-source, but some associated licensing fees depend on the subscription level."

"Do take into consideration that data storage and compute capacity scale differently, and hence purchasing a 'boxed' / 'all-in-one' solution (software and hardware) might not be the best idea."

"We don't directly pay for it. Our clients pay for it, and they usually don't complain about the price. So, it is probably acceptable."

"For any big enterprise the costs can be handled, and it is suitable for big enterprises because the scale of data is large. For medium and small enterprises, the tool is on the high-price side."

"The price of Apache Hadoop could be less expensive."

"The price could be better. Hortonworks no longer exists, and Cloudera killed the free version of Hadoop."

"It's reasonable, but there's room for improvement in cost-effectiveness."

"This is a low cost and powerful solution."

Apache Spark

"We are using the free version of the solution."

"Apache Spark is an expensive solution."

"The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks."

"The product is expensive, considering the setup."

"Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure. If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources."

"It is an open-source platform. We do not pay for its subscription."

"Apache Spark is an open-source tool."

"Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera."

Answers from the Community

it_user1272297

Special Adviser Strategy at a university with 501-1,000 employees

Apr 19, 2020

Which is the best RDBMS solution for big data?

Russell Rothstein

Founder and CEO at PeerSpot

Jan 27, 2020

Morten, the most popular comparisons of SQream can be found here: https://www.itcentralstation.com/products/sqream-db-alternatives-and-competitors The top ones include Cassandra, MemSQL, MongoDB, and Vertica.

reviewer1219965

Data Architect at a tech services company with 201-500 employees

Jan 27, 2020

I haven't used SQream personally. However, if you are only considering GPU based rdbms's please check the following https://hackernoon.com/which-gpu-database-is-right-for-me-6ceef6a17505

Top Industries

By visitors reading reviews

Apache Hadoop
Financial Services Firm: 34%
Computer Software Company: 11%
University
Energy/Utilities Company

Apache Spark
Financial Services Firm: 28%
Computer Software Company: 13%
Manufacturing Company
Comms Service Provider

Company Size

By reviewers: Large Enterprise, Midsize Enterprise, Small Business

Questions from the Community

What do you like most about Apache Hadoop?

It's primarily open source. You can handle huge data volumes and create your own views, workflows, and tables. I can also use it for real-time data streaming.

What is your experience regarding pricing and costs for Apache Hadoop?

The product is open-source, but some associated licensing fees depend on the subscription level. While it might be free for students, organizations typically need to pay for their subscriptions. Th...

What needs improvement with Apache Hadoop?

Hadoop lacks OLAP capabilities. I recommend adding a Delta Lake feature to make the data compatible with ACID properties. Also, video and audio streaming import issues could be improved to ensure p...

What do you like most about Apache Spark?

We use Spark to process data from different data sources.

What is your experience regarding pricing and costs for Apache Spark?

Compared to other solutions like Doc DB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud...

What needs improvement with Apache Spark?

The Spark solution could improve in scheduling tasks and managing dependencies. Spark alone cannot handle sequential tasks, requiring environments like Airflow scheduler or scripts. For instance, o...

Comparisons

Apache Hadoop
Oracle Exadata vs Apache Hadoop: compared 21% of the time
Azure Data Factory vs Apache Hadoop: compared 13% of the time
Teradata vs Apache Hadoop: compared 12% of the time
BigQuery vs Apache Hadoop: compared 11% of the time
Microsoft Azure Synapse Analytics vs Apache Hadoop: compared 7% of the time

Apache Spark
Spring Boot vs Apache Spark: compared 26% of the time
SAP HANA vs Apache Spark: compared 12% of the time
AWS Batch vs Apache Spark: compared 12% of the time
Cloudera Distribution for Hadoop vs Apache Spark: compared 8% of the time
Spark SQL vs Apache Spark: compared 7% of the time

Overview

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Apache

Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.

Apache
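The linear MapReduce dataflow described above (read input from disk, map a function across the data, reduce the results, store the output on disk) can be sketched in plain Python on a single machine. This is a minimal illustration only, not Hadoop's or Spark's actual API: the function names `map_phase`, `reduce_phase`, `run_job`, and `demo`, the word-count task, and the JSON output file are all assumptions made for the example.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Reduce: group the pairs by key and sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(input_path, output_path):
    """One MapReduce-style pass: read from disk, map, reduce, write to disk."""
    lines = Path(input_path).read_text().splitlines()
    result = reduce_phase(map_phase(lines))
    Path(output_path).write_text(json.dumps(result, sort_keys=True))
    return result


def demo():
    """Run the hypothetical word-count job on a small temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.txt"
        out = Path(tmp) / "counts.json"
        src.write_text("spark reads data\nhadoop reads data from disk\n")
        counts = run_job(src, out)
        # A follow-up job would have to load the stored result from disk again.
        stored = json.loads(out.read_text())
    return counts, stored
```

Each call to `run_job` goes through disk at both ends. That round trip is the cost an in-memory working set such as an RDD is meant to avoid when several passes reuse the same data.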

Sample Customers

Apache Hadoop: Amazon, Adobe, eBay, Facebook, Google, Hulu, IBM, LinkedIn, Microsoft, Spotify, AOL, Twitter, University of Maryland, Yahoo!, Cornell University Web Lab

Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions

We monitor all Data Warehouse reviews to prevent fraudulent reviews and keep review quality high. We do not post reviews by company employees or direct competitors. We validate each review for authenticity via cross-reference with LinkedIn, and personal follow-up with the reviewer when necessary.