
Apache Hadoop vs Apache Spark comparison

 

Comparison Buyer's Guide


Categories and Ranking

Apache Hadoop
Average Rating: 7.8
Reviews Sentiment: 6.8
Number of Reviews: 39
Ranking in other categories: Data Warehouse (6th)

Apache Spark
Average Rating: 8.4
Reviews Sentiment: 7.7
Number of Reviews: 64
Ranking in other categories: Hadoop (1st), Compute Service (4th), Java Frameworks (2nd)
 

Mindshare comparison

Apache Hadoop and Apache Spark aren’t in the same category and serve different purposes. Apache Hadoop is ranked in the Data Warehouse category and holds a mindshare of 5.2%, down 6.2% compared to last year.
Apache Spark, on the other hand, is ranked in the Hadoop category and holds a mindshare of 18.0%, down 21.8% since last year.
 

 

Featured Reviews

Sushil Arya - PeerSpot reviewer
Provides ease of integration with the IT workflow of a business
When working with Kafka, I saw that the data came in an incremental order. The incremental data processing part is still not very effective in Apache Hadoop. If the data is already there, it can be processed very effectively, especially if the data is coming in every second. If you want to know the location of some data every second, then such data is not processed effectively in Apache Hadoop. I can say that one of the features where improvements are required revolves around the licensing cost of the tool. If the tool can build some licensing structures in a pay-per-use manner, organizations can get the look and feel of Apache Hadoop. Apache Hadoop can offer a licensing structure of the product that can be seen as similar to how AWS operates. Apache Hadoop can look into the capability of processing incremental data. The tool's setup process can be a scope of improvement. Also, it is not very simple because while doing the setup, we need to do all the server settings, including port listing and firewall configurations. If we look at other products on the market, then they can be made simpler. There are certain shortcomings when it comes to the product's technical support part, making it an area where improvements are required. The time frame for the resolution is an area that needs to be improved. The overall communication part of the technical support team also needs improvement.
SurjitChoudhury - PeerSpot reviewer
Offers batch processing of data and in-memory processing in Spark greatly enhances performance
Spark supports real-time data processing through Spark Streaming. It allows for batch processing of data. If you have immediate data, like chat information, that needs to be processed in real-time, Spark Streaming is used. For data that can be evaluated later, batch processing with Apache Spark is suitable. Mostly, batch processing is utilized in our organization, but for streaming data processing, tools like Kafka are often integrated. In-memory processing in Spark greatly enhances performance, making it a hundred times faster than the previous MapReduce methods. This improvement is achieved through optimization techniques like caching, broadcasting, and partitioning, which help in optimizing queries for faster processing.
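The caching and broadcasting this reviewer mentions can be illustrated without a cluster. What follows is a minimal plain-Python sketch of the idea, not Spark's API: a small lookup table is copied ("broadcast") to every partition so that each partition can join locally, and an aggregate is cached in memory so a second request does not recompute it. The sample data, the `partition` and `broadcast_join` helpers, and every other name here are hypothetical.

```python
from functools import lru_cache

# Hypothetical data: a larger "fact" list and a small lookup table.
CHATS = [("c1", "us"), ("c2", "de"), ("c3", "us"), ("c4", "fr")]
COUNTRY_NAMES = {"us": "United States", "de": "Germany", "fr": "France"}


def partition(rows, num_partitions):
    """Split rows round-robin into num_partitions lists."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def broadcast_join(parts, small_table):
    """Join each partition against its own copy of the small table.

    Copying the small table to every partition mimics a broadcast:
    no partition has to exchange rows with another to do the join.
    """
    joined = []
    for part in parts:
        local_copy = dict(small_table)  # the "broadcast" copy
        joined.append([(cid, local_copy[code]) for cid, code in part])
    return joined


compute_calls = 0  # counts how often the aggregate is really computed


@lru_cache(maxsize=None)
def chats_per_country():
    """Aggregate that is kept in memory after the first call."""
    global compute_calls
    compute_calls += 1
    counts = {}
    for part in broadcast_join(partition(CHATS, 2), COUNTRY_NAMES):
        for _, country in part:
            counts[country] = counts.get(country, 0) + 1
    return tuple(sorted(counts.items()))


first = chats_per_country()
second = chats_per_country()  # served from the cache, not recomputed
```

In Spark itself the corresponding calls would be `DataFrame.cache()` and `pyspark.sql.functions.broadcast`; this sketch only mirrors the concept on a single machine.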

Quotes from Members

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Pros

"One valuable feature is that we can download data."
"It is a file system for data collection. There are nodes in this cluster that contain all the information, directories, and other files. The nodes are based on the MySQL database."
"I recommend it for the telecom sector. I know it well, and it's a good fit."
"Since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done."
"The most valuable feature is the database."
"The ability to add multiple nodes without any restriction is the solution's most valuable aspect."
"The most valuable features are the ability to process the machine data at a high speed, and to add structure to our data so that we can generate relevant analytics."
"Data ingestion: It has rapid speed, if Apache Accumulo is used."
"One of the key features is that Apache Spark is a distributed computing framework. You can help multiple slaves and distribute the workload between them."
"The fault tolerant feature is provided."
"The solution has been very stable."
"The deployment of the product is easy."
"The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
"The most significant advantage of Spark 3.0 is its support for DataFrame UDF Pandas UDF features."
"There's a lot of functionality."
"Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data—customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark. Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more."
 

Cons

"What could be improved in Apache Hadoop is its user-friendliness. It's not that user-friendly, but maybe it's because I'm new to it. Sometimes it feels so tough to use, but it could be because of two aspects: one is my incompetency, for example, I don't know about all the features of Apache Hadoop, or maybe it's because of the limitations of the platform. For example, my team is maintaining the business glossary in Apache Atlas, but if you want to change any settings at the GUI level, an advanced level of coding or programming needs to be done in the back end, so it's not user-friendly."
"I think more of the solution needs to be focused around the panel processing and retrieval of data."
"The key shortcoming is its inability to handle queries when there is insufficient memory. This limitation can be bypassed by processing the data in chunks."
"It would be helpful to have more information on how to best apply this solution to smaller organizations, with less data, and grow the data lake."
"The integration with Apache Hadoop with lots of different techniques within your business can be a challenge."
"The solution is not easy to use. The solution should be easy to use and suitable for almost any case connected with the use of big data or working with large amounts of data."
"There are certain shortcomings when it comes to the product's technical support part, making it an area where improvements are required."
"It needs better user interface (UI) functionalities."
"The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive."
"When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."
"Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors."
"We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data."
"We are building our own queries on Spark, and it can be improved in terms of query handling."
"It's not easy to install."
"From my perspective, the only thing that needs improvement is the interface, as it was not easily understandable."
"One limitation is that not all machine learning libraries and models support it."
 
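One of the cons above notes that running out of memory can be worked around by processing the data in chunks. A minimal plain-Python sketch of that idea follows. It is not a Hadoop or Spark API; the `iter_chunks` and `chunked_mean` helpers, the chunk size, and the sample data are assumptions made for illustration.

```python
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no values to average")
    return total / count


# A generator stands in for a data source too large to load at once.
readings = (x for x in range(1, 1001))
mean = chunked_mean(readings, chunk_size=100)
```

The running total and count are the only state carried between chunks, so peak memory depends on the chunk size rather than on the size of the whole input.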

Pricing and Cost Advice

"We don't directly pay for it. Our clients pay for it, and they usually don't complain about the price. So, it is probably acceptable."
"If my company can use the cloud version of Apache Hadoop, particularly the cloud storage feature, it would be easier and would cost less because an on-premises deployment has a higher cost during storage, for example, though I don't know exactly how much Apache Hadoop costs."
"Do take into consideration that data storage and compute capacity scale differently, and hence purchasing a "boxed" / "all-in-one" solution (software and hardware) might not be the best idea."
"The price of Apache Hadoop could be less expensive."
"The price could be better. Hortonworks no longer exists, and Cloudera killed the free version of Hadoop."
"This is a low cost and powerful solution."
"The product is open-source, but some associated licensing fees depend on the subscription level."
"We just use the free version."
"I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten."
"Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."
"Apache Spark is an open-source tool."
"Spark is an open-source solution, so there are no licensing costs."
"We are using the free version of the solution."
"Apache Spark is an expensive solution."
"The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks."
"It is an open-source platform. We do not pay for its subscription."
824,067 professionals have used our research since 2012.
 

Answers from the Community

it_user1272297 - PeerSpot reviewer
Apr 19, 2020
I haven't used SQream personally. However, if you are only considering GPU based rdbms's please check the following https://hackernoon.com/which-gpu-database-is-right-for-me-6ceef6a17505
2 out of 4 answers
Russell Rothstein - PeerSpot reviewer
Jan 27, 2020
Morten, the most popular comparisons of SQream can be found here: https://www.itcentralstation.com/products/sqream-db-alternatives-and-competitors The top ones include Cassandra, MemSQL, MongoDB, and Vertica.
reviewer1219965 - PeerSpot reviewer
Jan 27, 2020
I haven't used SQream personally. However, if you are only considering GPU based rdbms's please check the following https://hackernoon.com/which-gpu-database-is-right-for-me-6ceef6a17505
 

Top Industries

By visitors reading reviews
Financial Services Firm: 34%
Computer Software Company: 10%
University: 7%
Energy/Utilities Company: 6%

Financial Services Firm: 27%
Computer Software Company: 13%
Manufacturing Company: 8%
Retailer: 5%
 

Company Size

By reviewers: Large Enterprise, Midsize Enterprise, Small Business
 

Questions from the Community

What do you like most about Apache Hadoop?
It's primarily open source. You can handle huge data volumes and create your own views, workflows, and tables. I can also use it for real-time data streaming.
What is your experience regarding pricing and costs for Apache Hadoop?
The product is open-source, but some associated licensing fees depend on the subscription level. While it might be free for students, organizations typically need to pay for their subscriptions. Th...
What needs improvement with Apache Hadoop?
Hadoop lacks OLAP capabilities. I recommend adding a Delta Lake feature to make the data compatible with ACID properties. Also, video and audio streaming import issues could be improved to ensure p...
What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Compared to other solutions like Doc DB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud...
What needs improvement with Apache Spark?
The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary. Conseque...
 

Sample Customers

Amazon, Adobe, eBay, Facebook, Google, Hulu, IBM, LinkedIn, Microsoft, Spotify, AOL, Twitter, University of Maryland, Yahoo!, Cornell University Web Lab
NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Find out what your peers are saying about Snowflake Computing, Oracle, Teradata and others in Data Warehouse. Updated: November 2024.