
Apache Hadoop vs Apache Spark comparison

 

Comparison Buyer's Guide


Categories and Ranking

Apache Hadoop
Average Rating: 7.8
Number of Reviews: 39
Ranking in other categories: Data Warehouse (6th)

Apache Spark
Average Rating: 8.4
Reviews Sentiment: 7.7
Number of Reviews: 64
Ranking in other categories: Hadoop (1st), Compute Service (4th), Java Frameworks (2nd)
 

Mindshare comparison

Apache Hadoop and Apache Spark aren’t in the same category and serve different purposes. Apache Hadoop is ranked in the Data Warehouse category and holds a mindshare of 5.1%, down 6.2% compared to last year.
Apache Spark, on the other hand, is ranked in the Hadoop category and holds 18.2% mindshare, down 21.9% since last year.
 


Featured Reviews

Sushil Arya - PeerSpot reviewer
Provides ease of integration with the IT workflow of a business
When working with Kafka, I saw that the data arrived incrementally. Incremental data processing is still not very effective in Apache Hadoop. If the data is already there, it can be processed very effectively, but data that arrives every second, for example when you want to know the location of something every second, is not processed effectively. Apache Hadoop should look into its capability for processing incremental data. Another area where improvement is required is the licensing cost of the tool. If the tool offered a pay-per-use licensing structure, similar to how AWS operates, organizations could get the look and feel of Apache Hadoop. The setup process also has scope for improvement. It is not very simple, because during setup we need to configure all the server settings, including port listing and firewall configurations. Compared with other products on the market, it could be made simpler. Technical support has shortcomings as well: the time frame for resolution and the overall communication of the support team both need improvement.
SurjitChoudhury - PeerSpot reviewer
Offers batch processing of data and in-memory processing in Spark greatly enhances performance
Spark supports both real-time and batch data processing. If you have immediate data, like chat information, that needs to be processed in real time, Spark Streaming is used. For data that can be evaluated later, batch processing with Apache Spark is suitable. Our organization mostly uses batch processing, and for streaming data processing, tools like Kafka are often integrated. In-memory processing in Spark greatly enhances performance, making it up to a hundred times faster than the previous MapReduce methods. This improvement is achieved through optimization techniques like caching, broadcasting, and partitioning, which help optimize queries for faster processing.
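The review above names caching, broadcasting, and partitioning without showing what they do. As a rough illustration only, the plain-Python sketch below mimics two of them: hash partitioning of a large dataset and a broadcast join against a small lookup table. It does not use the Spark API; the function names (hash_partition, broadcast_join) and the sample data are hypothetical and chosen for this example.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Split rows into num_partitions buckets by hashing the key column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % num_partitions].append(row)
    # Return a fixed-length list so empty partitions are still represented.
    return [buckets[i] for i in range(num_partitions)]


def broadcast_join(partitions, small_table, key):
    """Join every partition against a small table that is copied to all of them."""
    # The "broadcast" copy: a dictionary built once from the small table.
    lookup = {row[key]: row for row in small_table}
    joined = []
    # In a real cluster each partition would be handled by a separate worker.
    for part in partitions:
        for row in part:
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**row, **match})
    return joined


# Hypothetical sample data: six events spread over three user IDs.
events = [{"user_id": i % 3, "msg": f"m{i}"} for i in range(6)]
users = [{"user_id": 0, "name": "a"}, {"user_id": 1, "name": "b"}]

parts = hash_partition(events, "user_id", num_partitions=2)
result = broadcast_join(parts, users, "user_id")
```

In Spark itself, a broadcast join follows the same idea: the small table is shipped to every executor so that the large table does not have to be shuffled across the network.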

Quotes from Members

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Pros

"The scalability of Apache Hadoop is very good."
"Hadoop can store any kind of data—structured, unstructured, and semi-structured—and presents it using the relational model through Hive."
"High throughput and low latency. We start with data mashing on Hive and finally use this for KPI visualization."
"The platform's quick data processing capabilities have been instrumental in supporting our AI-driven projects."
"The most valuable feature is the database."
"Since both Apache Hadoop and Amazon EC2 are elastic in nature, we can scale and expand on demand for a specific PoC, and scale down when it's done."
"Hadoop File System is compatible with almost all the query engines."
"The tool's stability is good."
"Provides a lot of good documentation compared to other solutions."
"DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort."
"We use it for ETL purposes as well as for implementing the full transformation pipelines."
"With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware."
"The good performance. The nice graphical management console. The long list of ML algorithms."
"The most valuable feature of Apache Spark is its flexibility."
"We use Spark to process data from different data sources."
"There's a lot of functionality."
 

Cons

"There are certain shortcomings when it comes to the product's technical support part, making it an area where improvements are required."
"Improvements in security measures would be beneficial, given the large volumes of data handled."
"The integration of Apache Hadoop with lots of different techniques within your business can be a challenge."
"In the next release, I would like to see Hive more responsive for smaller queries and to reduce the latency."
"I would like to see more direct integration of visualization applications."
"This is probably the only feature that could be improved a little bit, because the terminal and coding screen on Hadoop is a little outdated, and it looks like an old C++ BIOS screen. If the UI and UX can be improved slightly, I believe it will go a long way toward increasing adoption and effectiveness."
"General installation/dependency issues were there, but were not a major, complex issue. While migrating data from MySQL to Hive, things are a little challenging, but we were able to get through that with support from forums and a little trial and error."
"The upgrade path should be improved because it is not as easy as it should be."
"The solution’s integration with other platforms should be improved."
"Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."
"Needs to provide an internal scheduler to schedule Spark jobs with monitoring capability."
"Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing."
"When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."
"Stability in terms of API (things were difficult, when transitioning from RDD to DataFrames, then to DataSet)."
"Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors."
"Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial."
 

Pricing and Cost Advice

"Take into consideration that data storage and compute capacity scale differently, so purchasing a 'boxed' or 'all-in-one' solution (software and hardware) might not be the best idea."
"It's reasonable, but there's room for improvement in cost-effectiveness."
"We don't directly pay for it. Our clients pay for it, and they usually don't complain about the price. So, it is probably acceptable."
"This is a low cost and powerful solution."
"The price of Apache Hadoop could be less expensive."
"The product is open-source, but some associated licensing fees depend on the subscription level."
"There are no licensing costs involved, hence money is saved on the software infrastructure."
"If my company can use the cloud version of Apache Hadoop, particularly the cloud storage feature, it would be easier and would cost less because an on-premises deployment has a higher cost during storage, for example, though I don't know exactly how much Apache Hadoop costs."
"It is an open-source solution, it is free of charge."
"They provide an open-source license for the on-premise version."
"Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera."
"It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project."
"Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure. If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources."
"It is an open-source platform. We do not pay for its subscription."
"I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten."
"Spark is an open-source solution, so there are no licensing costs."
816,406 professionals have used our research since 2012.
 

Answers from the Community

it_user1272297 - PeerSpot reviewer
Apr 19, 2020
I haven't used SQream personally. However, if you are only considering GPU-based RDBMSs, please check the following: https://hackernoon.com/which-gpu-database-is-right-for-me-6ceef6a17505
Russell Rothstein - PeerSpot reviewer
Jan 27, 2020
Morten, the most popular comparisons of SQream can be found here: https://www.itcentralstation.com/products/sqream-db-alternatives-and-competitors The top ones include Cassandra, MemSQL, MongoDB, and Vertica.
reviewer1219965 - PeerSpot reviewer
Jan 27, 2020
I haven't used SQream personally. However, if you are only considering GPU-based RDBMSs, please check the following: https://hackernoon.com/which-gpu-database-is-right-for-me-6ceef6a17505
 

Top Industries

By visitors reading reviews
Financial Services Firm: 32%
Computer Software Company: 11%
University: 7%
Energy/Utilities Company: 6%

Financial Services Firm: 27%
Computer Software Company: 13%
Manufacturing Company: 8%
University: 5%
 

Company Size

By reviewers: Large Enterprise, Midsize Enterprise, Small Business
 

Questions from the Community

What do you like most about Apache Hadoop?
It's primarily open source. You can handle huge data volumes and create your own views, workflows, and tables. I can also use it for real-time data streaming.
What is your experience regarding pricing and costs for Apache Hadoop?
The product is open-source, but some associated licensing fees depend on the subscription level. While it might be free for students, organizations typically need to pay for their subscriptions. Th...
What needs improvement with Apache Hadoop?
Hadoop lacks OLAP capabilities. I recommend adding a Delta Lake feature to make the data compatible with ACID properties. Also, video and audio streaming import issues could be improved to ensure p...
What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Compared to other solutions like Doc DB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud...
What needs improvement with Apache Spark?
The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary. Conseque...
 


Sample Customers

Apache Hadoop: Amazon, Adobe, eBay, Facebook, Google, Hulu, IBM, LinkedIn, Microsoft, Spotify, AOL, Twitter, University of Maryland, Yahoo!, Cornell University Web Lab
Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Find out what your peers are saying about Snowflake Computing, Oracle, Teradata and others in Data Warehouse. Updated: October 2024.