
Amazon EMR vs Apache Spark comparison

 

Comparison Buyer's Guide

Executive Summary
 

Categories and Ranking

Amazon EMR
Ranking in Hadoop: 3rd
Average Rating: 7.8
Reviews Sentiment: 7.2
Number of Reviews: 22
Ranking in other categories: Cloud Data Warehouse (11th)

Apache Spark
Ranking in Hadoop: 1st
Average Rating: 8.4
Reviews Sentiment: 7.7
Number of Reviews: 64
Ranking in other categories: Compute Service (4th), Java Frameworks (2nd)
 

Mindshare comparison

As of December 2024, in the Hadoop category, Amazon EMR's mindshare is 14.7%, down from 18.7% a year earlier. Apache Spark's mindshare is 18.0%, down from 21.8% a year earlier. Mindshare is calculated from PeerSpot user engagement data.
 

Featured Reviews

Prashant Singh - PeerSpot reviewer
Easy to manage and reliable but the cost is hard to control
The cost is increasing. We are looking into how we can optimize the cost part of EMR. We're doing a comparison between Cloudera running on AWS and running AWS EMR. We don't have much control. If we have multiple users, if they want to scale up, the cost will go and increase and we don't know how we can restrict that price part.
SurjitChoudhury - PeerSpot reviewer
Offers batch processing of data and in-memory processing in Spark greatly enhances performance
Spark supports real-time data processing through Spark Streaming. It allows for batch processing of data. If you have immediate data, like chat information, that needs to be processed in real-time, Spark Streaming is used. For data that can be evaluated later, batch processing with Apache Spark is suitable. Mostly, batch processing is utilized in our organization, but for streaming data processing, tools like Kafka are often integrated. In-memory processing in Spark greatly enhances performance, making it a hundred times faster than the previous MapReduce methods. This improvement is achieved through optimization techniques like caching, broadcasting, and partitioning, which help in optimizing queries for faster processing.
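The review above credits broadcasting and partitioning for faster queries. As a rough illustration of the idea only, the plain-Python sketch below mimics a broadcast join: a small lookup table is copied to every partition, so the large dataset is joined locally without being reshuffled. It does not use Spark's API; `hash_partition`, `broadcast_join`, and the sample data are hypothetical names and values chosen for this example.

```python
def hash_partition(rows, key, num_partitions):
    """Split rows into partitions so that equal keys land in the same one."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def broadcast_join(partitions, small_table, key):
    """Join each partition against a full copy of the small table."""
    # The "broadcast" copy: every partition reads the same lookup dict,
    # so rows of the large dataset never move between partitions.
    lookup = {row[key]: row for row in small_table}
    joined = []
    for part in partitions:  # in a cluster, each partition runs on its own worker
        for row in part:
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**row, **match})
    return joined


# Hypothetical sample data: a larger event log and a small user table.
events = [
    {"user_id": 1, "message": "hi"},
    {"user_id": 2, "message": "hello"},
    {"user_id": 1, "message": "bye"},
    {"user_id": 3, "message": "ping"},
]
users = [{"user_id": 1, "name": "Ana"}, {"user_id": 2, "name": "Bo"}]

partitions = hash_partition(events, "user_id", num_partitions=2)
result = broadcast_join(partitions, users, "user_id")
for row in result:
    print(row)
```

The sketch only pays off when one side of the join is small enough to copy everywhere; when both sides are large, repartitioning both datasets by the join key is the usual alternative.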

Quotes from Members

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Pros

"The solution is pretty simple to set up."
"When we grade big jobs from on-prem to the cloud, we do it in EMR with Spark."
"One of the valuable features about this solution is that it's managed services, so it's pretty stable, and scalable as much as you wish. It has all the necessary distributions. With some additional work, it's also possible to change to a Spark version with the latest version of EMR. It also has Hudi, so we are leveraging Apache Hudi on EMR for change data capture, so then it comes out-of-the-box in EMR."
"The initial setup is pretty straightforward."
"The ability to resize the cluster is what really makes it stand out over other Hadoop and big data solutions."
"It allows users to access the data through a web interface."
"Amazon EMR has multiple connectors that can connect to various data sources."
"In Amazon EMR it is easy to rebuild anything, easy to upgrade and has good fault tolerance."
"The solution is scalable."
"Apache Spark can do large volume interactive data analysis."
"With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware."
"The main feature that we find valuable is that it is very fast."
"The most valuable feature of Apache Spark is its flexibility."
"There's a lot of functionality."
"The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations."
"I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten."
 

Cons

"We don't have much control. If we have multiple users, if they want to scale up, the cost will go and increase and we don't know how we can restrict that price part."
"The problem for us is it starts very slow."
"The solution can become expensive if you are not careful."
"Spark jobs take longer on Amazon EMR compared to previous experiences."
"The most complicated thing is configuring to the cluster and ensure it's running correctly."
"The product's features for storing data in static clusters could be better."
"The initial setup was time-consuming."
"The dashboard management could be better. Right now, it's lacking a bit."
"The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."
"Stability in terms of API (things were difficult, when transitioning from RDD to DataFrames, then to DataSet)."
"Apache Spark lacks geospatial data."
"Apache Spark's GUI and scalability could be improved."
"They could improve the issues related to programming language for the platform."
"Apache Spark provides very good performance. The tuning phase is still tricky."
"The logging for the observability platform could be better."
"The migration of data between different versions could be improved."
 

Pricing and Cost Advice

"The product is not cheap, but it is not expensive."
"There is no need to pay extra for third-party software."
"I rate the tool's pricing a five out of ten. It can be expensive since it's a managed service, and if you are not careful, you can run into unexpected charges. You can make a mistake that costs you tens of thousands of dollars. That's happened to us twice, so I'm sensitive to it. We're still trying to work on that. Our smallest client probably spends a hundred thousand dollars yearly on licensing, while our largest is well over a million."
"There is a small fee for the EMR system, but major cost components are the underlying infrastructure resources which we actually use."
"Amazon EMR is not very expensive."
"The cost of Amazon EMR is very high."
"The price of the solution is expensive."
"You don't need to pay for licensing on a yearly or monthly basis, you only pay for what you use, in terms of underlying instances."
"Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure. If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources."
"Apache Spark is an expensive solution."
"Spark is an open-source solution, so there are no licensing costs."
"The product is expensive, considering the setup."
"It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project."
"It is an open-source platform. We do not pay for its subscription."
"Apache Spark is an open-source tool."
"On the cloud model can be expensive as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing."
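Several of the comments above describe the EMR bill as a small service fee on top of the underlying instances, charged only for what is used. A back-of-the-envelope estimate of that model might look like the sketch below. The hourly rates are made-up placeholders, not actual AWS prices, and `estimate_cluster_cost` and `node_hours_within_budget` are hypothetical helper names.

```python
def estimate_cluster_cost(nodes, hours, instance_rate, emr_rate):
    """Return (infrastructure cost, EMR fee, total) for one cluster run.

    instance_rate and emr_rate are per-node hourly prices (assumed values).
    """
    node_hours = nodes * hours
    infrastructure = node_hours * instance_rate
    emr_fee = node_hours * emr_rate
    return infrastructure, emr_fee, infrastructure + emr_fee


def node_hours_within_budget(budget, instance_rate, emr_rate):
    """Return how many whole node-hours fit inside a spending cap."""
    return int(budget / (instance_rate + emr_rate))


# Placeholder rates for illustration only.
INSTANCE_RATE = 0.20  # assumed hourly price of one instance
EMR_RATE = 0.05       # assumed hourly EMR fee per instance

infra, fee, total = estimate_cluster_cost(
    nodes=10, hours=8, instance_rate=INSTANCE_RATE, emr_rate=EMR_RATE
)
print(f"infrastructure={infra:.2f} emr_fee={fee:.2f} total={total:.2f}")

cap = node_hours_within_budget(100, INSTANCE_RATE, EMR_RATE)
print(f"node-hours available under a budget of 100: {cap}")
```

Because the total scales with node-hours, a user who doubles a cluster's size doubles the bill, which is the control problem some of the reviewers raise.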
824,067 professionals have used our research since 2012.
 

Top Industries

By visitors reading reviews
Financial Services Firm: 25%
Computer Software Company: 13%
Manufacturing Company: 9%
Educational Organization: 7%

Financial Services Firm: 27%
Computer Software Company: 13%
Manufacturing Company: 8%
Retailer: 5%
 

Company Size

[Chart: company size by reviewers (Large Enterprise, Midsize Enterprise, Small Business)]
 

Questions from the Community

What do you like most about Amazon EMR?
Amazon EMR is a good solution that can be used to manage big data.
What is your experience regarding pricing and costs for Amazon EMR?
The cost of Amazon EMR is a little bit expensive, especially considering the support package, which includes a gold package.
What needs improvement with Amazon EMR?
Spark jobs take longer on Amazon EMR compared to previous experiences. This aspect could be improved to make them more efficient.
What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Compared to other solutions like Doc DB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud...
What needs improvement with Apache Spark?
The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary. Conseque...
 

Also Known As

Amazon EMR: Amazon Elastic MapReduce
Apache Spark: No data available
 

Sample Customers

Amazon EMR: Yelp
Apache Spark: NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
Find out what your peers are saying about Amazon EMR vs. Apache Spark and other solutions. Updated: December 2024.