Apache Spark vs Cloudera Distribution for Hadoop comparison

 

Comparison Buyer's Guide

Executive Summary
 

Categories and Ranking

Apache Spark
Ranking in Hadoop: 1st
Average Rating: 8.4
Reviews Sentiment: 7.7
Number of Reviews: 64
Ranking in other categories: Compute Service (4th), Java Frameworks (2nd)

Cloudera Distribution for H...
Ranking in Hadoop: 2nd
Average Rating: 8.0
Reviews Sentiment: 6.4
Number of Reviews: 49
Ranking in other categories: NoSQL Databases (8th)
 

Mindshare comparison

As of December 2024, Apache Spark's mindshare in the Hadoop category is 18.0%, down from 21.8% a year earlier. The mindshare of Cloudera Distribution for Hadoop is 28.2%, up from 23.1% a year earlier. Mindshare is calculated from PeerSpot user engagement data.
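
The year-over-year shift in these mindshare figures can be read either in percentage points or as a relative change. The sketch below works out both from the four numbers quoted above. The helper name `mindshare_change` and the dictionary layout are illustrative assumptions, not something PeerSpot publishes.

```python
# Mindshare figures quoted above (percent of the Hadoop category),
# December 2024 versus the previous year.
mindshare = {
    "Apache Spark": {"previous": 21.8, "current": 18.0},
    "Cloudera Distribution for Hadoop": {"previous": 23.1, "current": 28.2},
}


def mindshare_change(previous, current):
    """Return (percentage-point change, relative change in percent).

    Hypothetical helper for illustration; both values are rounded to one decimal.
    """
    points = round(current - previous, 1)
    relative = round((current - previous) / previous * 100, 1)
    return points, relative


changes = {
    name: mindshare_change(values["previous"], values["current"])
    for name, values in mindshare.items()
}

for name, (points, relative) in changes.items():
    print(f"{name}: {points:+.1f} points ({relative:+.1f}% relative)")
```

Read this way, Apache Spark lost 3.8 points of mindshare while Cloudera Distribution for Hadoop gained 5.1 points over the same period.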
 

Featured Reviews

SurjitChoudhury - PeerSpot reviewer
Offers batch processing of data and in-memory processing in Spark greatly enhances performance
Spark supports real-time data processing through Spark Streaming. It allows for batch processing of data. If you have immediate data, like chat information, that needs to be processed in real-time, Spark Streaming is used. For data that can be evaluated later, batch processing with Apache Spark is suitable. Mostly, batch processing is utilized in our organization, but for streaming data processing, tools like Kafka are often integrated. In-memory processing in Spark greatly enhances performance, making it a hundred times faster than the previous MapReduce methods. This improvement is achieved through optimization techniques like caching, broadcasting, and partitioning, which help in optimizing queries for faster processing.
Shahan Rehman - PeerSpot reviewer
Can host multiple technologies and help businesses with their AI initiatives
The ease or difficulty in setting up the product depends on the environment of the customer where the tool is deployed. If a banking, industrial, or retail sector firm is taken into consideration, depending on how big of a database is maintained, including the applications that are to be hosted, the deployment process can range from a simple to a very complex phase, depending on the architecture. For Cloudera Distribution for Hadoop, one has to go through the usual deployment process, like for any software product. You have to have different environments before going into production, like pre-production environments, test and dev environments. You install and configure all the components in the test environment and then test them on the pre-production environment. Once UAT is done, you move them to the production environment. In general, it's a critical product deployed in a company.

Quotes from Members

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Pros

"I like that it can handle multiple tasks parallelly. I also like the automation feature. JavaScript also helps with the parallel streaming of the library."
"The most significant advantage of Spark 3.0 is its support for DataFrame UDF Pandas UDF features."
"The distribution of tasks, like the seamless map-reduce functionality, is quite impressive."
"The solution is scalable."
"This solution provides a clear and convenient syntax for our analytical tasks."
"It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."
"The product's initial setup phase was easy."
"The good performance. The nice graphical management console. The long list of ML algorithms."
"The product provides better data processing features than other tools."
"It has the best proxy, security, and support features compared to open-source products."
"Provides a viable open-source solution for enterprise implementations and reliable, intelligent data analysis."
"The solution is stable."
"CDH has a wide variety of proprietary tools that we use, like Impala. So from that perspective, it's quite useful as opposed to something open-source. We get a lot of value from Cloudera's proprietary tools."
"The main advantage is the storage is less expensive."
"The solution is reliable and stable, it fits our requirements."
"The solution's most valuable feature is the enterprise data platform."
 

Cons

"For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial."
"Apache Spark should add some resource management improvements to the algorithms."
"It should support more programming languages."
"The migration of data between different versions could be improved."
"One limitation is that not all machine learning libraries and models support it."
"The setup I worked on was really complex."
"It requires overcoming a significant learning curve due to its robust and feature-rich nature."
"From my perspective, the only thing that needs improvement is the interface, as it was not easily understandable."
"The areas of improvement depend on the scale of the project. For banking customers, security features and an essential budget for commercial licenses would be the top priority. Data regulation could be the most crucial for a project with extensive data or an extra use case."
"We experienced many issues when we started working with Hadoop 3.0 in the Cloudera 6.0 version, so there is a lot of things that need to improve."
"I would like to see an improvement in how the solution helps me to handle the whole cluster."
"Without the big data environment, we cannot store all of this data live. We have billions of records and terabytes of storage to be used. It's not an option actually for us to have a big data environment."
"There are multiple bugs when we update."
"The solution is not fit for on-premise distributions."
"While the deployed product is generally functional, there are instances where it presents difficulties."
"Currently, we are using many other tools such as Spark and Blade Job to improve the performance."
 

Pricing and Cost Advice

"I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten."
"It is an open-source platform. We do not pay for its subscription."
"Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises."
"It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project."
"They provide an open-source license for the on-premise version."
"The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks."
"Apache Spark is an expensive solution."
"On the cloud model can be expensive as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing."
"When comparing with Oracle Sybase and SQL, it's cheaper. It's not expensive."
"The price could be better for the product."
"The tool is not expensive."
"The solution is fairly expensive."
"It is an expensive product."
"Cloudera Distribution for Hadoop is expensive, with support costs involved."
"I wouldn't recommend CDH to others because of its high cost."
"I haven't bought a license for this solution. I'm only using the Apache license version."
824,053 professionals have used our research since 2012.
 

Top Industries

By visitors reading reviews

Financial Services Firm: 27%
Computer Software Company: 13%
Manufacturing Company: 8%
Retailer: 5%

Financial Services Firm: 23%
Computer Software Company: 15%
Educational Organization: 11%
Manufacturing Company: 8%
 

Company Size

By reviewers: Large Enterprise, Midsize Enterprise, Small Business
 

Questions from the Community

What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What is your experience regarding pricing and costs for Apache Spark?
Compared to other solutions like Doc DB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud...
What needs improvement with Apache Spark?
The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary. Conseque...
What do you like most about Cloudera Distribution for Hadoop?
The tool can be deployed using different container technologies, which makes it very scalable.
What is your experience regarding pricing and costs for Cloudera Distribution for Hadoop?
The tool is expensive. Overall, it's not a cheap software tool, and that is why only large enterprises who are mature enough and have an architecture that is complex enough opt for Cloudera, as its...
What needs improvement with Cloudera Distribution for Hadoop?
The tool doesn't support reporting, and relational databases are still the major source of reporting data. Apache Iceberg will be launched soon within the Cloudera cluster for analytical purposes. ...
 


Sample Customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
37signals, Adconion, adgooroo, Aggregate Knowledge, AMD, Apollo Group, Blackberry, Box, BT, CSC
Find out what your peers are saying about Apache Spark vs. Cloudera Distribution for Hadoop and other solutions. Updated: December 2024.