Apache Spark vs Cloudera Distribution for Hadoop comparison

 

Comparison Buyer's Guide

Categories and Ranking

Apache Spark
Ranking in Hadoop: 1st
Average Rating: 8.4
Number of Reviews: 60
Ranking in other categories: Compute Service (5th), Java Frameworks (2nd)

Cloudera Distribution for H...
Ranking in Hadoop: 2nd
Average Rating: 8.0
Number of Reviews: 47
Ranking in other categories: NoSQL Databases (5th)
 

Market share comparison

As of June 2024, in the Hadoop category, Apache Spark holds a 14.3% market share, down 38.9% from the previous year. Cloudera Distribution for Hadoop holds a 31.2% market share, up 31.2% from the previous year. Market share is calculated from PeerSpot user engagement data.
Hadoop
Unique Categories:
Compute Service: 11.0%
Java Frameworks: 6.5%
NoSQL Databases: 2.0%
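
As a rough check on the market share figures above, the following Python sketch back-calculates the share each product would have held a year earlier. It assumes the quoted changes are relative changes rather than percentage-point changes, which the source does not state. The function name `implied_previous_share` is illustrative only.

```python
def implied_previous_share(current_share, relative_change):
    """Return the prior-year share implied by a current share and a relative change.

    relative_change is a fraction: -0.389 means a 38.9% relative decrease.
    Assumption: the published change is relative, not in percentage points.
    """
    return current_share / (1.0 + relative_change)


# Figures quoted in the market share paragraph (percent of the Hadoop category).
spark_prev = implied_previous_share(14.3, -0.389)
cloudera_prev = implied_previous_share(31.2, 0.312)

print(f"Apache Spark, implied share a year earlier: {spark_prev:.1f}%")
print(f"Cloudera, implied share a year earlier: {cloudera_prev:.1f}%")
```

If the changes were instead meant as percentage points, the prior-year values would be obtained by simple subtraction, so the interpretation should be confirmed before relying on either result.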
 

Featured Reviews

SurjitChoudhury - PeerSpot reviewer
Feb 20, 2024
Offers batch processing of data and in-memory processing in Spark greatly enhances performance
Spark supports real-time data processing through Spark Streaming. It allows for batch processing of data. If you have immediate data, like chat information, that needs to be processed in real-time, Spark Streaming is used. For data that can be evaluated later, batch processing with Apache Spark is suitable. Mostly, batch processing is utilized in our organization, but for streaming data processing, tools like Kafka are often integrated. In-memory processing in Spark greatly enhances performance, making it a hundred times faster than the previous MapReduce methods. This improvement is achieved through optimization techniques like caching, broadcasting, and partitioning, which help in optimizing queries for faster processing.
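
The three optimization techniques the reviewer names (partitioning, broadcasting, and caching) can be illustrated without a cluster. The following is a plain-Python sketch of the ideas only, not Spark's API: records are split into partitions by key, a small lookup table is handed to every partition so the large side never moves, and the joined result is kept in memory and reused by two computations. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def hash_partition(records, num_partitions, key):
    """Partitioning: group records by key hash so each group can be processed independently."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[hash(rec[key]) % num_partitions].append(rec)
    return dict(partitions)


def broadcast_join(partition, lookup):
    """Broadcasting: each partition receives the same small lookup table in memory."""
    return [
        {**rec, "country": lookup.get(rec["user_id"], "unknown")}
        for rec in partition
    ]


def run_job(records, lookup, num_partitions=2):
    """Join every partition against the broadcast lookup and collect the rows."""
    partitions = hash_partition(records, num_partitions, "user_id")
    joined = []
    for pid in sorted(partitions):
        joined.extend(broadcast_join(partitions[pid], lookup))
    return joined


# Hypothetical sample data: a larger event set and a small dimension table.
events = [
    {"user_id": 1, "amount": 10},
    {"user_id": 2, "amount": 5},
    {"user_id": 1, "amount": 7},
    {"user_id": 3, "amount": 2},
]
countries = {1: "DE", 2: "IN"}

# Caching: compute the joined rows once, keep them in memory, reuse them twice.
cached = run_job(events, countries)

total_by_country = defaultdict(int)
for row in cached:
    total_by_country[row["country"]] += row["amount"]

event_count = len(cached)
```

In Spark itself the corresponding mechanisms are distributed across executors, so this sketch shows only why the techniques reduce repeated work and data movement, not how much speedup to expect.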
Miodrag-Stanic - PeerSpot reviewer
Dec 19, 2023
You can manage all services from one place in an integrated manner
We share company data leaks based on cloud data on their clusters. We had a data warehouse before all the data. We can process a lot more data structures. The solution has data managers. You can manage all services from one place in an integrated manner. You don't have to manage the other…

Quotes from Members

We asked business professionals to review the solutions they use. Here are some excerpts of what they said:
 

Pros

"The good performance. The nice graphical management console. The long list of ML algorithms."
"AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
"I feel the streaming is its best feature."
"It is useful for handling large amounts of data. It is very useful for scientific purposes."
"The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily."
"With Spark, we parallelize our operations, efficiently accessing both historical and real-time data."
"The most valuable feature of Apache Spark is its flexibility."
"The scalability has been the most valuable aspect of the solution."
"The data science aspect of the solution is valuable."
"The solution is reliable and stable, it fits our requirements."
"The most valuable feature is Impala, the querying engine, which is very fast."
"The most valuable feature is Kubernetes."
"I don't see any performance issues."
"It has the best proxy, security, and support features compared to open-source products."
"The solution is stable."
"Very good end-to-end security features."
 

Cons

"One limitation is that not all machine learning libraries and models support it."
"They could improve the issues related to programming language for the platform."
"The solution must improve its performance."
"It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."
"Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing."
"The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."
"Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users."
"Apache Spark should add some resource management improvements to the algorithms."
"Currently, we are using many other tools such as Spark and Blade Job to improve the performance."
"Without the big data environment, we cannot store all of this data live. We have billions of records and terabytes of storage to be used. It's not an option actually for us to have a big data environment."
"Cloudera Distribution for Hadoop is not always completely stable in some cases, which can be a concern for big data solutions."
"There is a maximum of a one-gigabyte block size, which is an area of storage that can be improved upon."
"It would be useful if Cloudera had more tools like SQL Engines that offer the traditional relational database. We have to do a lot of work preparing the data outside Cloudera before getting it into the platform."
"The one thing that we struggled with predominately was support. Because it was relatively new, support was always a big issue and I think it's still a bit of an ongoing concern with the team currently managing it."
"It could be faster and more user-friendly."
"While the deployed product is generally functional, there are instances where it presents difficulties."
 

Pricing and Cost Advice

"It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project."
"They provide an open-source license for the on-premise version."
"Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises."
"The solution is affordable and there are no additional licensing costs."
"Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera."
"Spark is an open-source solution, so there are no licensing costs."
"Deploying on the cloud model can be expensive, as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing."
"It is an open-source solution, it is free of charge."
"When comparing with Oracle, Sybase, and SQL, it's cheaper. It's not expensive."
"The solution is expensive."
"I believe we pay for a three-year license."
"I haven't bought a license for this solution. I'm only using the Apache license version."
"It is an expensive product."
"The price is very high. The solution is expensive."
"The price could be better for the product."
"Cloudera requires a license to use."
787,061 professionals have used our research since 2012.
 

Top Industries

By visitors reading reviews
Financial Services Firm: 25%
Computer Software Company: 13%
Manufacturing Company: 7%
Retailer: 5%

Financial Services Firm: 22%
Computer Software Company: 15%
Educational Organization: 9%
Manufacturing Company: 8%
 

Company Size

By reviewers
Large Enterprise
Midsize Enterprise
Small Business
 

Questions from the Community

What do you like most about Apache Spark?
We use Spark to process data from different data sources.
What needs improvement with Apache Spark?
In data analysis, you need to take real-time data from different data sources. You need to process this within a subsecond and do the transformation within a subsecond.
What do you like most about Cloudera Distribution for Hadoop?
The tool can be deployed using different container technologies, which makes it very scalable.
What is your experience regarding pricing and costs for Cloudera Distribution for Hadoop?
The tool is expensive. Overall, it's not a cheap software tool, and that is why only large enterprises who are mature enough and have an architecture that is complex enough opt for Cloudera, as its...
What needs improvement with Cloudera Distribution for Hadoop?
The tool's ability to be deployed on a cloud model is an area of concern where improvements are required. The tool works very well when deployed on an on-premises model. The deployment on a cloud p...
 

Sample Customers

NASA JPL, UC Berkeley AMPLab, Amazon, eBay, Yahoo!, UC Santa Cruz, TripAdvisor, Taboola, Agile Lab, Art.com, Baidu, Alibaba Taobao, EURECOM, Hitachi Solutions
37signals, Adconion, adgooroo, Aggregate Knowledge, AMD, Apollo Group, Blackberry, Box, BT, CSC
Find out what your peers are saying about Apache Spark vs. Cloudera Distribution for Hadoop and other solutions. Updated: May 2024.