Vice President at Goldman Sachs at a computer software company with 10,001+ employees
Stable product with a valuable SQL tool
Pros and Cons
- "The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
- "At the initial stage, the product provides no container logs to check the activity."
What is our primary use case?
We use the product for extensive data analysis. It helps us analyze a huge amount of data and transfer it to data scientists in our organization.
What is most valuable?
The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it. It is a useful feature for us.
What needs improvement?
At the initial stage, the product provides no container logs to check the activity. It remains inactive for a long time without giving us any information. The containers should start as quickly as they do in Jupyter Notebook.
For how long have I used the solution?
We have been using Apache Spark for eight months to one year.
Buyer's Guide
Apache Spark
February 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: February 2025.
832,138 professionals have used our research since 2012.
What do I think about the stability of the solution?
It is a stable product. I rate its stability an eight out of ten.
What do I think about the scalability of the solution?
We have 45 Apache Spark users. I rate its scalability a nine out of ten.
How was the initial setup?
The complexity of the initial setup depends on the kind of environment an organization is working with. It requires one executive for deployment. I rate the process an eight out of ten.
What's my experience with pricing, setup cost, and licensing?
The product is expensive, considering the setup. However, from a standalone perspective, it is inexpensive.
What other advice do I have?
I advise others to analyze their data and understand their business requirements before purchasing the product. I rate it an eight out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
CEO & Founder at Xautomata
Reduces startup time and gives excellent ROI
Pros and Cons
- "Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term."
- "The initial setup was not easy."
What is our primary use case?
I use Spark to run automation processes driven by data.
How has it helped my organization?
Apache Spark helped us with horizontal scalability and cost optimizations.
What is most valuable?
The most valuable feature is the grid computing.
What needs improvement?
An area for improvement is that when we start the solution and declare the maximum number of nodes, the process is shared, which is a problem in some cases. It would be useful to be able to change this parameter in real-time rather than having to stop the solution and restart with a higher number of nodes.
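For readers who hit the same limit, Spark's dynamic allocation settings are one possible way to let the number of executors vary between bounds while an application runs, instead of fixing it at launch. The sketch below only assembles `--conf` flags for `spark-submit`. The helper name `dynamic_allocation_flags`, the bounds, and the script name `job.py` are hypothetical, and it is an assumption that this applies to the reviewer's deployment: dynamic allocation adjusts executors within an existing cluster, not the number of cluster nodes.

```python
from typing import List


def dynamic_allocation_flags(min_executors: int, max_executors: int) -> List[str]:
    """Build spark-submit --conf flags that enable dynamic executor allocation.

    The helper and its arguments are illustrative; the property names are
    standard Spark configuration keys.
    """
    if not 0 <= min_executors <= max_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Available from Spark 3.0; tracks shuffle files so that executors
        # can be released without an external shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    }
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # "job.py" is a placeholder application script.
    print(" ".join(["spark-submit", *dynamic_allocation_flags(2, 20), "job.py"]))
```

With these flags the upper bound is still declared up front, but Spark can request and release executors between the two bounds as the workload changes.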
For how long have I used the solution?
I've been using Spark for around four years.
How was the initial setup?
The initial setup was not easy, but we created a means of asking the user about their needs, making the setup much easier. We can now deploy the platform in thirty minutes using the public cloud or Kubernetes space.
What was our ROI?
Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term.
What's my experience with pricing, setup cost, and licensing?
Spark is an open-source solution, so there are no licensing costs.
What other advice do I have?
I would rate Apache Spark eight out of ten.
Which deployment model are you using for this solution?
Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Manager - Data Science Competency at a tech services company with 201-500 employees
Fast-performance, cost-effective, and runs in a cloud-agnostic environment
Pros and Cons
- "One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them."
- "When you are working with large, complex tasks, the garbage collection process is slow and affects performance."
What is our primary use case?
My main task is working on predictive analytics, and Apache Spark is one of the tools that I utilize in this role. Primarily, we work with the predictive analysis of very large amounts of data.
Apache Spark is also helpful for data pre-processing, including data cleaning.
This solution is cloud-agnostic. You can use it with an EC2 instance and you can even install it on-premises. Some environments have it installed in VMs.
What is most valuable?
One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them.
Another feature is memory-based computing. This is unlike Hadoop, which relies on storage. As it uses in-memory data processing, Spark is very fast.
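The pattern of splitting a workload across workers and combining their partial results can be sketched in plain Python. This illustrates the idea only and is not Spark's API or implementation: a thread pool stands in for the worker nodes, and `split_into_partitions`, `count_words`, and `distributed_word_count` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_partitions(items: List[str], num_partitions: int) -> List[List[str]]:
    """Deal items round-robin into num_partitions lists."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def count_words(lines: List[str]) -> Counter:
    """Work done independently on one partition."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines: List[str], num_workers: int = 3) -> Counter:
    """Count words per partition in parallel, then merge the partial counts."""
    partitions = split_into_partitions(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, partitions))
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    sample = ["spark is fast", "spark is distributed", "fast fast"]
    print(distributed_word_count(sample, num_workers=2))
```

Each partition is processed without knowledge of the others, which is what allows the work to be spread over several machines; only the final merge needs all of the partial results.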
What needs improvement?
When you are working with large, complex tasks, the garbage collection process is slow and affects performance. This is an area where they need to improve because your job may fail if it is stuck for a long time while memory garbage collection is happening. This is the main problem that we have.
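Garbage collection pauses of this kind are commonly approached through executor JVM options and memory settings. The sketch below renders a few such properties in the whitespace-separated key/value form used by `spark-defaults.conf`. The function names and the values (`8g`, `0.6`) are hypothetical starting points rather than recommendations, and it is an assumption that G1 suits the reviewer's workload.

```python
from typing import Dict


def gc_tuning_properties(executor_memory: str = "8g", memory_fraction: float = 0.6) -> Dict[str, str]:
    """Return Spark properties often reviewed when GC pauses are a problem.

    The values are illustrative placeholders.
    """
    return {
        "spark.executor.memory": executor_memory,
        # Fraction of the heap (minus a reserved amount) used for execution and storage.
        "spark.memory.fraction": str(memory_fraction),
        # Select the G1 collector and log GC activity on the executors.
        "spark.executor.extraJavaOptions": "-XX:+UseG1GC -verbose:gc",
    }


def render_spark_defaults(properties: Dict[str, str]) -> str:
    """Render properties as 'key value' lines, sorted by key."""
    return "\n".join(f"{key} {value}" for key, value in sorted(properties.items()))


if __name__ == "__main__":
    print(render_spark_defaults(gc_tuning_properties()))
```

Enabling GC logging first makes it possible to see how often and for how long collections run before any other setting is changed.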
For how long have I used the solution?
I have been working with Apache Spark for the past four years.
What do I think about the stability of the solution?
This product is pretty stable. Companies like Facebook, Uber, and Netflix are all using Apache Spark. It's stable enough to be used all over the world.
What do I think about the scalability of the solution?
In our team that works on this, we have approximately 10 people.
How are customer service and support?
There is no official support for this solution. Because it's open-source and there is no cost involved, there is nobody to contact for support. Our own internal team of experts, who work on different problems, both support and contribute to the platform.
Which solution did I use previously and why did I switch?
I work on several open-source frameworks including Python, Scikit-learn, TensorFlow, PyTorch, H2O.ai, and R. We don't endorse proprietary tools so we aren't working with them.
How was the initial setup?
With respect to the initial setup, it's neither easy nor very difficult. Our team has experience so it is not difficult for them. However, for a person that is new to using it, the setup might be very difficult.
What about the implementation team?
We have a team of experts in my company, and they handle it very well.
What's my experience with pricing, setup cost, and licensing?
This is an open-source tool, so it can be used free of charge. There is no cost involved.
What other advice do I have?
We are not using the current version of this platform, Spark 3. However, we do know that it is used in the market and it has new features. We will eventually move to it.
My advice for anybody who wants to use Apache Spark is that they have two options. The first is to go with Databricks, the company founded by the creators of Apache Spark, and use their proprietary version. If you choose this option then you will have to pay for the product.
If instead, you use Apache Spark, then you can rely on your own expert in-house team for support, maintenance, and deployment. In this option, you don't have to pay anything to anybody outside of your company.
I would rate this solution an eight out of ten.
Which deployment model are you using for this solution?
Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Principal Architect at a financial services firm with 1,001-5,000 employees
Fast performance and has an easy initial setup
Pros and Cons
- "I found the solution stable. We haven't had any problems with it."
- "It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."
What is our primary use case?
We use the solution for analytics.
How has it helped my organization?
I'm not sure how it has improved my organization but I believe that it's a good product.
What is most valuable?
The fast performance is the most valuable aspect of the solution.
What needs improvement?
The search could be improved. Usually, we use other tools to search for specific things. We would use it the way we use other tools, to get the details, but it would be better if there were a way to search for small things.
It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster.
In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script.
For how long have I used the solution?
I've been using the solution for two years.
What do I think about the stability of the solution?
I found the solution stable. We haven't had any problems with it.
How are customer service and technical support?
Usually, we can fix any issues. If we have problems, we google a little bit to find the issue.
Which solution did I use previously and why did I switch?
I was using some other systems and we moved to Spark later. We faced performance and other issues with the other solution.
How was the initial setup?
The initial setup was easy. We keep on getting data from different sources so we will keep on porting in little bits. It's not done in a single sitting, so I can't really say how long it takes.
What other advice do I have?
I would recommend the solution. I would rate it an eight or nine out of 10.
For some areas, I would give it ten but I cannot use some parts. If you are going to use it for a consumer then I would be able to recommend it and you should go ahead. It doesn't work for me as I have different clients and different engagements.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Senior Test Automation Consultant / Architect at a tech services company with 11-50 employees
Useful for big data and scientific purposes, but needs better query handling, stability, and scalability
Pros and Cons
- "It is useful for handling large amounts of data. It is very useful for scientific purposes."
- "We are building our own queries on Spark, and it can be improved in terms of query handling."
What is our primary use case?
We are using it for big data. We are using a small part of it, which is related to using data.
What is most valuable?
It is useful for handling large amounts of data. It is very useful for scientific purposes.
What needs improvement?
There are some difficulties that we are working on. It is useful for scientific purposes, but for commercial use of big data, it gives some trouble.
They should improve the stability of the product. We use Spark Executors and Spark Drivers to link to our own environment, and they are not the most stable products. Its scalability is also an issue.
We are building our own queries on Spark, and it can be improved in terms of query handling.
For how long have I used the solution?
In my company, it has been used for several years, but I have been using it for seven months.
What do I think about the scalability of the solution?
It is not scalable. Scalability is one of the issues.
How are customer service and support?
It is open source from my point of view. So, there is no support.
What other advice do I have?
I would advise not using it if you don't have experienced users inside your organization. If you have to figure it all out on your own, then you shouldn't start with it.
Overall, I would rate it a six out of 10. For a commercial use case, it is a six out of 10. For scientific purposes, it is an eight out of 10.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Engineer at a tech vendor with 10,001+ employees
Spark provides lots of high-level APIs, which reduces duplication of work.
Valuable Features
Streaming data processing
Improvements to My Organization
In the previous version, we used Storm to handle real-time data; however, its performance did not meet our requirements. Spark Streaming's micro-batch mode helps improve performance. Also, Spark provides lots of high-level APIs, which reduces duplication of work.
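Micro-batching means grouping incoming records into small batches by time interval and processing each batch in one go, rather than handling every record individually. A minimal sketch of that grouping in plain Python follows. It is not the Spark Streaming API, and `Event` and `micro_batches` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group payloads into consecutive batches of `interval` seconds.

    Intervals that received no events are skipped.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        buckets[int(timestamp // interval)].append(payload)
    return [buckets[index] for index in sorted(buckets)]


if __name__ == "__main__":
    stream = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.5, "d")]
    for batch in micro_batches(stream, interval=1.0):
        # Each batch is processed as a unit, which amortizes per-record overhead.
        print(len(batch), batch)
```

The trade-off is latency: a record waits until its interval closes before it is processed, in exchange for higher throughput per batch.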
Room for Improvement
Better monitoring ability, especially monitoring integration with customer code.
Use of Solution
I've used it for one year.
Stability Issues
We ran into some issues with standalone deployment, which showed that its stability is not that good, so we plan to switch to YARN or Mesos mode.
Customer Service and Technical Support
I have to say it is bad. I can only ask for help in the Google group. However, it is run in the developer-for-developer style. There are almost no people from Databricks. I also use a Cassandra-Spark-connector, and DataStax has at least one dedicated person to help the community.
Initial Setup
Not that straightforward in terms of standalone deployment; there are some tricks which are not mentioned in the docs.
Implementation Team
We did it in-house.
Pricing, Setup Cost and Licensing
So far we have no plan to switch to a commercial license.
Other Advice
I love Spark over other solutions.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Director of BigData Offer at IVIDATA
Stable, fast, and easy to use
Pros and Cons
- "The solution is very stable."
- "The solution needs to optimize shuffling between workers."
What is our primary use case?
We primarily use the solution to integrate very large data sets from another environment, such as our SQL environment, and draw purposeful data before checking it. We also use the solution for streaming very large servers.
What is most valuable?
It is a very fast solution. It's very easy to use. There are many APIs for many languages, such as Scala, Java, R, and Python. The greatest advantage of Spark is that we can initiate many kinds of analytics including SQL analytics, graph analytics, etc.
What needs improvement?
The solution needs to optimize shuffling between workers.
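A shuffle redistributes records so that all records with the same key end up on the same worker, which is why it involves moving data between workers. The plain-Python sketch below shows the idea with a hash partitioner and counts how many records have to change partition. It is an illustration only: `crc32` stands in for Spark's own hashing, and `target_partition` and `shuffle_by_key` are hypothetical names.

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, int]  # (key, value)


def target_partition(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (crc32 is a stand-in, not Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(partitions: List[List[Record]]) -> Tuple[List[List[Record]], int]:
    """Redistribute records by key.

    Returns the new partitions and the number of records that had to move
    to a different partition.
    """
    num_partitions = len(partitions)
    shuffled: List[List[Record]] = [[] for _ in range(num_partitions)]
    moved = 0
    for source, records in enumerate(partitions):
        for key, value in records:
            destination = target_partition(key, num_partitions)
            shuffled[destination].append((key, value))
            if destination != source:
                moved += 1
    return shuffled, moved


if __name__ == "__main__":
    before = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
    after, moved = shuffle_by_key(before)
    print(after)
    print("records moved:", moved)
```

Every moved record corresponds to data that would cross the network in a real cluster, so operations that reduce the number or size of records before the shuffle tend to be cheaper.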
For how long have I used the solution?
I've been using the solution for four or five years.
What do I think about the stability of the solution?
The solution is very stable.
What do I think about the scalability of the solution?
The solution is scalable. My understanding is that version 3.0 has renewed scaling capabilities and will be able to scale automatically.
How are customer service and technical support?
Apache is an open-source platform so there is no technical support.
What other advice do I have?
We use both on-premises and public and private cloud deployment models. We're partners with Databricks.
I'm a consultant. Our company works for large enterprises such as banks and energy companies. 17 of our workers use Apache Spark.
With the cloud, there are many companies that integrate Spark. Most projects in big data around the world use Spark, indirectly or directly.
I'd rate the solution eight out of ten.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Architect at a healthcare company with 51-200 employees
Having everything in the same framework has helped us out a lot
Pros and Cons
- "ETL and streaming capabilities."
- "Stability in terms of API (things were difficult when transitioning from RDD to DataFrames, then to Dataset)."
What is most valuable?
ETL and streaming capabilities.
How has it helped my organization?
It made big data processing more convenient, and a uniform framework adds to efficiency of use, since the same framework can be used for batch and stream processing.
What needs improvement?
Stability in terms of API (things were difficult when transitioning from RDD to DataFrames, then to Dataset).
For how long have I used the solution?
I have used Spark since its inception in March 2015, from Spark 1.1 onwards.
Currently, I use 2.2 extensively.
What do I think about the stability of the solution?
We occasionally encountered stability issues with different APIs.
What do I think about the scalability of the solution?
We did not encounter any scalability issues.
How are customer service and technical support?
Since we were using the open-source version of Apache Spark, without the Databricks support, we never used technical support from Databricks.
Which solution did I use previously and why did I switch?
Yes, we used Hive, Pig, and Storm. Having everything in the same framework has helped us out a lot.
Which other solutions did I evaluate?
Yes, we considered other big data products in the Big Data Ecosystem.
What other advice do I have?
Go for it.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Popular Comparisons
Amazon EMR
Cloudera Distribution for Hadoop
Spark SQL
IBM Spectrum Computing
Hortonworks Data Platform
Informatica Big Data Parser
IBM Db2 Big SQL