Salvatore Campana - PeerSpot reviewer
CEO & Founder at Xautomata
Real User
Top 5
Reduces startup time and gives excellent ROI
Pros and Cons
  • "Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term."
  • "The initial setup was not easy."

What is our primary use case?

I use Spark to run automation processes driven by data.

How has it helped my organization?

Apache Spark helped us with horizontal scalability and cost optimizations.

What is most valuable?

The most valuable feature is the grid computing.

What needs improvement?

An area for improvement is that when we start the solution and declare the maximum number of nodes, the process is shared, which is a problem in some cases. It would be useful to be able to change this parameter in real-time rather than having to stop the solution and restart with a higher number of nodes.
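For readers who hit the same limit, Spark's dynamic allocation settings are a related option. They let the executor count float between a minimum and a maximum, although both bounds are still declared when the job is submitted, which is the limitation described above. A minimal sketch, assuming a hypothetical helper named `dynamic_allocation_flags` and illustrative bounds; only the `spark.dynamicAllocation.*` keys are real Spark properties:

```python
def dynamic_allocation_flags(min_executors, max_executors):
    """Build spark-submit --conf flags for dynamic allocation (hypothetical helper)."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Available from Spark 3.0; older clusters typically rely on the
        # external shuffle service instead.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    }
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags
```

The returned list can be appended to a `spark-submit` command line. Raising the ceiling later still means resubmitting with a larger `max_executors`.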

Buyer's Guide
Apache Spark
March 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: March 2025.
839,422 professionals have used our research since 2012.

For how long have I used the solution?

I've been using Spark for around four years.

How was the initial setup?

The initial setup was not easy, but we created a means of asking the user about their needs, making the setup much easier. We can now deploy the platform in thirty minutes using the public cloud or Kubernetes space.

What was our ROI?

Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term.

What's my experience with pricing, setup cost, and licensing?

Spark is an open-source solution, so there are no licensing costs. 

What other advice do I have?

I would rate Apache Spark eight out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
reviewer1185906 - PeerSpot reviewer
Manager - Data Science Competency at a tech services company with 201-500 employees
Consultant
Fast-performance, cost-effective, and runs in a cloud-agnostic environment
Pros and Cons
  • "One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them."
  • "When you are working with large, complex tasks, the garbage collection process is slow and affects performance."

What is our primary use case?

My main task is working on predictive analytics, and Apache Spark is one of the tools that I utilize in this role. Primarily, we work with the predictive analysis of very large amounts of data.

Apache Spark is also helpful for data pre-processing, including data cleaning.

This solution is cloud-agnostic. You can use it with an EC2 instance and you can even install it on-premises. Some environments have it installed in VMs.

What is most valuable?

One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them.

Another feature is memory-based computing. This is unlike Hadoop, which relies on storage. As it uses in-memory data processing, Spark is very fast.

What needs improvement?

When you are working with large, complex tasks, the garbage collection process is slow and affects performance. This is an area where they need to improve because your job may fail if it is stuck for a long time while memory garbage collection is happening. This is the main problem that we have.
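Long garbage collection pauses of this kind are often approached through JVM and memory settings. The sketch below shows what such a configuration might look like. The property names are real Spark settings, but the values are illustrative assumptions and `render_spark_defaults` is a hypothetical helper, not part of Spark:

```python
# Illustrative values only; suitable settings depend on the workload and JVM.
GC_TUNING = {
    # Ask executors to use the G1 collector.
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    # Share of the heap used for execution and storage (0.6 is Spark's default).
    "spark.memory.fraction": "0.6",
    # Kryo usually produces more compact serialized data than Java serialization.
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def render_spark_defaults(settings):
    """Render settings as spark-defaults.conf lines, one 'key value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(settings.items()))
```

Calling `render_spark_defaults(GC_TUNING)` produces text that could be pasted into a `spark-defaults.conf` file. Whether it helps depends on measuring GC time before and after.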

For how long have I used the solution?

I have been working with Apache Spark for the past four years.

What do I think about the stability of the solution?

This product is pretty stable. Companies like Facebook, Uber, and Netflix are all using Apache Spark. It's stable enough to be used all over the world.

What do I think about the scalability of the solution?

In our team that works on this, we have approximately 10 people.

How are customer service and support?

There is no official support for this solution. Because it's open-source and there is no cost involved, there is nobody to contact for support. Our own internal team of experts, who work on different problems, both support and contribute to the platform.

Which solution did I use previously and why did I switch?

I work on several open-source frameworks including Python, Scikit-learn, TensorFlow, PyTorch, H2O.ai, and R. We don't endorse proprietary tools so we aren't working with them.

How was the initial setup?

With respect to the initial setup, it's neither easy nor very difficult. Our team has experience so it is not difficult for them. However, for a person that is new to using it, the setup might be very difficult.

What about the implementation team?

We have a team of experts in my company, and they handle it very well.

What's my experience with pricing, setup cost, and licensing?

This is an open-source tool, so it can be used free of charge. There is no cost involved.

What other advice do I have?

We are not using the current version of this platform, Spark 3. However, we do know that it is used in the market and it has new features. We will eventually move to it.

My advice for anybody who wants to use Apache Spark is that they have two options. The first is to go with Databricks, the company founded by the creators of Apache Spark, and use its proprietary version. If you choose this option, you will have to pay for the product.

If instead, you use Apache Spark, then you can rely on your own expert in-house team for support, maintenance, and deployment. In this option, you don't have to pay anything to anybody outside of your company.

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user946074 - PeerSpot reviewer
Principal Architect at a financial services firm with 1,001-5,000 employees
Real User
Fast performance and has an easy initial setup
Pros and Cons
  • "I found the solution stable. We haven't had any problems with it."
  • "It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."

What is our primary use case?

We use the solution for analytics.

How has it helped my organization?

I'm not sure how it has improved my organization but I believe that it's a good product.

What is most valuable?

The fast performance is the most valuable aspect of the solution.

What needs improvement?

The search could be improved. Usually, we use other tools to search for specific things. We use it the way I use other tools, to get the details, but if there were a way to search for small things, that would be better.

It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster.

In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script. 

For how long have I used the solution?

I've been using the solution for two years.

What do I think about the stability of the solution?

I found the solution stable. We haven't had any problems with it.

How are customer service and technical support?

Usually, we can fix any issues. If we have problems, we google a little bit to find the issue. 

Which solution did I use previously and why did I switch?

I was using some other systems and we moved to Spark later. We faced performance and other issues with the other solution.

How was the initial setup?

The initial setup was easy. We keep on getting data from different sources so we will keep on porting in little bits. It's not done in a single sitting, so I can't really say how long it takes.

What other advice do I have?

I would recommend the solution. I would rate it an eight or nine out of 10.

For some areas, I would give it ten but I cannot use some parts. If you are going to use it for a consumer then I would be able to recommend it and you should go ahead. It doesn't work for me as I have different clients and different engagements.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
reviewer1792824 - PeerSpot reviewer
Senior Test Automation Consultant / Architect at a tech services company with 11-50 employees
Consultant
Useful for big data and scientific purposes, but needs better query handling, stability, and scalability
Pros and Cons
  • "It is useful for handling large amounts of data. It is very useful for scientific purposes."
  • "We are building our own queries on Spark, and it can be improved in terms of query handling."

What is our primary use case?

We are using it for big data. We are using a small part of it, which is related to using data.

What is most valuable?

It is useful for handling large amounts of data. It is very useful for scientific purposes.

What needs improvement?

There are some difficulties that we are working on. It is useful for scientific purposes, but for commercial use of big data, it gives some trouble.

They should improve the stability of the product. We use Spark Executors and Spark Drivers to link to our own environment, and they are not the most stable products. Its scalability is also an issue.

We are building our own queries on Spark, and it can be improved in terms of query handling.

For how long have I used the solution?

In my company, it has been used for several years, but I have been using it for seven months.

What do I think about the scalability of the solution?

It is not scalable. Scalability is one of the issues.

How are customer service and support?

It is open source from my point of view. So, there is no support.

What other advice do I have?

I would advise not using it if you don't have experienced users inside your organization. If you have to figure it all out on your own, then you shouldn't start with it.

Overall, I would rate it a six out of 10. For a commercial use case, it is a six out of 10. For scientific purposes, it is an eight out of 10.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
reviewer879201 - PeerSpot reviewer
Technical Consultant at a tech services company with 1-10 employees
Consultant
Good streaming features enable data entry and analysis within Spark Streaming
Pros and Cons
  • "I feel the streaming is its best feature."
  • "When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."

What is our primary use case?

We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.

What is most valuable?

I have worked with Hadoop a lot in my career and you need to do a lot of things to get it to Hello World. But in Spark it is easy. You could say it's an umbrella to do everything under the one shelf. It also has Spark Streaming. I feel the streaming is its best feature because I have extracted to enter data and analysis within Spark Stream.

What needs improvement?

I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like contributions, and if you want to connect Spark with Hadoop it's not a big thing, but for other things, such as using Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that does all these configurations, like in Windows, where you have the whole solution and it handles the back-end. I think that kind of solution would help. Still, it can do everything for a data scientist.

Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best.

Overall, it offers everything that I can imagine right now. 

For how long have I used the solution?

I have been using Apache Spark for a couple of months.

What do I think about the stability of the solution?

In terms of stability, I have not seen any bugs, glitches, or crashes. Even if there are, that's fine, because I would probably take care of them and then I'd have progressed further in the process.

What do I think about the scalability of the solution?

I have not tested the scalability yet.

In my company, there are two or three people that are using it for different products. But right now, the client I'm engaged with doesn't know anything about Spark or Hadoop. They are a typical financial company so they do what they do, and they ask us to do everything. They have pretty much outsourced their whole big data initiative to us.

Which solution did I use previously and why did I switch?

I have used MapReduce from Hadoop previously. Otherwise, I haven't used any other big data infrastructure.

In my previous work, not at this company, I was working with some big data, but I was extracting it using a single core of my PC. I realized over time that my system had eight cores, so I used all of those cores for multi-core programming. Then I realized that Hadoop and Spark do the same thing but across different PCs. That was when I used multi-core programming, and that's the point: Spark needs to go and search Hadoop and other things.
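The move from one core to eight can be sketched with Python's standard library. This is only an illustration of the idea, not the reviewer's actual code: the function names and the sum-of-squares workload are assumptions chosen to keep the example small.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work done independently on each chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=8, executor_cls=ProcessPoolExecutor):
    """Fan chunks out to a pool of workers, then combine the partial results."""
    chunks = split_into_chunks(items, workers)
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))
```

With the default `ProcessPoolExecutor`, each chunk runs in its own process and can use a separate core. Passing `executor_cls=ThreadPoolExecutor` keeps the same structure with threads. Spark and Hadoop apply the same split, compute, and combine pattern across machines rather than cores.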

How was the initial setup?

The initial setup to get it to Hello World is pretty easy, you just have to install it. But when you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources. But you can get a lot of help from different sources on the internet. So it's great. A lot of people are doing it.

I work with a startup company. You know that in startups you do not have the luxury of different people doing different things, you have to do everything on your own, and it's an opportunity to learn everything. In a typical corporate or big organization you only have restricted SOPs, you have to work within the boundaries. In my organization, I have to set up all the things, configure it, and work on it myself.

What's my experience with pricing, setup cost, and licensing?

I would suggest not to try to do everything at once. Identify the area where you want to solve the problem, start small and expand it incrementally, slowly expand your vision. For example, if I have a problem where I need to do streaming, just focus on the streaming and not on the machine learning that Spark offers. It offers a lot of things but you need to focus on one thing so that you can learn. That is what I have learned from the little experience I have with Spark. You need to focus on your objective and let the tools help you rather than the tools drive the work. That is my advice.

What other advice do I have?

On a scale of 1 to 10, I'd put it at an eight.

To make it a perfect 10 I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened in the configuration and back-end. So I think installation and configuration with some other tools could be improved. We are technical people, so we can figure it out, but if aspects like that were improved then other people who are less technical would use it and it would be more adaptable to the end-user.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Engineer at a tech vendor with 10,001+ employees
Real User
Spark provides lots of high-level APIs, which reduces duplication of work.

Valuable Features

Streaming data processing

Improvements to My Organization

In the previous version, we used Storm to handle real-time data, but its performance didn't meet our requirements. Spark Streaming's micro-batch mode helps improve performance. Also, Spark provides lots of high-level APIs, which reduces duplication of work.
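Micro-batching means collecting the events that arrive within a short interval and processing them together instead of one at a time. The sketch below illustrates the grouping step in plain Python. It is a simplified illustration and not Spark's implementation: `micro_batches` is a hypothetical function, and it assumes events arrive as time-ordered `(timestamp, payload)` pairs.

```python
from itertools import groupby


def micro_batches(events, interval):
    """Group time-ordered (timestamp, payload) events into fixed-interval batches.

    Intervals that contain no events are skipped in this sketch.
    """
    for _, group in groupby(events, key=lambda event: int(event[0] // interval)):
        yield [payload for _, payload in group]
```

For example, with a one-second interval, events stamped 0.1 and 0.5 land in the same batch, while an event stamped 1.2 starts the next one. Processing a whole batch at once spreads per-record overhead across many records, which is the usual reason this mode improves throughput.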

Room for Improvement

Better monitoring ability. Especially monitoring integration with customer codes.

Use of Solution

I've used it for one year.

Stability Issues

We ran into some standalone deployment issues, which showed that its stability is not that good, so we plan to switch to YARN or Mesos mode.

Customer Service and Technical Support

I have to say it is bad. I can only ask for help in the Google group, which is run in a developer-for-developer style. There are almost no people from Databricks. I also use a Cassandra-Spark connector, and DataStax has at least one dedicated person to help the community.

Initial Setup

Not that straightforward in terms of standalone deployment; there are some tricks that are not mentioned in the docs.

Implementation Team

We did it in-house.

Pricing, Setup Cost and Licensing

So far, we have no plan to switch to a commercial license.

Other Advice

I love Spark over other solutions.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Director of BigData Offer at IVIDATA
Real User
Stable, fast, and easy to use
Pros and Cons
  • "The solution is very stable."
  • "The solution needs to optimize shuffling between workers."

What is our primary use case?

We primarily use the solution to integrate very large data sets from another environment, such as our SQL environment, and draw purposeful data before checking it. We also use the solution for streaming very very large servers. 

What is most valuable?

It is a very fast solution. It's very easy to use. There are many APIs for many languages, such as Scala, Java, R, and Python. The greatest advantage of Spark is that we can run many kinds of analytics, including SQL analytics, graph analytics, etc.

What needs improvement?

The solution needs to optimize shuffling between workers.

For how long have I used the solution?

I've been using the solution for four or five years.

What do I think about the stability of the solution?

The solution is very stable.

What do I think about the scalability of the solution?

The solution is scalable. My understanding is version 3.0 has renewed scaling capabilities and will be able to do so automatically.

How are customer service and technical support?

Apache is an open-source platform so there is no technical support.

What other advice do I have?

We use both on-premises and public and private cloud deployment models. We're partners with Databricks.

I'm a consultant. Our company works for large enterprises such as banks and energy companies. 17 of our workers use Apache Spark.

With the cloud, there are many companies that integrate Spark. Most projects in big data around the world use Spark, indirectly or directly. 

I'd rate the solution eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user326142 - PeerSpot reviewer
Architect at a healthcare company with 51-200 employees
Real User
Having everything in the same framework has helped us out a lot
Pros and Cons
  • "ETL and streaming capabilities."
  • "Stability in terms of API (things were difficult, when transitioning from RDD to DataFrames, then to DataSet)."

What is most valuable?

ETL and streaming capabilities.

How has it helped my organization?

It made big data processing more convenient, and having a uniform framework adds to efficiency, since the same framework can be used for batch and stream processing.

What needs improvement?

Stability in terms of API (things were difficult, when transitioning from RDD to DataFrames, then to DataSet).

For how long have I used the solution?

I have used Spark since its inception in March 2015, from Spark 1.1 onwards.

Currently, I use 2.2 extensively.

What do I think about the stability of the solution?

Yes, occasionally with different APIs.

What do I think about the scalability of the solution?

No.

How are customer service and technical support?

Since we were using the open-source version of Apache Spark, without Databricks support, we never used technical support from Databricks.

Which solution did I use previously and why did I switch?

Yes we used Hive, Pig, and Storm. Having everything in the same framework has helped us out a lot.

Which other solutions did I evaluate?

Yes, we considered other big data products in the Big Data Ecosystem.

What other advice do I have?

Go for it.

Disclosure: I am a real user, and this review is based on my own experience and opinions.