it_user371334 - PeerSpot reviewer
CEO at a tech consulting company with 51-200 employees
Consultant
It's enabled interactive self-service access to data.

What is most valuable?

There are several valuable features.

  • Interactive data access (low latency)
  • Batch ETL-style processing
  • Schema-free data models
  • Algorithms

How has it helped my organization?

We have seen a 1000x improvement in performance over other techniques. It's enabled interactive self-service access to data.

What needs improvement?

Better integration of BI tools would be a much appreciated improvement.

For how long have I used the solution?

I've used it for about 14 months.

Buyer's Guide
Apache Spark
February 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: February 2025.
832,138 professionals have used our research since 2012.

What was my experience with deployment of the solution?

I haven't had any issues with deployment.

What do I think about the stability of the solution?

It's been stable for us.

What do I think about the scalability of the solution?

It's scaled without issue.

How are customer service and support?

Customer Service:

Customer service is excellent.

Technical Support:

Technical support is excellent.

Which solution did I use previously and why did I switch?

Yes, we previously used Oracle, from which we ported our data.

How was the initial setup?

The initial setup was simple.

What about the implementation team?

We implemented it with our in-house team.

What other advice do I have?

Be sure to use the Apache versions and avoid vendor-specific extensions.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
reviewer1535340 - PeerSpot reviewer
Senior Solutions Architect at a retailer with 10,001+ employees
Real User
A unified analytics engine with a valuable parallel processing feature
Pros and Cons
  • "I like that it can handle multiple tasks in parallel. I also like the automation feature. JavaScript also helps with the parallel streaming of the library."
  • "The logging for the observability platform could be better."

What is our primary use case?

We use Apache Spark to prepare data for transformation and encryption, depending on the columns. We use AES-256 encryption. We're building a proof of concept at the moment. We prepare patches on Spark for Kubernetes on-premise and Google Cloud Platform.

What is most valuable?

I like that it can handle multiple tasks in parallel. I also like the automation feature. JavaScript also helps with the parallel streaming of the library.

What needs improvement?

The logging for the observability platform could be better.

For how long have I used the solution?

I have known about this technology for a long time, about three years.

Which solution did I use previously and why did I switch?

Because my area is data analytics and analytics solutions, I use BigQuery, SQL, and ETL. I also use Dataproc and DataFlow.

What about the implementation team?

We use an integrator sometimes, but recently we put together a team to support the infrastructural requirements. This is because the proof of concept is self-administered.

What other advice do I have?

I would recommend Apache Spark to new users, but it depends on the use case. Sometimes, it's not the best solution.

On a scale from one to ten, I would give Apache Spark a ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
reviewer1046250 - PeerSpot reviewer
Senior Consultant & Training at a tech services company with 51-200 employees
Consultant
Easy to use and is capable of processing large amounts of data
Pros and Cons
  • "The most valuable feature of this solution is its capacity for processing large amounts of data."
  • "When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."

What is our primary use case?

We use this solution for information gathering and processing. 

I use it myself when I am developing on my laptop.

I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.

What is most valuable?

The most valuable feature of this solution is its capacity for processing large amounts of data.

This solution makes it easy to do a lot of things. It's easy to read data, process it, save it, etc.

What needs improvement?

When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable.
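As a rough illustration of the settings that usually come up when chasing these memory errors, here is a minimal sketch in plain Python that assembles a `spark-submit` command line. The configuration keys (`spark.executor.memory`, `spark.driver.memory`, `spark.sql.shuffle.partitions`) are standard Spark properties; the helper names, the numbers, and `job.py` are hypothetical and would need tuning for a real cluster.

```python
from typing import Dict, List


def memory_conf(executor_gb: int, driver_gb: int, shuffle_partitions: int) -> Dict[str, str]:
    """Return Spark properties that commonly matter for out-of-memory errors."""
    if min(executor_gb, driver_gb, shuffle_partitions) <= 0:
        raise ValueError("all values must be positive")
    return {
        # Heap size of each executor JVM.
        "spark.executor.memory": f"{executor_gb}g",
        # Heap size of the driver JVM (matters for collect() and broadcasts).
        "spark.driver.memory": f"{driver_gb}g",
        # More shuffle partitions means less data per task.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }


def spark_submit_args(conf: Dict[str, str], app: str) -> List[str]:
    """Turn a property dict into spark-submit arguments (illustrative helper)."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Example values only; they are assumptions, not recommendations.
conf = memory_conf(executor_gb=8, driver_gb=4, shuffle_partitions=400)
command = spark_submit_args(conf, "job.py")
print(" ".join(command))
```

Raising the shuffle partition count is often tried before raising executor memory, since it lowers the amount of data each task has to hold at once.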

When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.

For how long have I used the solution?

I have been using this solution for between two and three years.

What do I think about the stability of the solution?

This solution is difficult for beginners, who tend to hit out-of-memory errors when dealing with large amounts of data.

How are customer service and technical support?

I have not been in contact with technical support. I find all of the answers that I need in the forums.

What other advice do I have?

The work that we are doing with this solution is quite common and is very easy to do.

My advice for anybody who is implementing this solution is to look at their needs and then look at the community. Normally, there are a lot of people who have already done what you need. So, even without experience, it is quite simple to do a lot of things.

I would rate this solution a nine out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
reviewer894894 - PeerSpot reviewer
Works at a computer software company with 51-200 employees
User
Features include machine learning, real-time streaming, and data processing. It doesn't enable Spark job scheduling with monitoring capability.
Pros and Cons
  • "Features include machine learning, real time streaming, and data processing."
  • "The fault tolerant feature is provided."
  • "It provides a scalable machine learning library."
  • "It should support more programming languages."
  • "Needs to provide an internal scheduler to schedule Spark jobs with monitoring capability."

What is our primary use case?

Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.

How has it helped my organization?

It provides a scalable machine learning library so that we can train and predict user behavior for promotion purposes.

What is most valuable?

Machine learning, real-time streaming, and data processing are fantastic, as is the resilient, fault-tolerant design.

What needs improvement?

I would suggest that it support more programming languages and provide an internal scheduler to schedule Spark jobs with monitoring capability.

For how long have I used the solution?

Trial/evaluations only.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user373173 - PeerSpot reviewer
Lead Big Data Engineer at a non-profit with 51-200 employees
Vendor
I use it to process large amounts of data in the energy industry.

What is most valuable?

Spark is relatively easy to deploy and has rich features for handling big data. Spark Core, Spark SQL, and Spark MLlib are the components we use most in our applications.

How has it helped my organization?

I use Spark to process large amounts of data in the energy industry.

What needs improvement?

A good tool for analysing Spark application performance is needed. Right now there are still many parameters to tune to get good performance from a Spark application; I would like to see automatic tuning of parameters.

For how long have I used the solution?

I've been using Spark for seven months.

What was my experience with deployment of the solution?

There were no issues with the deployment.

What do I think about the stability of the solution?

I ran into Spark application performance issues. For instance, Spark JDBC write performance needs to be improved.
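For readers hitting the same JDBC write bottleneck, the two data source options most often adjusted are `batchsize` (rows per JDBC batch insert) and `numPartitions` (upper bound on parallel connections). The sketch below only builds the option dictionary in plain Python; the helper name, the URL, the table name, and the numbers are hypothetical, and whether these options help depends on the target database.

```python
from typing import Dict


def jdbc_write_options(url: str, table: str, batch_size: int = 10000,
                       num_partitions: int = 8) -> Dict[str, str]:
    """Build options for Spark's JDBC writer (illustrative helper, not a Spark API)."""
    if batch_size <= 0 or num_partitions <= 0:
        raise ValueError("batch_size and num_partitions must be positive")
    return {
        "url": url,
        "dbtable": table,
        # Rows sent per JDBC batch; larger batches mean fewer round trips.
        "batchsize": str(batch_size),
        # Maximum number of partitions, and so connections, used for the write.
        "numPartitions": str(num_partitions),
    }


# Placeholder connection details, assumed for the example.
opts = jdbc_write_options("jdbc:postgresql://db.example.com/energy", "readings")

# With PySpark installed and an existing DataFrame `df`, the options
# would be passed along these lines:
#   df.write.format("jdbc").options(**opts).mode("append").save()
print(opts)
```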

What do I think about the scalability of the solution?

There were no issues with the scalability.

How are customer service and technical support?

Customer Service:

I use the Apache open-source version. We handle everything on our own.

Technical Support:

I use the Apache open-source version. We handle everything on our own.

Which solution did I use previously and why did I switch?

I evaluated a Hadoop-based solution and chose Spark for its fast processing and ease of use.

How was the initial setup?

The initial setup is not complex. The online documents are pretty good.

What about the implementation team?

I implemented it in-house.

What other advice do I have?

Get to know how Spark works (what a job, stage, task, and DAG are, etc.); it will help you write Spark applications.
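To make the job, stage, and task vocabulary concrete, here is a toy model in plain Python. It is not Spark code and simplifies heavily: it assumes a linear chain of operations, cuts a new stage at every operation that needs a shuffle, and assumes one task per partition in each stage. The `Op` class and `split_into_stages` function are made up for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    name: str
    wide: bool  # True if the operation needs a shuffle (e.g. reduceByKey)


def split_into_stages(ops: List[Op]) -> List[List[str]]:
    """Group a linear chain of operations into stages, cutting at each shuffle."""
    stages: List[List[str]] = [[]]
    for op in ops:
        # A wide operation starts a new stage, unless the current one is empty.
        if op.wide and stages[-1]:
            stages.append([])
        stages[-1].append(op.name)
    return stages


# A word-count style job: the final action triggers the whole chain.
job = [
    Op("textFile", False),
    Op("flatMap", False),
    Op("map", False),
    Op("reduceByKey", True),
    Op("filter", False),
    Op("saveAsTextFile", False),
]

stages = split_into_stages(job)
partitions = 4  # assumed partition count; each stage runs one task per partition
for index, stage in enumerate(stages):
    print(f"stage {index}: {stage} -> {partitions} tasks")
```

In this model the job has two stages because `reduceByKey` is the only shuffle; narrow operations such as `map` and `filter` are pipelined inside a stage.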

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Managing Consultant at a computer software company with 501-1,000 employees
Real User
Good performance and resource management for hosting our data science platform
Pros and Cons
  • "The processing time is very much improved over the data warehouse solution that we were using."
  • "I would like to see integration with data science platforms to optimize the processing capability for these tasks."

What is our primary use case?

Our use case for Apache Spark was a retail price prediction project. We were using retail pricing data to build predictive models. To start, the prices were analyzed and we created the dataset to be visualized using Tableau. We then used a visualization tool to create dashboards and graphical reports to showcase the predictive modeling data.

Apache Spark was used to host this entire project.

How has it helped my organization?

The processing time is very much improved over the data warehouse solution that we were using.

What is most valuable?

The most valuable features are the storage engine, the memory engine, and the processing engine.

What needs improvement?

I would like to see integration with data science platforms to optimize the processing capability for these tasks.

For how long have I used the solution?

I have been using Apache Spark for the past year.

How are customer service and technical support?

We have not been in contact with technical support.

What's my experience with pricing, setup cost, and licensing?

The initial setup is straightforward. It took us around one week to set it up, and then the requirements and creation of the project flow and design needed to be done. The design stage took three to four weeks, so in total, it required between four and five weeks to set up.

What other advice do I have?

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
it_user786777 - PeerSpot reviewer
Manager | Data Science Enthusiast | Management Consultant at a consultancy with 5,001-10,000 employees
Real User
We can now harness richer data sets and benefit from use cases
Pros and Cons
  • "With Hadoop-related technologies, we can distribute the workload across multiple commodity machines."
  • "Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing."

How has it helped my organization?

Organisations can now harness richer data sets and benefit from use cases, which add value to their business functions.

What is most valuable?

Distributed in-memory processing. Some of the algorithms are resource heavy, and executing them requires a lot of RAM and CPU. With Hadoop-related technologies, we can distribute the workload across multiple commodity machines.

What needs improvement?

Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing.

For how long have I used the solution?

Three to five years.

What do I think about the stability of the solution?

When users who do not know how to use Spark request a lot of resources, the underlying JVMs can crash, which is a big source of worry.

What do I think about the scalability of the solution?

No issues.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Snr Security Engineer at a tech vendor with 201-500 employees
Real User
Provides security analytics and has good scalability
Pros and Cons
  • "The scalability has been the most valuable aspect of the solution."
  • "The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive."

What is our primary use case?

We primarily use the solution for security analytics.

What is most valuable?

The scalability has been the most valuable aspect of the solution.

What needs improvement?

The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive. 

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The 2.3 version is quite stable. All of our customers use it; there are more than 100,000 users, and it runs 24/7.

What do I think about the scalability of the solution?

The scalability is very good.

How are customer service and technical support?

You actually buy Cloudera along with it. You don't really get any support, except you need support.

Which solution did I use previously and why did I switch?

In previous companies, we used the MySQL platform and solutions like ArcSight and Splunk. We switched for scalability. MySQL wasn't going to scale, and we don't use Splunk at this company.

How was the initial setup?

The initial setup was complex. It is a complex tool, and much depends on how you will use it. There is a lot to set up. You need to add a lot of scripts; there are nearly 60 to set up. In the cloud, setup takes about a day. On-premises, on your own hardware, it takes about a week.

What other advice do I have?

I would rate this solution eight out of 10. 

Disclosure: I am a real user, and this review is based on my own experience and opinions.