it_user371334 - PeerSpot reviewer
CEO at a tech consulting company with 51-200 employees
Consultant
It's enabled interactive self-service access to data.

What is most valuable?

There are several valuable features.

  • Interactive data access (low latency)
  • Batch ETL-style processing
  • Schema-free data models
  • Algorithms

How has it helped my organization?

We have seen a 1000x improvement in performance over other techniques. It's enabled interactive self-service access to data.

What needs improvement?

Better integration of BI tools would be a much-appreciated improvement.

For how long have I used the solution?

I've used it for about 14 months.

Buyer's Guide
Apache Spark
November 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: November 2024.
816,406 professionals have used our research since 2012.

What was my experience with deployment of the solution?

I haven't had any issues with deployment.

What do I think about the stability of the solution?

It's been stable for us.

What do I think about the scalability of the solution?

It's scaled without issue.

How are customer service and support?

Customer Service:

Customer service is excellent.

Technical Support:

Technical support is excellent.

Which solution did I use previously and why did I switch?

Yes, we previously used Oracle, from which we ported our data.

How was the initial setup?

The initial setup was simple.

What about the implementation team?

We implemented it with our in-house team.

What other advice do I have?

Be sure to use the Apache versions and avoid vendor-specific extensions.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
reviewer1535340 - PeerSpot reviewer
Senior Solutions Architect at a retailer with 10,001+ employees
Real User
A unified analytics engine with a valuable parallel processing feature
Pros and Cons
  • "I like that it can handle multiple tasks parallelly. I also like the automation feature. JavaScript also helps with the parallel streaming of the library."
  • "The logging for the observability platform could be better."

What is our primary use case?

We use Apache Spark to prepare data for transformation and encryption, depending on the columns. We use AES-256 encryption. We're building a proof of concept at the moment. We prepare patches on Spark for Kubernetes on-premise and Google Cloud Platform.

What is most valuable?

I like that it can handle multiple tasks parallelly. I also like the automation feature. JavaScript also helps with the parallel streaming of the library.

What needs improvement?

The logging for the observability platform could be better.

For how long have I used the solution?

I have known about this technology for a long time, about three years.

Which solution did I use previously and why did I switch?

Because my area is data analytics and analytics solutions, I use BigQuery, SQL, and ETL. I also use Dataproc and DataFlow.

What about the implementation team?

We use an integrator sometimes, but recently we put together a team to support the infrastructural requirements. This is because the proof of concept is self-administered.

What other advice do I have?

I would recommend Apache Spark to new users, but it depends on the use case. Sometimes, it's not the best solution.

On a scale from one to ten, I would give Apache Spark a ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
reviewer1046250 - PeerSpot reviewer
Senior Consultant & Training at a tech services company with 51-200 employees
Consultant
Easy to use and is capable of processing large amounts of data
Pros and Cons
  • "The most valuable feature of this solution is its capacity for processing large amounts of data."
  • "When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data."

What is our primary use case?

We use this solution for information gathering and processing. 

I use it myself when I am developing on my laptop.

I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.

What is most valuable?

The most valuable feature of this solution is its capacity for processing large amounts of data.

This solution makes it easy to do a lot of things. It's easy to read data, process it, save it, etc.

What needs improvement?

When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable.

When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.

For how long have I used the solution?

I have been using this solution for between two and three years.

What do I think about the stability of the solution?

This solution is difficult for users who are just beginning; they often experience out-of-memory errors when dealing with large amounts of data.

How are customer service and technical support?

I have not been in contact with technical support. I find all of the answers that I need in the forums.

What other advice do I have?

The work that we are doing with this solution is quite common and is very easy to do.

My advice for anybody who is implementing this solution is to look at their needs and then look at the community. Normally, there are a lot of people who have already done what you need. So, even without experience, it is quite simple to do a lot of things.

I would rate this solution a nine out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user326142 - PeerSpot reviewer
Architect at a healthcare company with 51-200 employees
Real User
Having everything in the same framework has helped us out a lot
Pros and Cons
  • "ETL and streaming capabilities."
  • "Stability in terms of API (things were difficult, when transitioning from RDD to DataFrames, then to DataSet)."

What is most valuable?

ETL and streaming capabilities.

How has it helped my organization?

It has made big data processing more convenient, and a uniform framework adds efficiency because the same framework can be used for both batch and stream processing.

What needs improvement?

Stability in terms of API (things were difficult, when transitioning from RDD to DataFrames, then to DataSet).

For how long have I used the solution?

I have used Spark since its inception in March 2015, from Spark 1.1 onwards.

Currently, I use 2.2 extensively.

What do I think about the stability of the solution?

Yes, occasionally with different APIs.

What do I think about the scalability of the solution?

No.

How are customer service and technical support?

Since we were using the open source version of Apache Spark, without Databricks support, we never used technical support from Databricks.

Which solution did I use previously and why did I switch?

Yes, we used Hive, Pig, and Storm. Having everything in the same framework has helped us out a lot.

Which other solutions did I evaluate?

Yes, we considered other big data products in the Big Data Ecosystem.

What other advice do I have?

Go for it.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user374040 - PeerSpot reviewer
Systems Engineering Lead, Mid-Atlantic at a tech company with 10,001+ employees
Vendor
It allows you to construct event-driven information systems.

Valuable Features

Spark Streaming, which allows you to construct event-driven information systems and respond to the events in near-real time.

Improvements to My Organization

Apache Spark’s ability to perform batch processing at one second or less intervals is the most transformative and less pervasive for any data processing application. The ingested data can also be validated and verified for quality early in the data pipeline.
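The idea of running batch processing at one-second intervals can be illustrated with a minimal sketch in plain Python. This is not Spark code and does not use any Spark API: the event shape, the `micro_batches` and `process` helpers, and the sample data are all hypothetical, chosen only to show how a stream of timestamped events can be cut into small fixed-length batches and then handled with ordinary batch logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp in seconds, payload).
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], interval: float = 1.0) -> Dict[int, List[str]]:
    """Group events into consecutive fixed-length time windows."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        # Events with timestamps in [k * interval, (k + 1) * interval) share batch k.
        batches[int(timestamp // interval)].append(payload)
    return dict(batches)


def process(batches: Dict[int, List[str]]) -> List[Tuple[int, int]]:
    """Run an ordinary batch computation (here, a count) on each batch in time order."""
    return [(index, len(payloads)) for index, payloads in sorted(batches.items())]


# Illustrative data only: five events spread over three one-second windows.
sample_events = [(0.10, "a"), (0.70, "b"), (1.20, "c"), (2.05, "d"), (2.90, "e")]
batch_counts = process(micro_batches(sample_events, interval=1.0))
print(batch_counts)  # [(0, 2), (1, 1), (2, 2)]
```

Because each small batch is processed with the same logic a large batch would use, validation and quality checks can run on every window as soon as it closes rather than at the end of a long job.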

Room for Improvement

Apache Spark as a data processing engine has come a long way since its inception. Although you are able to perform complex transformations using Spark libraries, the support for SQL to perform transformations is still limited. You can alleviate some of these limitations by running Spark within the Hadoop ecosystem and by leveraging the fairly evolved HiveQL.

Use of Solution

I've used it for 16 months.

Deployment Issues

Enterprise-scale deployment of Apache Spark is somewhat involved if you want to realize its full potential for stability, scalability, and security. However, some Hadoop vendors, such as Cloudera, have integrated the Spark data processing engine into their Hadoop platforms and have made it easier to deploy, scale, and secure.

Customer Service and Technical Support

This is an open source technology and is dependent on community support. The Apache Spark community is vibrant, and it is easy to find answers to questions. Enterprises can also get commercial support from Hadoop vendors such as Cloudera. I recommend that enterprises inspect Hadoop vendors' commitment to open source, as well as their ability to curate Apache Spark technology into the rest of the ecosystem, before signing up for commercial support or a subscription.

Initial Setup

The initial set-up is straightforward as long as you have picked the right Hadoop distribution.

Implementation Team

I recommend engaging an experienced Hadoop vendor during the planning and initial implementation phases of the project. You will be able to avoid any potential pitfalls or reduce overall project time by having a Hadoop expert guiding you during the initial stages of the project.

Other Solutions Considered

I evaluated some other technologies, such as Samza, but the community backing for Apache Spark stood out.

Other Advice

I also suggest having a Chief Technologist who has extensive experience in architecting several Big Data solutions. They should be able to communicate in business as well as technology language. Their expertise should range from infrastructure to application development, and they should have command of Hadoop technologies.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user1059558 - PeerSpot reviewer
Portfolio Manager, Enterprise Solutions Architect at Capgemini
Real User
Supports streaming and micro-batch

What is our primary use case?

Streaming telematics data.

How has it helped my organization?

It's a better MR, supports streaming and micro-batch, and supports Spark ML and Spark SQL.

What is most valuable?

It supports streaming and micro-batch.

What needs improvement?

Better data lineage support.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user374028 - PeerSpot reviewer
Core Engine Engineer at a computer software company with 51-200 employees
Real User
It makes web-based queries for plotting data easier. It needs to be simpler to use the machine learning algorithms supported by Octave.

Valuable Features

  • RDDs
  • DataFrames
  • Machine learning libraries

Improvements to My Organization

Faster time to parse and compute data. It makes web-based queries for plotting data easier.

Room for Improvement

It needs to be simpler to use the machine learning algorithms supported by Octave (for example, polynomial regression and polynomial interpolation).
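For readers unfamiliar with the technique mentioned above, polynomial regression can be treated as linear regression on expanded features (1, x, x^2, ...). Spark's MLlib offers a comparable route through its `PolynomialExpansion` transformer followed by a linear regression stage. The sketch below is plain Python rather than Spark code; the helper names (`expand`, `solve`, `fit_polynomial`) and the sample data are hypothetical and serve only to show what the algorithm computes.

```python
from typing import List, Sequence


def expand(x: float, degree: int) -> List[float]:
    """Map x to the feature vector [1, x, x^2, ..., x^degree]."""
    return [x ** power for power in range(degree + 1)]


def solve(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough distinct x values")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - tail) / rows[r][r]
    return solution


def fit_polynomial(xs: Sequence[float], ys: Sequence[float], degree: int) -> List[float]:
    """Least-squares coefficients, lowest power first, via the normal equations."""
    features = [expand(x, degree) for x in xs]
    size = degree + 1
    gram = [[sum(f[i] * f[j] for f in features) for j in range(size)] for i in range(size)]
    moment = [sum(f[i] * y for f, y in zip(features, ys)) for i in range(size)]
    return solve(gram, moment)


# Illustrative data only: points generated from y = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
coefficients = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coefficients])
```

The normal equations are used here for brevity; for high degrees or large datasets they can become numerically ill-conditioned, which is one reason a library implementation is usually preferable.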

Use of Solution

I've been using it for one year.

Deployment Issues

There have been no issues with the deployment.

Stability Issues

There have been no issues with the stability.

Scalability Issues

There have been no issues with the scalability.

Customer Service and Technical Support

We still rely on user forums for our answers. We do not use a commercial product yet.

Initial Setup

The initial set-up was easy. I have not explored using this on AWS clusters.

Implementation Team

We did an in-house implementation and development for our regression tool.

ROI

The ROI will be an in-house product to do machine learning analytics on data obtained from customers.

Other Solutions Considered

We did not evaluate any other products.

Other Advice

It's easy to use, though it does have a learning curve.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Managing Consultant at a computer software company with 501-1,000 employees
Real User
Good performance and resource management for hosting our data science platform
Pros and Cons
  • "The processing time is very much improved over the data warehouse solution that we were using."
  • "I would like to see integration with data science platforms to optimize the processing capability for these tasks."

What is our primary use case?

Our use case for Apache Spark was a retail price prediction project. We were using retail pricing data to build predictive models. To start, the prices were analyzed and we created the dataset to be visualized using Tableau. We then used a visualization tool to create dashboards and graphical reports to showcase the predictive modeling data.

Apache Spark was used to host this entire project.

How has it helped my organization?

The processing time is very much improved over the data warehouse solution that we were using.

What is most valuable?

The most valuable features are the storage engine, the memory engine, and the processing engine.

What needs improvement?

I would like to see integration with data science platforms to optimize the processing capability for these tasks.

For how long have I used the solution?

I have been using Apache Spark for the past year.

How are customer service and technical support?

We have not been in contact with technical support.

What's my experience with pricing, setup cost, and licensing?

The initial setup is straightforward. It took us around one week to set it up, and then the requirements and creation of the project flow and design needed to be done. The design stage took three to four weeks, so in total, it required between four and five weeks to set up.

What other advice do I have?

I would rate this solution an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user