reviewer1283880 - PeerSpot reviewer
CEO International Business at a tech services company with 1,001-5,000 employees
MSP
Top 5
A powerful open-source framework for fast, flexible, and versatile big data processing, with a steep learning curve and significant resource demands
Pros and Cons
  • "The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations."
  • "It requires overcoming a significant learning curve due to its robust and feature-rich nature."

What is our primary use case?

In AI deployment, a key step is aggregating data from various sources, such as customer websites, debt records, and asset information. Apache Spark plays a vital role in this process, efficiently handling continuous streams of data. Its streaming capability enables us to gather diverse data and feed it into the system, facilitating effective processing and analysis for generating alerts and insights, particularly in scenarios like banking.
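As a sketch of the kind of streaming ingestion and alerting described here (the Kafka topic, broker address, schema, and threshold are all hypothetical, and this assumes a Spark deployment with the Kafka connector available):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("alerts").getOrCreate()

# Hypothetical event schema for incoming transactions
schema = (StructType()
          .add("customer_id", StringType())
          .add("amount", DoubleType()))

# Read a continuous stream of events from a (hypothetical) Kafka topic
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Flag large transactions; in practice the sink would be a real alerting system
alerts = events.filter(col("amount") > 10000)
query = alerts.writeStream.format("console").start()
```

This is only an illustrative skeleton of continuous ingestion plus rule-based alerting, not the reviewer's actual pipeline.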

What is most valuable?

The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations.

What needs improvement?

It requires overcoming a significant learning curve due to its robust and feature-rich nature.

For how long have I used the solution?

We have been using it for two years now.

Buyer's Guide
Apache Spark
February 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: February 2025.
832,138 professionals have used our research since 2012.

What do I think about the stability of the solution?

It provides excellent stability. We never faced any issues with it.

What do I think about the scalability of the solution?

It offers outstanding scalability.

Which solution did I use previously and why did I switch?

Opting for Apache Spark, an open-source solution, provides a distinct advantage by offering control over the code. This means you can identify issues, make necessary fixes, and determine what aspects to accept as they are. In contrast, dealing with a vendor may limit control, requiring you to submit requests and advocate for changes based on your business volume with them. This dependency on volume can potentially compromise control. To safeguard both your customers and your business, the choice of an open-source solution like Apache Spark allows for more autonomy and control over the technology stack.

What about the implementation team?

The system's smooth operation relies on deploying a comprehensive container with Kubernetes clusters, configured with essential toolsets. Instrumentation data from the backend is fed back to a central framework equipped with specific tools for driving various processes. In a case involving a customer with Red Hat and Postini clusters, the OpenShift Container Platform, comprising Kubernetes clusters, is used. The tools manage onboarding, infrastructure provisioning, certificate management, authorization control, etc. The deployment spans multiple independent data centers, like telecom circles in India, requiring unique approaches for various tasks, including disaster recovery planning and central alerting, facilitated through SaaS. The deployment process typically takes approximately forty to forty-five days for six thousand servers.

What was our ROI?

It provides a dual advantage by saving both time and money while enhancing performance, particularly by leveraging my skill sets. 

What's my experience with pricing, setup cost, and licensing?

It is an open-source solution, so it is free of charge.

What other advice do I have?

I would give it a rating of seven out of ten, which, by my standards, is quite high.

Which deployment model are you using for this solution?

Private Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Mahdi Sharifmousavi - PeerSpot reviewer
Lecturer at Amirkabir University of Technology
Real User
A scalable solution that can grow with the needs of a business, and provides excellent functionality for analytical tasks
Pros and Cons
  • "This solution provides a clear and convenient syntax for our analytical tasks."
  • "This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed."

What is our primary use case?

We use this solution for its anti-money laundering and direct marketing features within a banking environment.

What is most valuable?

This solution provides a clear and convenient syntax for our analytical tasks.

What needs improvement?

This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed.

There is also limited Python compatibility, which should be improved.

For how long have I used the solution?

We have used this solution for around seven years, through several versions.

What do I think about the stability of the solution?

We have found this solution to be stable during our time using it.

What do I think about the scalability of the solution?

This is a very scalable solution from our experience.

What about the implementation team?

We implemented the solution using our in-house team, but the UI was developed using a third party contractor.

What's my experience with pricing, setup cost, and licensing?

The deployment time of this solution depends on the requirements of an organization and the compatibility of the systems they will be using alongside it. We would recommend that these be clearly defined when designing the product for the business's needs.

What other advice do I have?

I would rate this solution a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
PeerSpot user
Director of Engineering at Sigmoid
Real User
Easy to code, fast, open-source, very scalable, and great for big data
Pros and Cons
  • "Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
  • "Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."

What is our primary use case?

I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.

How has it helped my organization?

Spark has been at the forefront of data processing engines. I have used Apache Spark on multiple projects for different clients. It is an excellent tool for processing massive amounts of data.

What is most valuable?

Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica.

Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark.
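The SQL/API equivalence described here looks roughly like this in PySpark (a sketch assuming a local session and a made-up sales dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
df = spark.createDataFrame(
    [("books", 120.0), ("games", 80.0), ("books", 40.0)],
    ["category", "amount"],
)

# Option 1: the DataFrame API
api_result = df.groupBy("category").sum("amount")

# Option 2: plain SQL against the same data
df.createOrReplaceTempView("sales")
sql_result = spark.sql("SELECT category, SUM(amount) FROM sales GROUP BY category")

# Both formulations compile to the same optimized plan under Catalyst,
# which is why existing SQL skills carry over directly.
```

Either style can be mixed freely within one job, which is the convenience the review points to.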

What needs improvement?

Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.
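For context, wiring up the history server today means enabling event logging and launching a separate daemon, roughly like this (the log directory path is illustrative):

```
# spark-defaults.conf -- enable event logging so the history server has data
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs

# Then start the daemon separately:
#   $SPARK_HOME/sbin/start-history-server.sh
```

It is this separate configure-and-start step that the review suggests folding into the default setup.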

For how long have I used the solution?

I have been using this solution for around 7 years.

What do I think about the stability of the solution?

There were bugs three to four years ago, which have been resolved. There were a couple of issues related to slowness when we did a lot of transformations using withColumn calls. I was writing a POC on ETL for moving from Informatica to Spark SQL for the ETL pipeline. It required hundreds of withColumn calls to change column names or add transformations, which made it slow. It happened in versions prior to 1.6, and it seems this issue has since been fixed.
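A likely culprit here is Spark's `withColumn` API: each call adds a new projection to the logical plan, so hundreds of chained calls inflate planning time. Collapsing them into a single `select` is the usual workaround (a sketch, assuming a DataFrame `df` with columns `a` and `b`):

```python
from pyspark.sql import functions as F

# Slow pattern: many chained withColumn calls blow up the logical plan
df2 = (df.withColumn("a_renamed", F.col("a"))
         .withColumn("b_doubled", F.col("b") * 2))

# Faster: express all renames and transformations in one select
df3 = df.select(
    F.col("a").alias("a_renamed"),
    (F.col("b") * 2).alias("b_doubled"),
)
```

Both produce the same result; the second keeps the plan flat regardless of how many columns are transformed.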

What do I think about the scalability of the solution?

It is very scalable. You can scale it a lot.

How are customer service and support?

I haven't contacted them.

How was the initial setup?

The initial setup was a little complex when I was using open-source Spark. I was doing a POC in the on-premise environment, and the initial setup was a little cumbersome. It required a lot of setup on Unix systems. We also had to do a lot of configuration and install a lot of components.

After I moved to the Cloudera CDH version, it was a little easier. It is a bundled product, so you just install whatever you want and use it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera.

What other advice do I have?

I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy.

I would rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

Public Cloud


Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at FORMCEPT Technologies
Real User
Top 10
Handles large volume data, cloud and on-premise deployments, but difficult to use
Pros and Cons
  • "Apache Spark can do large volume interactive data analysis."
  • "Apache Spark is very difficult to use. It would require a data engineer. Not every engineer can pick it up today, because they need to understand the different concepts of Spark, which are very difficult and not easy to learn."

What is our primary use case?

The solution can be deployed on the cloud or on-premise.

How has it helped my organization?

We are using Apache Spark for large-volume interactive data analysis.

MechBot is an enterprise platform for trusted data excellence with one-click installation. Underneath, it uses Apache Spark, Kafka, Hadoop HDFS, and Elasticsearch.

What is most valuable?

Apache Spark can do large volume interactive data analysis.

What needs improvement?

Apache Spark is very difficult to use. It would require a data engineer. Not every engineer can pick it up today, because they need to understand the different concepts of Spark, which are very difficult and not easy to learn.

For how long have I used the solution?

I have been using Apache Spark for approximately 11 years.

What do I think about the stability of the solution?

The solution is stable.

What do I think about the scalability of the solution?

Apache Spark is scalable. However, it needs enormous technical skills to make it scalable. It is not a simple task.

We have approximately 20 people using this solution.

How was the initial setup?

If you want to distribute Apache Spark in a certain way, it is simple, but not every engineer can do it. What is required is specialized DevOps skills on Spark.

If we deploy the solution as a single-node installation on a laptop, it is very straightforward, but that is not what someone is going to deploy at a production site.

What's my experience with pricing, setup cost, and licensing?

Since we are using the Apache Spark version rather than the Databricks version, it is under the Apache license, and support and bug resolution are often late or delayed. The Apache license is free.

What other advice do I have?

We are well versed in Spark, its versions, and its internal structure, and we know exactly what Spark is doing.

The solution cannot be made much easier. Not everything can be simplified, because it involves core data concepts, computer science, and engineering, and not many people are actually aware of that.

I rate Apache Spark a six out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
reviewer879201 - PeerSpot reviewer
Technical Consultant at a tech services company with 1-10 employees
Consultant
Good streaming features enable data ingestion and analysis within Spark Streaming
Pros and Cons
  • "I feel the streaming is its best feature."
  • "When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."

What is our primary use case?

We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.

What is most valuable?

I have worked with Hadoop a lot in my career, and you need to do a lot of things to get it to Hello World. But in Spark it is easy. You could say it's an umbrella that does everything under one roof. It also has Spark Streaming. I feel the streaming is its best feature because I have used it to ingest data and run analysis within Spark Streaming.

What needs improvement?

I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like the community contributions, and if you want to connect Spark with Hadoop, it's not a big thing, but for other things, such as using Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that handles all these configurations, like on Windows, where the whole solution takes care of the back end. That kind of solution would help. But still, it can do everything for a data scientist.

Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best.

Overall, it offers everything that I can imagine right now. 

For how long have I used the solution?

I have been using Apache Spark for a couple of months.

What do I think about the stability of the solution?

In terms of stability, I have not seen any bugs, glitches or crashes. Even if there is, that's fine, because I would probably take care of it and then I'd have progressed further in the process.

What do I think about the scalability of the solution?

I have not tested the scalability yet.

In my company, there are two or three people that are using it for different products. But right now, the client I'm engaged with doesn't know anything about Spark or Hadoop. They are a typical financial company so they do what they do, and they ask us to do everything. They have pretty much outsourced their whole big data initiative to us.

Which solution did I use previously and why did I switch?

I have used MapReduce from Hadoop previously. Otherwise, I haven't used any other big data infrastructure.

In my previous work, not at this company, I was working with some big data, but I was extracting it using a single core of my PC. I realized over time that my system had eight cores, so instead I used all of those cores with multi-core programming. Then I realized that Hadoop and Spark do the same thing, just across different PCs. That is when I understood the point: multi-core programming leads naturally to Spark, Hadoop, and similar tools.
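The jump from single-core extraction to using all eight cores can be sketched with the Python standard library; the same map-style parallelism is what Spark generalizes from one machine to a cluster (a minimal illustration, not Spark itself):

```python
from multiprocessing import Pool

def tokenize(line):
    """Count whitespace-separated tokens in one line."""
    return len(line.split())

if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    # Single-core: one process walks the whole list.
    serial = [tokenize(l) for l in lines]
    # Multi-core: the list is partitioned across worker processes,
    # much as Spark partitions a dataset across executors.
    with Pool(processes=4) as pool:
        parallel = pool.map(tokenize, lines)
    assert serial == parallel
```

Swapping `Pool.map` for a Spark RDD or DataFrame operation keeps the same shape but distributes the partitions across machines instead of cores.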

How was the initial setup?

The initial setup to get it to Hello World is pretty easy, you just have to install it. But when you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources. But you can get a lot of help from different sources on the internet. So it's great. A lot of people are doing it.

I work with a startup company. You know that in startups you do not have the luxury of different people doing different things, you have to do everything on your own, and it's an opportunity to learn everything. In a typical corporate or big organization you only have restricted SOPs, you have to work within the boundaries. In my organization, I have to set up all the things, configure it, and work on it myself.

What's my experience with pricing, setup cost, and licensing?

I would suggest not to try to do everything at once. Identify the area where you want to solve the problem, start small and expand it incrementally, slowly expand your vision. For example, if I have a problem where I need to do streaming, just focus on the streaming and not on the machine learning that Spark offers. It offers a lot of things but you need to focus on one thing so that you can learn. That is what I have learned from the little experience I have with Spark. You need to focus on your objective and let the tools help you rather than the tools drive the work. That is my advice.

What other advice do I have?

On a scale of 1 to 10, I'd put it at an eight.

To make it a perfect 10, I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened in the configuration and back end. So I think installation and configuration should be streamlined with supporting tools. We are technical people and can figure it out, but if aspects like that were improved, people who are less technical would use it, and it would be more adaptable for the end user.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Onur Tokat - PeerSpot reviewer
Big Data Engineer Consultant at Collective[i]
Consultant
Scala-based solution with good data evaluation functions and distribution
Pros and Cons
  • "Spark can handle small to huge data and is suitable for any size of company."
  • "Spark could be improved by adding support for other open-source storage layers than Delta Lake."

What is our primary use case?

I mainly use Spark to prepare data for processing because it has APIs for data evaluation. 

What is most valuable?

The most valuable feature is that Spark uses Scala, which has good data evaluation functions. Spark also supports good distribution on the clusters and provides optimization on the APIs.

What needs improvement?

Spark could be improved by adding support for other open-source storage layers than Delta Lake. The UI could also be enhanced to give more data on resource management.

For how long have I used the solution?

I've been using Spark for six years.

What do I think about the stability of the solution?

Generally, Spark works correctly without any errors. It may give out some errors if your data changes, but in that case, it's a problem with the configuration, not with Spark.

What do I think about the scalability of the solution?

The cloud version of Spark is very easy to scale.

How was the initial setup?

The initial setup is not complex, but it depends on the product's components and the architecture. For example, if you use Hadoop, setup may not be easy. Deployment takes about a week, but a Spark cluster can be installed on virtual infrastructure in a day.

What other advice do I have?

Spark can handle small to huge data and is suitable for any size of company. I would rate Spark as eight out of ten. 

Which deployment model are you using for this solution?

On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
PeerSpot user
it_user372393 - PeerSpot reviewer
Big Data Consultant at a tech services company with 501-1,000 employees
Consultant
We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.
Pros and Cons
  • "The good performance. The nice graphical management console. The long list of ML algorithms."
  • "Apache Spark provides very good performance. The tuning phase is still tricky."

What is most valuable?

The good performance. The nice graphical management console. The long list of ML algorithms.

How has it helped my organization?

We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.

What needs improvement?

Apache Spark provides very good performance. The tuning phase is still tricky.

For how long have I used the solution?

I've used it for 2 years.

What was my experience with deployment of the solution?

We didn't have an issue with the deployment.

What do I think about the stability of the solution?

In the past, we deployed Spark 1.3 to use Spark SQL, but unfortunately one of our queries failed because of a bug fixed in subsequent releases. Then we moved to Spark 1.6, but still some queries failed when run against huge datasets. Now we are using version 2.1: it is more stable, it delivers better performance, and the SQL/ML parts are richer than before.

What do I think about the scalability of the solution?

I've had no issues with the scalability.

How are customer service and technical support?

Customer Service:

I've never had to use customer service.

Technical Support:

I've never had to use technical support.

How was the initial setup?

The initial setup is quite complex because you have to set up many different configuration parameters that are deployment-specific. It is not trivial to arrive at the correct configuration with so many variables involved.
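As an illustration of the kind of deployment-specific parameters involved, a typical job submission might look like this (every value is workload-dependent and shown only as an example, not a recommendation):

```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  my_job.py
```

Executor count, cores, memory, and shuffle partitioning all interact with cluster size and data volume, which is why finding the right combination takes experimentation.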

What about the implementation team?

In-house team. The setup itself is not a problem when you just have to test the system. The challenging part is discovering the optimal configuration needed to obtain a production system with good performance.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user365304 - PeerSpot reviewer
Software Consultant at a tech services company with 10,001+ employees
Real User
It provides large-scale data processing with negligible latency at the cost of commodity hardware.

Valuable Features:

The most important feature of Apache Spark is that it provides large-scale data processing with negligible latency at the cost of commodity hardware. The Spark framework is a blessing compared to Hadoop, as the latter does not allow fast processing of data, which Spark accomplishes through in-memory data processing.

Improvements to My Organization:

Apache Spark is a framework that allows an organization to perform business and data analytics at a very low cost compared to Ab Initio or Informatica. By using Apache Spark in place of those tools, an organization can achieve a huge reduction in cost without compromising on data security or other data-related issues, provided it is controlled by an expert Scala programmer. Apache Spark also does not carry Hadoop's overhead of high latency. My organization benefits from all of these points as well.

Room for Improvement:

The question of improvement always comes to developers' minds. In line with the most common developer request, if a user-friendly GUI with a drag-and-drop feature could be added to this framework, it would be easier to use.

Another thing to mention: there is always room for improvement in memory usage. If, in the future, processing can be done with less memory, that would obviously be better.

Deployment Issues:

We've had no issues with deployment.

Stability Issues:

See above regarding memory usage.

Scalability Issues:

We've had no issues with scalability.

Other Advice:

My advice to others would be to use Apache Spark for large-scale data processing, as it provides good performance at low cost, unlike Ab Initio or Informatica. The main problem is that there are currently not many people in the market certified in Apache Spark.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
it_user371334 - PeerSpot reviewer
CEO at a tech consulting company with 51-200 employees
Consultant

The drag-and-drop GUI comment is very true. We developed such a GUI for spatial and time-series data in Spark. But there are other tools out there. Maybe you should do a review of such tools.