AmitMataghare - PeerSpot reviewer
Associate Director at a consultancy with 10,001+ employees
Real User
Top 10
High performance, beneficial in-memory support, and useful online community support
Pros and Cons
  • "One of Apache Spark's most valuable features is that it supports in-memory processing, so jobs execute much faster than with traditional tools."
  • "Apache Spark could improve the connectors that it supports. There are a lot of databases in the market, for example cloud databases such as Redshift, Snowflake, and Synapse. Apache Spark should have built-in connectors for these databases; at present, connecting to them requires a lot of workarounds."

What is our primary use case?

Apache Spark is a data processing framework that you program in languages such as Java or Python. In my most recent deployment, we used Apache Spark to build engineering pipelines to move data from sources into the data lake.

What is most valuable?

One of Apache Spark's most valuable features is that it supports in-memory processing, so jobs execute much faster than with traditional tools.

What needs improvement?

Apache Spark could improve the connectors that it supports. There are a lot of databases in the market, for example cloud databases such as Redshift, Snowflake, and Synapse. Apache Spark should have built-in connectors for these databases; at present, connecting to them requires a lot of workarounds.
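
One workaround of the kind described here is Spark's generic JDBC data source. The following is a minimal sketch, not the reviewer's setup: the helper names `jdbc_options` and `read_table` are hypothetical, the URL uses a PostgreSQL-style placeholder, and the host, database, table, and credentials are invented for illustration. The actual URL format and driver depend on the database being connected.

```python
def jdbc_options(url: str, table: str, user: str, password: str) -> dict:
    """Build the option map for Spark's generic JDBC data source."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def read_table(spark, options: dict):
    """Load a table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder values only: host, database, table, and credentials are hypothetical.
opts = jdbc_options(
    url="jdbc:postgresql://example-host:5432/analytics",
    table="public.sales",
    user="spark_reader",
    password="change-me",
)
```

Calling `read_table(spark, opts)` would only work if the matching JDBC driver JAR is available on Spark's classpath, which is part of the extra setup the reviewer is pointing at.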

For how long have I used the solution?

I have been using Apache Spark for approximately five years.

Buyer's Guide
Apache Spark
March 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: March 2025.
839,422 professionals have used our research since 2012.

What do I think about the stability of the solution?

Apache Spark is stable.

What do I think about the scalability of the solution?

I have found Apache Spark to be scalable.

How are customer service and support?

Apache Spark is open-source, so no team will give you dedicated support. You can post your queries on the community forums, and you will usually receive a good response. Because you depend on freelance developers to respond, you cannot put a time limit on it, but the response, on average, is pretty good.

How was the initial setup?

If Apache Spark is in the cloud, whether on Amazon, GCP, or Microsoft, setting it up takes only minutes. However, if you are using the on-premises version, it might take some time to set up the environment.

What other advice do I have?

I rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Software Architect at Akbank
Real User
Provides fast aggregations, AI libraries, and a lot of connectors
Pros and Cons
  • "AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
  • "Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."

What is our primary use case?

We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase. 

Spark is mainly used for aggregations and AI (for future usage). It gathers stuff from Couchbase and does the calculations. We are not actively using Spark AI libraries at this time, but we are going to use them.  

This project is for classifying the transactions and finding suspicious activities, especially those suspicious activities that come from internet channels such as internet banking and mobile banking. It tries to find out suspicious activities and executes rules that are being developed or written by our business team. An example of a rule is that if the transaction count or transaction amount is greater than 10 million Turkish Liras and the user device is new, then raise an exception. The system sends an SMS to the user, and the user can choose to continue or not continue with the transaction.
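
The example rule above can be sketched in plain Python. This is an illustration only, not the bank's implementation: the `Transaction` fields and the function name are assumptions, and the sketch checks only the transaction amount, leaving out the transaction-count variant of the rule.

```python
from dataclasses import dataclass

# Threshold from the example rule in the review, in Turkish Liras.
AMOUNT_LIMIT_TRY = 10_000_000


@dataclass
class Transaction:
    # Field names are illustrative assumptions, not the reviewer's schema.
    amount_try: float
    device_is_new: bool


def needs_confirmation(tx: Transaction, limit: float = AMOUNT_LIMIT_TRY) -> bool:
    """Return True when the rule fires: amount above the limit AND a new device."""
    return tx.amount_try > limit and tx.device_is_new
```

When `needs_confirmation` returns `True`, the flow described in the review would send the SMS and let the user decide whether to continue.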

How has it helped my organization?

Aggregations have been very fast in our project since we started to use Spark. We can return results in around 300 milliseconds. Before using Spark, the time was around 700 milliseconds.

Before using Spark, we only used Couchbase. We needed fast results for this project because transactions come from various channels, and we need to decide and resolve them as early as possible because users are performing the transaction. If our result or process takes longer, users might stop or cancel their transactions, which means losing money. Therefore, a fast result time is very important for us.

What is most valuable?

AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI. 

What needs improvement?

Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.

For how long have I used the solution?

I am a Java developer. I have been interested in Spark for around five years. We have been actively using it in our organization for almost a year.

What do I think about the stability of the solution?

It is the most stable platform. Compared to Flink, Spark is good, especially in terms of clusters and architecture. My colleagues who set up these clusters say that Spark is the easiest.

What do I think about the scalability of the solution?

It is scalable, but we don't have the need to scale it. 

It is mainly used for reporting big data in our organization. All teams, especially the VR team, are using Spark for job execution and remote execution. I can say that 70% of users use Spark for reporting, calculations, and real-time operations. We are a very big company, and we have around a thousand people in IT.

We will continue its usage and develop more. We have kind of just started using it. We finished this project just three months ago. We are now trying to find out bottlenecks in our systems, and then we are ready to go.

How are customer service and technical support?

We have not used Apache support. We have only used Cloudera support for this project, and they helped us a lot during the development cycle of this project. 

How was the initial setup?

I don't have any idea about it. We are a big company, and we have another group for setting up Spark.

What other advice do I have?

I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult. 

If you are clear about your needs, it is easier to set it up. If you know how Spark is used in your project, you have to define firewall rules and cluster needs. When you set up Spark, it should be ready for people's usage, especially for remote job execution. 

I would rate Apache Spark a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Lucas Dreyer - PeerSpot reviewer
Data Engineer at BBD
Real User
Top 5
Leaderboard
A reliable and scalable open-source framework for big data processing that excels in speed, fault tolerance, and support for various data sources
Pros and Cons
  • "It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."
  • "One limitation is that not all machine learning libraries and models support it."

What is our primary use case?

We use it for data engineering and analytics to process and examine extensive datasets.

What is most valuable?

It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained.

What needs improvement?

One limitation is that not all machine learning libraries and models support it. While libraries like Scikit-learn may work with some Spark-compatible models, not all machine-learning tools are compatible with Spark. In such cases, you may need to extract data from Spark and train your models on smaller datasets instead of directly using Spark for training.
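
The extraction step described here might look like the sketch below. It assumes `spark_df` is a PySpark DataFrame and that the sampled data fits in the driver's memory; the function name `sample_to_pandas` is hypothetical. The resulting pandas DataFrame could then be passed to a library such as Scikit-learn.

```python
def sample_to_pandas(spark_df, fraction: float, seed: int = 42):
    """Down-sample a Spark DataFrame and collect the sample to the driver as pandas."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # sample() draws an approximate fraction of rows; toPandas() collects them locally.
    return spark_df.sample(fraction=fraction, seed=seed).toPandas()
```

The trade-off is the one the reviewer names: the model is trained on a smaller dataset rather than on the full data held in Spark.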

For how long have I used the solution?

I have been using it for four years.

What do I think about the stability of the solution?

I have not encountered any significant stability issues and it has proven to be a robust and reliable platform without major crashes. However, there have been instances where I needed to address query optimization and similar tasks to ensure optimal performance. I would rate it nine out of ten.

How are customer service and support?

To rate my overall experience, I would give it an eight out of ten, leaving room for potential improvements in terms of technical support.

How would you rate customer service and support?

Positive

Which solution did I use previously and why did I switch?

We used Pandas data frames and SQL-type queries for smaller datasets, but we haven't worked with anything on the scale of Spark SQL.

How was the initial setup?

I haven't handled the deployment process, but setting it up on the cloud seems relatively straightforward.

What about the implementation team?

Setting it up on-premises might take longer, potentially a couple of days. However, when deploying it on the cloud, the process can be significantly quicker, possibly taking only a few hours.

What's my experience with pricing, setup cost, and licensing?

The cloud model can be expensive as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing. Managing costs in a cloud environment can be challenging due to the cumulative expenses associated with running and maintaining Spark. Licensing costs may not be the primary concern, but operational costs in the cloud can add up. For on-premises deployments, maintenance costs include cluster management, job optimization, and upgrades. In the cloud, maintenance costs are relatively lower, especially with managed database clusters, but they still exist and primarily revolve around cluster upkeep.

Which other solutions did I evaluate?

We evaluated Microsoft Synapse, which offers similar analytics functionality but not quite at the same scale as Apache Spark. While some tasks can be accomplished with Synapse, there are certain capabilities, such as micro-batching and scalability, where Spark excels and remains unmatched.

What other advice do I have?

Additional skill requirements are crucial to use the solution and its related features effectively. Training costs and efforts may be necessary to ensure individuals are proficient in using these technologies. Overall, I would rate it nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at FORMCEPT Technologies
Real User
Top 10
Enables us to process data from different data sources
Pros and Cons
  • "We use Spark to process data from different data sources."
  • "In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."

What is our primary use case?

Our primary use case is interactively processing large volumes of data.

What is most valuable?

We use Spark to process data from different data sources. 

What needs improvement?

In data analysis, you need to take real-time data from different data sources. You need to process it, and do the transformation, in under a second.

For how long have I used the solution?

I have been using Apache Spark for eight to nine years. 

What do I think about the stability of the solution?

It is a stable solution. The solution is ten out of ten on stability. 

What do I think about the scalability of the solution?

The solution is highly scalable. All of the technical guys use Spark. Our product is used by many people within our customers' company.

How was the initial setup?

The initial setup is straightforward. 

What's my experience with pricing, setup cost, and licensing?

The solution is moderately priced. 

What other advice do I have?

I rate the overall solution a ten out of ten. 

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
reviewer1759647 - PeerSpot reviewer
Information Technology Business Analyst at an aerospace/defense firm with 10,001+ employees
Real User
A highly scalable and affordable tool that can be used to gather information from different systems
Pros and Cons
  • "The product is useful for analytics."
  • "The product could improve the user interface and make it easier for new users."

What is most valuable?

We use it as an ETL tool to gather information from different systems. The product is useful for analytics.

What needs improvement?

The product could improve the user interface and make it easier for new users. It has a steep learning curve.

For how long have I used the solution?

I have been using the product for approximately three to four years. Currently, I am using the latest version.

What do I think about the stability of the solution?

The tool is stable. I rate the stability a ten out of ten.

What do I think about the scalability of the solution?

The tool is very scalable. I rate the scalability a ten out of ten. Approximately 30 users are using Apache Spark in our organization.

How are customer service and support?

We are using the free version of the product. So, we are not using any support.

How would you rate customer service and support?

Positive

How was the initial setup?

The basic installation is easy. However, we are working in the security business and need a very secure installation. It has been quite difficult. I rate the basic installation a ten out of ten. I rate the ease of setup a two or three out of ten for a more secure installation with all the security features. The solution is deployed on-premises in our organization. The deployment process requires a couple of weeks.

What's my experience with pricing, setup cost, and licensing?

We are using the free version of the solution.

What other advice do I have?

I would recommend the product. I think it's a good solution for analytics. Overall, I rate the product an eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
CTO at Hammerknife
Real User
Top 5
Provides a valuable implementation of distributed data processing with a simple setup process
Pros and Cons
  • "Apache Spark provides a very high-quality implementation of distributed data processing."
  • "There were some problems related to the product's compatibility with a few Python libraries."

What is our primary use case?

We use the product for real-time data analysis.

What is most valuable?

Apache Spark provides a very high-quality implementation of distributed data processing. I rate it 20 on a scale of one to ten.

What needs improvement?

There were some problems related to the product's compatibility with a few Python libraries. But I suppose they are fixed.

For how long have I used the solution?

We have been using Apache Spark for the last two to three years.

What do I think about the stability of the solution?

I rate the product's stability a ten out of ten.

What do I think about the scalability of the solution?

The product is enormously scalable.

How was the initial setup?

The initial setup process is simple if you are a good professional. You have to select a few parameters and press enter. It is already integrated into the Databricks platform. One person is enough to manage small and medium implementations.

What's my experience with pricing, setup cost, and licensing?

It is an open-source platform. We do not pay for its subscription.

Which other solutions did I evaluate?

We are evaluating a few analytics engineering and DBT solutions. For now, Spark is in the secondary position.

What other advice do I have?

I recommend Apache Spark for batch analytics features.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
PLC Programmer at Alzero
Real User
Top 20
Highly-recommended robust solution for data processing
Pros and Cons
  • "I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten."
  • "The solution’s integration with other platforms should be improved."

What is our primary use case?

We are a software solutions company that serves a variety of industries, including banking, insurance, and industrial sectors. The product is specifically employed for managing data platforms for our customers.


What is most valuable?

The solution, as a package, excels across the board. I appreciate everything, not just one or two specific features.


What needs improvement?

The solution’s integration with other platforms should be improved.


For how long have I used the solution?

I have been using the solution for the past eight years. Currently, I’m using the latest version of the solution.


What do I think about the stability of the solution?

The solution is highly stable. I rate it a perfect ten.


What do I think about the scalability of the solution?

The solution is highly scalable. I rate it a perfect ten.


How was the initial setup?

The initial setup was straightforward and was conducted on the cloud. The entire deployment process took just 15 minutes. The deployment process involves provisioning the compute resources using Terraform.


What's my experience with pricing, setup cost, and licensing?

The solution is affordable and there are no additional licensing costs.


What other advice do I have?

I recommend using the solution. Overall, I rate the solution a perfect ten.


Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
PeerSpot user
Oscar Estorach - PeerSpot reviewer
Chief Data-strategist and Director at Theworkshop.es
Real User
Top 10
Scalable, open-source, and great for transforming data
Pros and Cons
  • "The solution has been very stable."
  • "It's not easy to install."

What is our primary use case?

You can do a lot of things in terms of the transformation of data. You can store and transform and stream data. It's very useful and has many use cases.

What is most valuable?

Overall, it's a very nice tool.

It is great for transforming data and doing micro-streamings or micro-batching.

The product offers an open-source version.

The solution has been very stable.

The scalability is good.

Apache Spark is a huge tool. It has many use cases and is very flexible. You can use it with so many other platforms. 

Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java.
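
The micro-batching mentioned above means handling a continuous stream as a sequence of small batches. The sketch below illustrates only the idea in plain Python; it is not Spark's implementation, the function name `micro_batches` is hypothetical, and it groups records by count as a simplification, whereas a streaming engine would typically trigger batches on a time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` records from a stream."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Process each small batch as a unit, here by summing its values.
totals = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
```

Each batch is transformed as a whole, which is what lets the same transformation code serve both batch and streaming workloads.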

What needs improvement?

If you are developing projects and need to put them into a production scenario, you need a cluster of servers, as Spark requires distributed computing.

It's not easy to install. You are typically dealing with a big data system.

It's not a simple, straightforward architecture. 

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The stability is very good. There are no bugs or glitches and it doesn't crash or freeze. It's a reliable solution. 

What do I think about the scalability of the solution?

We have found the scalability to be good. If your company needs to expand it, it can do so.

We have five people working on the solution currently.

How are customer service and technical support?

There isn't really technical support for open source. You need to do your own studying. There are lots of places to find information. You can find details online, or in books, et cetera. There are even courses you can take that can help you understand Spark.

Which solution did I use previously and why did I switch?

I also use Databricks, which I use in the cloud.

How was the initial setup?

When handling big data systems, the installation is a bit difficult. When you need to deploy the systems, it's better to use services like Databricks.

I am not a professional admin. I am a developer, and I design architecture.

You can use it on a standalone system; however, that's not the best way. It would be okay for small pieces of code, not for production.

What's my experience with pricing, setup cost, and licensing?

We use the open-source version. It is free to use. However, you do need to have servers. We have three or four. They can be on-premises or in the cloud.

What other advice do I have?

I have the solution installed on my computer and on our servers. You can use it on-premises or as a SaaS.

I'd rate the solution at a nine out of ten. I've been very pleased with its capabilities. 

I would recommend the solution for the people who need to deploy projects with streaming. If you have many different sources or different types of data, and you need to put everything in the same place - like a data lake - Spark, at this moment, has the right tools. It's an important solution for data science, for data detectors. You can put all of the information in one place with Spark.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user