Hamid M. Hamid - PeerSpot reviewer
Data architect at a university with 5,001-10,000 employees
Real User
Top 20
Feb 12, 2024
Along with the easy cluster deployment process, the tool also has the ability to process huge datasets
Pros and Cons
  • "The deployment of the product is easy."
  • "Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users."

What is our primary use case?

In my company, the solution is used for batch processing or real-time processing.

What needs improvement?

The product has matured at the moment. The product's interoperability is an area of concern where improvements are required.

Apache Spark can be integrated with high-tech tools like Informatica. Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users.

For how long have I used the solution?

I have been using Apache Spark for three years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution a nine out of ten.

Buyer's Guide
Apache Spark
February 2026
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: February 2026.
882,606 professionals have used our research since 2012.

What do I think about the scalability of the solution?

It is a very scalable solution. Scalability-wise, I rate the solution a nine out of ten.

I cannot give a specific number of users for Apache Spark in my company since it is used as a processing engine.

How are customer service and support?

Apache Spark is an open-source tool, so the only support users can get for the tool is from different vendors like Cloudera or HPE.

Which solution did I use previously and why did I switch?

In the past, my company has used certain ETL tools, like Informatica, based on the performance levels offered.

How was the initial setup?

The deployment of the product is easy.

Apache Spark's cluster deployment process is very easy.

Only the application itself needs to be deployed in order to run on Apache Spark. Apache Spark itself is a setup tool. Deploying an application on Apache Spark is easy for a user: you write the code in Scala and submit it to the cluster, and the deployment is done in one step.

The solution is deployed on an on-premises model.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source tool. It is not an expensive product.

What other advice do I have?

The tool is used for real-time data analytics as it is very powerful and reliable. The code that you write with Apache Spark provides stability. Bugs can still appear depending on the code that you write, which could be in Java or Scala. Apache Spark is very reliable, powerful, and fast as an engine. Compared with a competitor like MapReduce, Apache Spark performs up to 100 times faster.

The monitoring part of the product is good.

The product offers clusters that are resilient and can run into multiple nodes.

The tool can run with multiple clusters.

The integration capabilities of the product with other platforms to improve our company's workflow are good.

In terms of the improvements in the product in the data analysis area, new libraries have been launched to support AI and machine learning.

My company is able to process huge datasets with Apache Spark. There is a huge value added to the organization because of the tool's ability to process huge datasets.

I rate the overall solution a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Ilya Afanasyev - PeerSpot reviewer
Senior Software Development Engineer at a media company with 201-500 employees
Real User
Aug 22, 2022
Reliable, able to expand, and handle large amounts of data well
Pros and Cons
  • "There's a lot of functionality."
  • "I know there is always discussion about which language to write applications in and some people do love Scala. However, I don't like it."

What is our primary use case?

It's a root product that we use in our pipeline.

We have some input data. For example, we have one system that supplies some data to MongoDB, for example, and we pull this data from MongoDB, enrich this data from other systems - with some additional fields - and write to S3 for other systems. Since we have a lot of data, we need a parallel process that runs hourly.
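The pull, enrich, and write flow described above can be sketched in plain Python. This is only an illustration of the shape of such a job, not the reviewer's pipeline and not Spark code: the record fields, the `REGION_BY_USER` lookup, and the function names are all hypothetical, and a thread pool stands in for the cluster's parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical lookup data standing in for the "other systems".
REGION_BY_USER = {"u1": "EU", "u2": "US"}


def enrich(record):
    """Return a copy of the record with an additional 'region' field."""
    enriched = dict(record)
    enriched["region"] = REGION_BY_USER.get(record["user_id"], "unknown")
    return enriched


def enrich_partition(partition):
    """Enrich every record in one partition."""
    return [enrich(record) for record in partition]


def run_hourly_batch(partitions, max_workers=4):
    """Enrich partitions in parallel and return one flat list of results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of the input partitions.
        results = list(pool.map(enrich_partition, partitions))
    return [record for part in results for record in part]


# Made-up input, already split into partitions (assumed to come from the source store).
source_partitions = [
    [{"user_id": "u1", "event": "play"}],
    [{"user_id": "u2", "event": "pause"}, {"user_id": "u3", "event": "stop"}],
]
output = run_hourly_batch(source_partitions)
```

In a real job the final list would be written to the target storage instead of being kept in memory; that step is left out here because it depends on the environment.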

What is most valuable?

We use batch processing. It works well with our formats and file versions. There's a lot of functionality. 

In our pipeline, each hour we copy the changes from MongoDB to a specific file. Previously, the pipeline copied all of the data from all of the tables on every run, even when nothing had changed. The tables hold a lot of data, and the latest MongoDB version makes it possible to read only the changed data. This reduced the cost and the configuration of the cluster, and we saved about $150,000.
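The idea of reading only changed data, rather than re-copying every table each hour, can be illustrated with a simple timestamp watermark. This is a minimal plain-Python sketch under stated assumptions: the `updated_at` field, the function names, and the sample rows are invented for illustration, and it does not use MongoDB's or Spark's actual APIs.

```python
from datetime import datetime, timezone


def read_changed(records, last_run):
    """Return only the records modified after the previous run."""
    return [r for r in records if r["updated_at"] > last_run]


def hourly_copy(records, last_run, now):
    """Select the changed records and return them with the new watermark."""
    changed = read_changed(records, last_run)
    return changed, now


# Made-up table contents; 'updated_at' is an assumed field.
table = [
    {"id": 1, "updated_at": datetime(2022, 8, 1, 9, 30, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2022, 8, 1, 10, 15, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2022, 8, 1, 10, 45, tzinfo=timezone.utc)},
]
last_run = datetime(2022, 8, 1, 10, 0, tzinfo=timezone.utc)
now = datetime(2022, 8, 1, 11, 0, tzinfo=timezone.utc)

changed, watermark = hourly_copy(table, last_run, now)
changed_ids = [r["id"] for r in changed]
```

Only the rows touched since the last run are selected, so the amount of data moved per run scales with the change rate rather than with the table size.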

The solution is scalable.

It's a stable product.

What needs improvement?

The primary language for developers on Spark is Scala. Java is now supported as well. I prefer Java to Scala, and since both are supported, that is good. I know there is always discussion about which language to write applications in, and some people do love Scala. However, I don't like it.

They currently support a JDK version which is a little bit old. Not all features are available on it. Maybe they should update the supported JDK version.

For how long have I used the solution?

I've used the solution for a year and a half. 

What do I think about the stability of the solution?

The solution is stable. There are no bugs or glitches. It doesn't crash or freeze. 

What do I think about the scalability of the solution?

The product scales well. It's fine to expand if needed. 

Many teams use Spark. For example, we have a few kinds of pipelines, huge pipelines. One of them processes 300 billion events each day. It's our core technology currently.

We do not plan to increase usage. We keep our legacy system on Spark, and we are now discussing Flink and Spark and what we would prefer. However, most of the people are already migrating new systems to Flink. We will keep Spark for a few more years still. 

How are customer service and support?

We have an internal team that participates in the development of Spark. They are Spark contributors, and if we have problems, we turn to them. They are our own people, yet they work on Spark. Generally, if the problem is more minor, we look at some sites, join discussions about Spark, or ask internal colleagues who have experience with Spark.

Which solution did I use previously and why did I switch?

We also use Flink.

Before Spark, I worked at another company where we used different technologies, including Kafka, Redis, PostgreSQL, S3, and Spring.

How was the initial setup?

I didn't handle the initial setup. We were using this pipeline and clusters already. I just installed it on my local server. However, in terms of difficulty, I didn't see any problem. The deployment might only take a few hours. 

I found some documentation. I got the documentation from the site and downloaded the archive and unzipped it, and installed it. I can't say that I installed something from a special configuration. I just installed a few nodes for debugging and for running locally, and that's all. Also, in one case I used, for example, a Docker configuration with Spark. It all worked fine.

What's my experience with pricing, setup cost, and licensing?

It's an open-source product. I don't know much about the licensing aspect. 

Which other solutions did I evaluate?

We have compared Flink and Spark as two possible options. 

What other advice do I have?

I can recommend the product. It's a nice system for batch processing huge data.

I'd rate the solution eight out of ten. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Atif Tariq - PeerSpot reviewer
Cloud and Big Data Engineer | Developer at a tech vendor with 10,001+ employees
Real User
Nov 29, 2023
A scalable solution that can be used for data computation and building data pipelines
Pros and Cons
  • "The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast."
  • "Apache Spark should add some resource management improvements to the algorithms."

What is our primary use case?

Apache Spark is used for data computation, building data pipelines, or building analytics on top of batch data. Apache Spark is used to handle big data efficiently.

What is most valuable?

The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast.

What needs improvement?

Apache Spark should add some resource management improvements to the algorithms. That way, the solution could manage skew more efficiently in the physical and logical plans when you are joining different data sets.

For how long have I used the solution?

I have been working with Apache Spark for six to seven years.

What do I think about the stability of the solution?

Apache Spark is a very stable solution. The community is still working on other parts, like performance and removing bottlenecks. However, from a stipulative point of view, the solution's stability is very good.

I rate Apache Spark a nine out of ten for stability.

What do I think about the scalability of the solution?

Apache Spark is a scalable solution. More than 50 to 100 users are using the solution in our organization.

How are customer service and support?

Apache Spark's technical support team responds on time.

How would you rate customer service and support?

Positive

How was the initial setup?

The solution’s initial setup is very easy.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises.

What other advice do I have?

I would recommend Apache Spark to users doing analytics, data computation, or pipelines.

Overall, I rate Apache Spark ten out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Lucas Dreyer - PeerSpot reviewer
Data Engineer at a tech vendor with 1,001-5,000 employees
Real User
Top 5 Leaderboard
Oct 30, 2023
A reliable and scalable open-source framework for big data processing that excels in speed, fault tolerance, and support for various data sources
Pros and Cons
  • "It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."
  • "One limitation is that not all machine learning libraries and models support it."

What is our primary use case?

We use it for data engineering and analytics to process and examine extensive datasets.

What is most valuable?

It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained.

What needs improvement?

One limitation is that not all machine learning libraries and models support it. While libraries like Scikit-learn may work with some Spark-compatible models, not all machine-learning tools are compatible with Spark. In such cases, you may need to extract data from Spark and train your models on smaller datasets instead of directly using Spark for training.
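One way to obtain the smaller training set mentioned above is to draw a uniform random sample in a single pass over the data. The sketch below uses reservoir sampling in plain Python; this is a technique chosen here for illustration, not something the reviewer names, and the function name, seed, and dataset are hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of unknown size."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a dataset too large to train on directly.
big_dataset = range(100_000)
training_subset = reservoir_sample(big_dataset, k=1000)
```

The resulting subset fits in memory and could then be handed to a single-machine library for training, at the cost of not using the full dataset.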

For how long have I used the solution?

I have been using it for four years.

What do I think about the stability of the solution?

I have not encountered any significant stability issues and it has proven to be a robust and reliable platform without major crashes. However, there have been instances where I needed to address query optimization and similar tasks to ensure optimal performance. I would rate it nine out of ten.

How are customer service and support?

To rate my overall experience, I would give it an eight out of ten, leaving room for potential improvements in terms of technical support.

How would you rate customer service and support?

Positive

Which solution did I use previously and why did I switch?

We used Pandas data frames and SQL-type queries for smaller datasets, but we haven't worked with anything on the scale of Spark SQL.

How was the initial setup?

I haven't handled the deployment process, but setting it up on the cloud seems relatively straightforward.

What about the implementation team?

Setting it up on-premises might take longer, potentially a couple of days. However, when deploying it on the cloud, the process can be significantly quicker, possibly taking only a few hours.

What's my experience with pricing, setup cost, and licensing?

The cloud model can be expensive as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing. Managing costs in a cloud environment can be challenging due to the cumulative expenses associated with running and maintaining Spark. Licensing costs may not be the primary concern, but operational costs in the cloud can add up. For on-premises deployments, maintenance costs include cluster management, job optimization, and upgrades. In the cloud, maintenance costs are relatively lower, especially with managed database clusters, but they still exist and primarily revolve around cluster upkeep.

Which other solutions did I evaluate?

We evaluated Microsoft Synapse, which offers similar analytics functionality but not quite at the same scale as Apache Spark as a whole. While some tasks can be accomplished with Synapse on Azure, there are certain features and capabilities, such as micro-batching and scalability, at which Spark excels and remains unmatched.

What other advice do I have?

Additional skill requirements are crucial to use the solution and its related features effectively. Training costs and efforts may be necessary to ensure individuals are proficient in using these technologies. Overall, I would rate it nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
reviewer1759647 - PeerSpot reviewer
Information Technology Business Analyst at an aerospace/defense firm with 10,001+ employees
Real User
Jul 25, 2023
A highly scalable and affordable tool that can be used to gather information from different systems
Pros and Cons
  • "The product is useful for analytics."
  • "The product could improve the user interface and make it easier for new users."

What is most valuable?

We use it as an ETL tool to gather information from different systems. The product is useful for analytics.

What needs improvement?

The product could improve the user interface and make it easier for new users. It has a steep learning curve.

For how long have I used the solution?

I have been using the product for approximately three to four years. Currently, I am using the latest version.

What do I think about the stability of the solution?

The tool is stable. I rate the stability a ten out of ten.

What do I think about the scalability of the solution?

The tool is very scalable. I rate the scalability a ten out of ten. Approximately 30 users are using Apache Spark in our organization.

How are customer service and support?

We are using the free version of the product. So, we are not using any support.

How would you rate customer service and support?

Positive

How was the initial setup?

The basic installation is easy. However, we are working in the security business and need a very secure installation. It has been quite difficult. I rate the basic installation a ten out of ten. I rate the ease of setup a two or three out of ten for a more secure installation with all the security features. The solution is deployed on-premises in our organization. The deployment process requires a couple of weeks.

What's my experience with pricing, setup cost, and licensing?

We are using the free version of the solution.

What other advice do I have?

I would recommend the product. I think it's a good solution for analytics. Overall, I rate the product an eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
CTO at a tech services company with 1-10 employees
Real User
Top 20
Dec 20, 2023
Provides a valuable implementation of distributed data processing with a simple setup process
Pros and Cons
  • "Apache Spark provides a very high-quality implementation of distributed data processing."
  • "There were some problems related to the product's compatibility with a few Python libraries."

What is our primary use case?

We use the product for real-time data analysis.

What is most valuable?

Apache Spark provides a very high-quality implementation of distributed data processing. I rate it 20 on a scale of one to ten.

What needs improvement?

There were some problems related to the product's compatibility with a few Python libraries. But I suppose they are fixed.

For how long have I used the solution?

We have been using Apache Spark for the last two to three years.

What do I think about the stability of the solution?

I rate the product's stability a ten out of ten.

What do I think about the scalability of the solution?

The product is enormously scalable.

How was the initial setup?

The initial setup process is simple if you are a good professional. You have to select a few parameters and press enter. It is already integrated into the Databricks platform. One person is enough to manage small and medium implementations.

What's my experience with pricing, setup cost, and licensing?

It is an open-source platform. We do not pay for its subscription.

Which other solutions did I evaluate?

We are evaluating a few analytics engineering and DBT solutions. For now, Spark is in the secondary position.

What other advice do I have?

I recommend Apache Spark for batch analytics features.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
PLC Programmer at an engineering company with 1-10 employees
Real User
Dec 1, 2023
Highly-recommended robust solution for data processing
Pros and Cons
  • "I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten."
  • "The solution’s integration with other platforms should be improved."

What is our primary use case?

We are a software solutions company that serves a variety of industries, including banking, insurance, and industrial sectors. The product is specifically employed for managing data platforms for our customers.


What is most valuable?

The solution, as a package, excels across the board. I appreciate everything, not just one or two specific features.


What needs improvement?

The solution’s integration with other platforms should be improved.


For how long have I used the solution?

I have been using the solution for the past eight years. Currently, I’m using the latest version of the solution.


What do I think about the stability of the solution?

The solution is highly stable. I rate it a perfect ten.


What do I think about the scalability of the solution?

The solution is highly scalable. I rate it a perfect ten.


How was the initial setup?

The initial setup was straightforward and was conducted on the cloud. The entire deployment process took just 15 minutes. The deployment process involves provisioning the compute component using Terraform.


What's my experience with pricing, setup cost, and licensing?

The solution is affordable and there are no additional licensing costs.


What other advice do I have?

I recommend using the solution. Overall, I rate the solution a perfect ten.


Disclosure: My company has a business relationship with this vendor other than being a customer. Partner
PeerSpot user
Lokesh Jayanna - PeerSpot reviewer
Vice President at Goldman Sachs at a computer software company with 10,001+ employees
Real User
Nov 26, 2023
Stable product with a valuable SQL tool
Pros and Cons
  • "The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
  • "At the initial stage, the product provides no container logs to check the activity."

What is our primary use case?

We use the product for extensive data analysis. It helps us analyze a huge amount of data and transfer it to data scientists in our organization.

What is most valuable?

The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it. It is a useful feature for us.

What needs improvement?

At the initial stage, the product provides no container logs to check the activity. It remains inactive for a long time without giving us any information. The containers should start quickly, as they do with Jupyter Notebook.

For how long have I used the solution?

We have been using Apache Spark for eight months to one year.

What do I think about the stability of the solution?

It is a stable product. I rate its stability an eight out of ten.

What do I think about the scalability of the solution?

We have 45 Apache Spark users. I rate its scalability a nine out of ten.

How was the initial setup?

The complexity of the initial setup depends on the kind of environment an organization is working with. It requires one executive for deployment. I rate the process an eight out of ten.

What's my experience with pricing, setup cost, and licensing?

The product is expensive, considering the setup. However, from a standalone perspective, it is inexpensive.

What other advice do I have?

I advise others to analyze their data and understand their business requirements before purchasing the product. I rate it an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
Updated: February 2026