Head of Data Science Center of Excellence at Ameriabank CJSC
Real User
Top 5 Leaderboard
Enhanced data processing with good support and helpful integration with Pandas syntax in distributed mode
Pros and Cons
  • "The most significant advantage of Spark 3.0 is its support for Pandas UDFs on DataFrames."
  • "The main concern is the overhead of Java when distributed processing is not necessary."

What is our primary use case?

The primary use case for Apache Spark is processing big data in memory with a distributed engine. We use it for tasks such as running the association rules algorithm in Spark ML, running XGBoost in parallel on the Spark engine, and preparing data for online machine learning with Spark Streaming.

How has it helped my organization?

The most significant cost savings come from the operational side, because Spark is a very common, well-established technology in operations. There are many experts available in the market to operate Spark, making it easier to find the right personnel. It is quite mature, which reduces operating costs.

What is most valuable?

The most significant advantage of Spark 3.0 is its support for Pandas UDFs on DataFrames. This allows running Pandas code distributed across the Spark engine, which is a crucial feature. The integration with Pandas syntax in distributed mode, along with user-defined functions in PySpark, is particularly valuable.
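A minimal sketch of the Pandas UDF feature praised here. The function body is plain pandas; PySpark only wraps it so Spark can call it on column batches. The column name `temp_c` and the unit conversion are illustrative, not from the review.

```python
import pandas as pd

# The UDF body is ordinary vectorized pandas code.
def to_fahrenheit(celsius: pd.Series) -> pd.Series:
    return celsius * 9 / 5 + 32

# On a live Spark 3.x install the same function becomes a distributed UDF:
# from pyspark.sql.functions import pandas_udf
# to_fahrenheit_udf = pandas_udf(to_fahrenheit, returnType="double")
# df.select(to_fahrenheit_udf("temp_c")).show()

# Local check with plain pandas:
print(list(to_fahrenheit(pd.Series([0.0, 100.0]))))  # → [32.0, 212.0]
```

This is exactly why the feature matters: the same pandas code runs unchanged whether applied locally or distributed across executors.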

What needs improvement?

The main concern is the JVM overhead when distributed processing is not necessary. In such cases, operations can often be done on a single node, making Spark's distributed mode unnecessary; alternatives like DuckDB are then preferable and can be two to five times faster.

Buyer's Guide
Apache Spark
March 2025
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: March 2025.
839,422 professionals have used our research since 2012.

For how long have I used the solution?

I have more than ten years of experience using Spark, starting from when it was first introduced.

What do I think about the stability of the solution?

Spark is very stable for our needs. It offers amazing stability.

What do I think about the scalability of the solution?

Scalability depends on how the infrastructure is organized; load balancing and network considerations matter. However, Spark is very stable when scaled appropriately.

How are customer service and support?

Customer support for Apache Spark is very good. There is a lot of documentation and forums available, making it easier to find solutions. The Databricks team also does a lot to support Spark.

How would you rate customer service and support?

Positive

How was the initial setup?

The initial setup of Spark can take about a week, assuming the right infrastructure is already in place.

What about the implementation team?

A few technicians are typically required for installation and configuration. SRE or operations engineers handle the setup, as they need to understand the installation and configuration details. Maintenance usually requires just one SRE or operations engineer.

What was our ROI?

The main benefit in terms of ROI comes from the operation side. Spark’s operational costs are lower due to the availability of experts and its maturity. However, performance costs might be higher due to the need for more memory and infrastructure.

What's my experience with pricing, setup cost, and licensing?

Compared to other solutions like DuckDB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud solutions like Databricks can simplify the process, they may also be less cost-efficient.

What other advice do I have?

I'd rate the solution eight out of ten. 

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other
Disclosure: I am a real user, and this review is based on my own experience and opinions.
reviewer2534727 - PeerSpot reviewer
Manager Data Analytics at a consultancy with 10,001+ employees
Real User
A flexible solution with real-time processing capabilities
Pros and Cons
  • "I like Apache Spark's flexibility the most. Before, we had one server that would choke up. With the solution, we can easily add more nodes when needed. The machine learning models are also really helpful. We use them to predict energy theft and find infrastructure problems."
  • "For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial."

What is our primary use case?

We use the solution to extract data from our sensors. We have lots of data streaming into our system, which used to get overwhelmed. We use Apache Spark to handle real-time streaming and do machine learning to predict supply and demand in the market and adjust operations.

What is most valuable?

I like Apache Spark's flexibility the most. Before, we had one server that would choke up. With the solution, we can easily add more nodes when needed. The machine learning models are also really helpful. We use them to predict energy theft and find infrastructure problems.

The tool's real-time processing has had a big impact. We used to get data from sensors after a month; now we get it in less than 10 minutes, which helps us take quick action.

We use Apache Spark to map our data pipelines using MapReduce technology. We're also working on integrating tools like Hive with Apache Spark to distribute our data processing. We can also integrate other tools like Apache Kafka and Hadoop.
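As a hedged sketch of the kind of Kafka-to-Spark Structured Streaming hookup described above — the broker address and topic name are hypothetical, and the Spark calls are shown in comments because they need a live cluster:

```python
# Source options for Spark's Kafka reader (standard option names from the
# Structured Streaming Kafka integration; values here are made up).
kafka_options = {
    "kafka.bootstrap.servers": "broker-1:9092",
    "subscribe": "sensor-readings",
    "startingOffsets": "latest",
}

# On a cluster with the spark-sql-kafka package available:
# stream = spark.readStream.format("kafka").options(**kafka_options).load()
# query = (stream.selectExpr("CAST(value AS STRING) AS payload")
#                .writeStream.format("console").start())
```

The `startingOffsets` choice is the usual trade-off: `latest` for live dashboards, `earliest` when replaying history matters.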

We faced some challenges when integrating the solution into our existing system, but good documentation helped solve them.

What needs improvement?

For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial.

For how long have I used the solution?

I have been working with the product for five years. 

What do I think about the stability of the solution?

Apache Spark is stable. 

What do I think about the scalability of the solution?

We're a big company with about 4 million consumers. We handle huge amounts of data—around 30,000 sensors send data every 15 minutes, which adds up to 5-10 terabytes per day.

Which solution did I use previously and why did I switch?

Before Apache Spark, we had a different solution - a traditional system with one server handling everything, more like a data warehouse. We switched to Apache Spark because we needed real-time visibility in our operations.

How was the initial setup?

The initial setup process was challenging. We tried to do it ourselves at first, but we weren't used to distributed computing systems, creating nodes, and distributing data. Later, we engaged consulting groups that specialized in it. This is why there's a specific learning curve—it would be challenging for a company to start alone.

The initial deployment took us about six to eight months. We started with three people involved in the deployment process and later increased to five. From a maintenance point of view, it's pretty smooth now. It's not difficult to maintain and doesn't require much maintenance.

What was our ROI?

The tool has helped us reduce costs that run into billions of dollars yearly. The ROI is very significant for us.

Which other solutions did I evaluate?

We did evaluate other options. We started by looking at open-source Hadoop deployment, thinking we'd bring data into HDFS and do machine learning separately. But that would have been a hassle, so Apache Spark was a better fit.

What other advice do I have?

I rate the overall solution a seven out of ten. I would recommend Apache Spark to other users, but it depends on their use cases. I advise new users to get an expert involved from the start.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Atal Upadhyay - PeerSpot reviewer
AVP at MIDDAY INFOMEDIA LIMITED
Real User
Top 5 Leaderboard
Allows us to consume data from any data source and has remarkable processing power
Pros and Cons
  • "With Spark, we parallelize our operations, efficiently accessing both historical and real-time data."
  • "It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework."

What is our primary use case?

We pull data from various sources and process it for reporting purposes, utilizing a prominent visual analytics tool.

How has it helped my organization?

Our experience with using Spark for machine learning and big data analytics allows us to consume data from any data source, including freely available data. The processing power of Spark is remarkable, making it our top choice for file-processing tasks.

Utilizing Apache Spark's in-memory processing capabilities significantly enhances our computational efficiency. Unlike with Oracle, where customization is limited, we can tailor Spark to our needs. This allows us to pull data, perform tests, and save processing power. We maintain a historical record by loading intermediate results and retrieving data from previous iterations, ensuring our applications operate seamlessly. With Spark, we parallelize our operations, efficiently accessing both historical and real-time data.

We utilize Apache Spark for our data analysis tasks. Our data processing pipeline starts with receiving data in raw format. We use a data factory to create pipelines for processing, ensuring the data is prepared and ready for various purposes, such as supporting applications or analysis.

There are instances where we perform data cleansing operations and manage the database, including indexing. We've implemented automated tasks to analyze data and optimize performance, focusing specifically on database operations. These efforts are independent of the Spark platform but contribute to enhancing overall performance.

What needs improvement?

It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework.

For how long have I used the solution?

I've been engaged with Apache Spark for about a year now, but my company has been utilizing it for over a decade.

What do I think about the stability of the solution?

It offers a high level of stability. I would rate it nine out of ten.

What do I think about the scalability of the solution?

It scales out across data nodes, making it highly scalable. It caters to a user base of around five thousand.

How was the initial setup?

The initial setup isn't complicated, but it varies from person to person. For me, it wasn't particularly complex; it was straightforward to use.

What about the implementation team?

Once the solution is prepared, we deploy it onto both the staging server and the production server. Previously, we had a dedicated individual responsible for deploying the solution across multiple machines. We manage three environments: development, staging, and production. The deployment process varies, sometimes utilizing a tenant model and other times employing blue-green deployment, depending on the situation. This ensures the seamless setup of servers and facilitates smooth operations.

What other advice do I have?

Given our extensive experience with it and its ability to meet all our requirements over time, I highly recommend it. Overall, I would rate it nine out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
UjjwalGupta - PeerSpot reviewer
Module Lead at Mphasis
Real User
Top 5
Helps to build ETL pipelines and load data into warehouses
Pros and Cons
  • "The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it lets us easily handle large volumes of data."
  • "Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial."

What is our primary use case?

We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.
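A minimal sketch of one transform step in such an ETL pipeline, with the distributed PySpark/Delta equivalent in comments. The column names (`reading`, `sensor_id`), bucket path, and warehouse path are hypothetical, not from the review.

```python
from typing import Optional

# Single-record transform: drop records with missing readings, normalize ids.
def clean_record(rec: dict) -> Optional[dict]:
    if rec.get("reading") is None:
        return None
    return {**rec, "sensor_id": str(rec["sensor_id"]).strip().upper()}

# The same logic distributed, writing Delta Lake format as the review describes:
# df = spark.read.json("s3://bucket/raw/")                              # extract
# df = (df.dropna(subset=["reading"])
#         .withColumn("sensor_id", F.upper(F.trim("sensor_id"))))       # transform
# df.write.format("delta").mode("append").save("/warehouse/sensors")    # load
```

Keeping the transform as a small pure function makes it easy to unit-test before wiring it into the distributed pipeline.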

What is most valuable?

The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it lets us easily handle large volumes of data.

What needs improvement?

Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial.

For how long have I used the solution?

I have been using the product for six years. 

What do I think about the stability of the solution?

Apache Spark is generally considered a stable product, with rare instances of breaking down. Issues may arise with sudden increases in data volume, leading to memory errors, but these can typically be managed with autoscaling clusters. Additionally, schema changes or irregularities in streaming data may pose challenges, though these could be addressed in future versions.

What do I think about the scalability of the solution?

About 70-80 percent of employees in my company use the product. 

How are customer service and support?

We haven't contacted Apache Spark support directly because it's an open-source tool. However, when using it as a product within Databricks, we've contacted Databricks support for assistance.

Which solution did I use previously and why did I switch?

The main reason our company opted for the product is its capability to process large volumes of data. While other options like Snowflake offer some advantages, they may have limitations regarding custom logic or modifications.

How was the initial setup?

The solution's setup and installation of Apache Spark can vary in complexity depending on whether it's done in a standalone or cluster environment. The process is generally more straightforward in a standalone setup, especially if you're familiar with the concepts involved. However, setting up in a cluster environment may require more knowledge about clusters and networking, making it potentially more complex.
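The standalone-versus-cluster contrast can be sketched with Spark's two standard master-URL conventions; the host name, app names, and memory size below are illustrative assumptions:

```python
# Standalone (single machine): "local[*]" uses all local cores.
standalone_conf = {
    "spark.master": "local[*]",
    "spark.app.name": "etl-dev",
}

# Cluster mode: point at a cluster manager (here Spark's own standalone
# manager on its default port) and size the executors.
cluster_conf = {
    "spark.master": "spark://head-node:7077",
    "spark.app.name": "etl-prod",
    "spark.executor.memory": "8g",
}

# On a live install:
# builder = SparkSession.builder
# for k, v in cluster_conf.items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```

The code is identical either way; only the configuration changes, which is why the standalone path is the easier place to learn.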

What's my experience with pricing, setup cost, and licensing?

The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks.

What other advice do I have?

If you're new to Apache Spark, the best way to learn is by using the Databricks Community Edition. It provides a cluster for Apache Spark where you can learn and test. I rate the product an eight out of ten.

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Suriya Senthilkumar - PeerSpot reviewer
Analyst at Deloitte
Real User
Top 10
Processes a larger volume of data efficiently and integrates with different platforms
Pros and Cons
  • "The product’s most valuable features are lazy evaluation and workload distribution."
  • "They could improve the issues related to programming language for the platform."

What is our primary use case?

We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.

What is most valuable?

The product’s most valuable features are lazy evaluation and workload distribution.
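Lazy evaluation means transformations only record a plan; nothing executes until an action is called. A pure-Python analogy (this is not Spark's actual implementation, just a sketch of the behavior):

```python
class LazyDataset:
    """Toy illustration of Spark-style lazy evaluation."""

    def __init__(self, data):
        self._data = data
        self._plan = []              # recorded transformations, not yet run

    def map(self, fn):               # transformation: only appends to the plan
        self._plan.append(("map", fn))
        return self

    def filter(self, pred):          # transformation: only appends to the plan
        self._plan.append(("filter", pred))
        return self

    def collect(self):               # action: executes the whole plan at once
        out = self._data
        for kind, fn in self._plan:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has run yet; like Spark's .collect()/.count(), the action triggers it.
print(ds.collect())  # → [20, 30, 40]
```

Deferring execution this way is what lets Spark see the whole pipeline and optimize it before touching the data.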

What needs improvement?

They could improve the issues related to programming language for the platform. 

For how long have I used the solution?

We have been using Apache Spark for around two and a half years.

What do I think about the stability of the solution?

The platform’s stability depends on how effectively we write the code. We encountered a few issues related to programming languages.

What do I think about the scalability of the solution?

We have more than 100 Apache Spark users in our organization.

Which solution did I use previously and why did I switch?

Before choosing Apache Spark for processing big data, we evaluated another option, Hadoop. However, Spark emerged as the superior choice.

How was the initial setup?

The initial setup complexity depends on whether it's on the cloud or on-premise. For cloud deployments, especially using platforms like Databricks, the process is straightforward and can be configured with ease. However, if the deployment is on-premise, the setup tends to be more time-consuming, although not overly complex.

What's my experience with pricing, setup cost, and licensing?

They provide an open-source license for the on-premise version. However, we have to pay for the cloud version including data centers and virtual machines.

What other advice do I have?

Apache Spark is a good product for processing large volumes of data compared to other distributed systems. It provides efficient integration with Hadoop and other platforms.

I rate it a ten out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
Hamid M. Hamid - PeerSpot reviewer
Data architect at Banking Sector
Real User
Top 5 Leaderboard
Along with the easy cluster deployment process, the tool also has the ability to process huge datasets
Pros and Cons
  • "The deployment of the product is easy."
  • "Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users."

What is our primary use case?

In my company, the solution is used for batch processing or real-time processing.

What needs improvement?

The product is mature at this point. The product's interoperability is an area of concern where improvements are required.

Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users.

For how long have I used the solution?

I have been using Apache Spark for three years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution a nine out of ten.

What do I think about the scalability of the solution?

It is a very scalable solution. Scalability-wise, I rate the solution a nine out of ten.

There is no fixed number of users for Apache Spark in my company, since it is used as a processing engine.

How are customer service and support?

Apache Spark is an open-source tool, so the only support users can get for the tool is from different vendors like Cloudera or HPE.

Which solution did I use previously and why did I switch?

In the past, my company has used certain ETL tools, like Informatica, based on the performance levels offered.

How was the initial setup?

The deployment of the product is easy.

Apache Spark's cluster deployment process is very easy.

Only the application needs a deployment process to run on Apache Spark; Spark itself is set up once. Deploying an application is easy as a user: you write the code in Scala, submit it to the cluster, and the deployment is done in one step.

The solution is deployed on an on-premises model.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source tool. It is not an expensive product.

What other advice do I have?

The tool is used for real-time data analytics as it is very powerful and reliable. Code written for Apache Spark runs with stability; the bugs that appear usually come from your own Java or Scala code rather than from Spark itself, which is amazing. Apache Spark is very reliable, powerful, and fast as an engine. Compared with a competitor like MapReduce, Apache Spark performs up to 100 times better.

The monitoring part of the product is good.

The product offers resilient clusters that can run on multiple nodes.

The tool can run with multiple clusters.

The integration capabilities of the product with other platforms to improve our company's workflow are good.

In terms of the improvements in the product in the data analysis area, new libraries have been launched to support AI and machine learning.

My company is able to process huge datasets with Apache Spark. There is a huge value added to the organization because of the tool's ability to process huge datasets.

I rate the overall solution a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Atif Tariq - PeerSpot reviewer
Cloud and Big Data Engineer | Developer at Huawei Cloud Middle East
Real User
Top 5 Leaderboard
A scalable solution that can be used for data computation and building data pipelines
Pros and Cons
  • "The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast."
  • "Apache Spark should add some resource management improvements to the algorithms."

What is our primary use case?

Apache Spark is used for data computation, building data pipelines, or building analytics on top of batch data. Apache Spark is used to handle big data efficiently.

What is most valuable?

The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast.

What needs improvement?

Apache Spark should add some resource management improvements to the algorithms. That way, the solution could manage data skew more efficiently in its physical and logical plans when joining different data sets.
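One common workaround for join skew of this kind is key salting: spreading a hot key across several synthetic partitions. A hedged sketch — the salt count of 4 is arbitrary, and the PySpark lines in comments are one typical formulation, not the reviewer's code:

```python
import random

def salt_key(key: str, n_salts: int = 4) -> str:
    """Append a random salt so one hot key spreads across n_salts partitions."""
    return f"{key}#{random.randrange(n_salts)}"

# The same idea on the big side of a PySpark join:
# big = big.withColumn("salted",
#         F.concat("key", F.lit("#"),
#                  (F.rand() * 4).cast("int").cast("string")))
# The small side is replicated once per salt value before joining on "salted".
```

Salting trades a little extra shuffle volume on the small side for even task sizes on the big side, which is usually the better deal when one key dominates.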

For how long have I used the solution?

I have been working with Apache Spark for six to seven years.

What do I think about the stability of the solution?

Apache Spark is a very stable solution. The community is still working on other aspects, like performance and removing bottlenecks, but from a stability standpoint, the solution is very good.

I rate Apache Spark a nine out of ten for stability.

What do I think about the scalability of the solution?

Apache Spark is a scalable solution. Around 50 to 100 users are using the solution in our organization.

How are customer service and support?

Apache Spark's technical support team responds on time.

How would you rate customer service and support?

Positive

How was the initial setup?

The solution’s initial setup is very easy.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises.

What other advice do I have?

I would recommend Apache Spark to users doing analytics, data computation, or pipelines.

Overall, I rate Apache Spark ten out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
reviewer2150616 - PeerSpot reviewer
Lead Data Scientist at a transportation company with 51-200 employees
Real User
Top 5
Offers user-friendliness, clarity and flexibility
Pros and Cons
  • "The product's initial setup phase was easy."
  • "From my perspective, the only thing that needs improvement is the interface, as it was not easily understandable."

What needs improvement?

The only issue I faced with the tool was in choosing compute to support parallel processing; it should scale out horizontally more easily. The tool should be more scalable, not in terms of adding CPU, but in terms of units: if two units are not enough, a third or fourth unit should be able to come into the picture.

From my perspective, the only thing that needs improvement is the interface, as it was not easily understandable. Sometimes I get an error saying it is an RDD-related error, and it becomes difficult to understand where things went wrong. When I deal with datasets using the Pandas library in Python, I can apply functions to each column and get a transformation of the column. When I try to do the same thing with Apache Spark, it works, but it is not straightforward; I need to handle it a little differently, and even then it sometimes throws an error about looping back to the same operation — errors I never got in Pandas.

In future updates, the tool should be made more user-friendly. I want to take fifty parallel processes rather than one, and I want to pick some particular columns to be split by partition, so if the tool is user-friendly and offers clarity and flexibility, then that will be good.

For how long have I used the solution?

I have been using Apache Spark for four years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution a nine out of ten. The only issues with the tool revolve around user interaction and user flexibility.

What do I think about the scalability of the solution?

It is a scalable solution. Scalability-wise, I rate the solution an eight out of ten.

Around five people in my company use the tool.

How are customer service and support?

The solution's technical support is helpful. For generic issues, I mostly get answers from forums where the problem has already been resolved; for non-generic issues specific to my work, I get help from the tool's team, but that support takes time. I rate the technical support a seven out of ten.

How would you rate customer service and support?

Neutral

Which solution did I use previously and why did I switch?

I only work with Apache Spark.

How was the initial setup?

The product's initial setup phase was easy.

I managed the product's installation phase, both locally and on the cloud.

The solution is deployed on-premises.

The solution can be deployed in two to three hours.

What was our ROI?

Apache Spark has helped save 50 percent of our operational costs. Processing time was reduced with the tool, though compute costs increased. Overall, I can see that the tool's use has led to a 50 percent reduction in costs.

What's my experience with pricing, setup cost, and licensing?

I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten.

Which other solutions did I evaluate?

Previously, I was more of a Python full-stack developer, and I was happy dealing with PySpark libraries, which gave me an edge in continuing the work with Apache Spark.

What other advice do I have?

Speaking about Apache Spark's use in our company's data processing workflows: when we deal with large datasets, a data frame holding one year of data used to take me 45 minutes to an hour to process without Spark, and I sometimes got out-of-memory errors. Those issues disappeared once I started using Apache Spark; the whole processing finished in less than five minutes with no memory issues.

For big data processing, the tool's parallel processing and time savings have been helpful. When I apply a function, I can write the code once and have it run in parallel. Basically, I used Apache Spark to forecast multiple units at the same time; without it, I would be doing that one by one, a serial process that used to take me around five hours. With Apache Spark, the computation happens in parallel and is cut down by at least 90 percent, which significantly reduces the time needed for operations.
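This per-unit pattern maps naturally to Spark's grouped-map API. A sketch where the "forecast" (carry the last value forward) is a hypothetical stand-in for the reviewer's real model, and the column names are made up:

```python
import pandas as pd

# Runs once per unit; receives that unit's rows as a plain pandas DataFrame.
def forecast_unit(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("t")
    return pd.DataFrame({"unit": [pdf["unit"].iloc[0]],
                         "forecast": [pdf["value"].iloc[-1]]})

# With Spark, all units are processed in parallel instead of serially:
# result = sdf.groupBy("unit").applyInPandas(
#     forecast_unit, schema="unit string, forecast double")

# Local check of the same function with plain pandas:
df = pd.DataFrame({"unit": ["a", "a", "b"],
                   "t": [1, 2, 1],
                   "value": [5.0, 7.0, 3.0]})
out = pd.concat([forecast_unit(g) for _, g in df.groupby("unit")],
                ignore_index=True)
```

Because the function only sees one unit's data at a time, moving from the serial pandas loop to the parallel Spark version requires no change to the model code itself.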

The tool's real-time processing is an area that I have not tried to use much. When it comes to real-time processing of my data, I use Kafka.

I am handling data governance using Databricks Unity Catalog.

When I try to apply an ML model to a table partitioned by a particular column, I cannot run it across all partitions and have to get the job done with a reduced number of partitions; with five partitions, I get at least three to four times the benefit in less time.

Regular maintenance exists, but it is not like I have to sit week by week and upgrade a patch or something like that. The maintenance is done mostly in about six months to a year.

I take care of the tool's maintenance.

I recommend the tool to others.

I rate the tool an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.