Head of Data Science center of excellence at a financial services firm with 501-1,000 employees
Real User
Top 5 Leaderboard
Sep 25, 2024
Enhanced data processing with good support and helpful integration with Pandas syntax in distributed mode
Pros and Cons
  • "The most significant advantage of Spark 3.0 is its support for DataFrame UDF Pandas UDF features."
  • "The main concern is the overhead of Java when distributed processing is not necessary."

What is our primary use case?

The primary use case for Apache Spark is processing big data in memory with a distributed engine. It is used for various tasks such as running the association rules algorithm in Spark ML, running XGBoost in parallel on the Spark engine, and preparing data for online machine learning using Spark Streaming.
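A minimal sketch of the association-rules use case, assuming Spark 3.x with PySpark available; the basket data and column names are illustrative, not from the review:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("association-rules").getOrCreate()

# Each row holds one basket of items.
baskets = spark.createDataFrame(
    [(0, ["bread", "milk"]), (1, ["bread", "butter"]), (2, ["milk", "butter", "bread"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(baskets)

model.freqItemsets.show()       # frequent itemsets found across the baskets
model.associationRules.show()   # rules derived from those itemsets
spark.stop()
```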

How has it helped my organization?

The most significant cost savings come from the operational side because Spark is very common in operations. There are many experts available in the market to operate Spark, making it easier to find the right personnel. It is also quite mature, which reduces operating costs.

What is most valuable?

The most significant advantage of Spark 3.0 is its support for Pandas UDFs on DataFrames. This allows running Pandas code distributed by the Spark engine, which is a crucial feature. The integration with Pandas syntax in distributed mode, along with user-defined functions in PySpark, is particularly valuable.
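A minimal sketch of the Pandas UDF feature described here, assuming Spark 3.x with PyArrow installed; the column name and transformation are illustrative:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.range(0, 1_000_000).withColumnRenamed("id", "celsius")

@pandas_udf("double")
def fahrenheit(c: pd.Series) -> pd.Series:
    # Plain Pandas logic; Spark ships it to executors in Arrow batches.
    return c * 9.0 / 5.0 + 32.0

df.select(fahrenheit("celsius").alias("fahrenheit")).show(5)
spark.stop()
```

Spark splits the column into Arrow batches, runs the ordinary Pandas function on each executor, and reassembles the result, which is what makes "Pandas syntax in distributed mode" possible.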

What needs improvement?

The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary. Consequently, alternatives like DuckDB are preferable, and in some cases they are two to five times faster.
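When the data fits on one node but the Spark API is still wanted, local mode sidesteps the cluster entirely. A minimal sketch (the app name is illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")            # run everything in one JVM, using all local cores
    .appName("single-node-job")
    .getOrCreate()
)

spark.range(10).show()
spark.stop()
```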


For how long have I used the solution?

I have more than ten years of experience using Spark, starting from when it was first introduced.

What do I think about the stability of the solution?

Spark is very stable for our needs. It offers amazing stability.

What do I think about the scalability of the solution?

Scalability depends on how the infrastructure is organized; load balancing and the network need consideration. However, Spark is very stable when scaled appropriately.

How are customer service and support?

Customer support for Apache Spark is very good. There is a lot of documentation and forums available, making it easier to find solutions. The Databricks team also does a lot to support Spark.

How would you rate customer service and support?

Positive

How was the initial setup?

The initial setup of Spark can take about a week, assuming the right infrastructure is already in place.

What about the implementation team?

A few technicians are typically required for installation and configuration. SRE or operations engineers handle the setup, as they need to understand the details of installation and configuration. Maintenance usually requires just one SRE or operations engineer.

What was our ROI?

The main ROI benefit comes from the operational side: Spark's operating costs are lower due to the availability of experts and its maturity. However, performance costs might be higher due to the need for more memory and infrastructure.

What's my experience with pricing, setup cost, and licensing?

Compared to other solutions like DuckDB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud solutions like Databricks can simplify the process, they may also be less cost-efficient.

What other advice do I have?

I'd rate the solution eight out of ten. 

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Cloud solution architect at a consultancy with 1-10 employees
Real User
Top 5 Leaderboard
Mar 10, 2024
Offers seamless integration with Azure services and on-premises servers
Pros and Cons
  • "The solution is scalable."
  • "The setup I worked on was really complex."

What is our primary use case?

My contribution primarily focused on the networking aspect, ensuring secure and reliable connections between Azure services and on-premises servers. The solution was complex, involving private links, virtual machines, and custom firewall rules to facilitate secure data transmission.

I use Apache Spark, especially for data processing and analytics.  My work involves a broad range of technologies, including PostgreSQL, Apache Kafka, Spark, and various Azure services. Previously, my focus was more on networking, cybersecurity, and Azure's data services like SQL and Active Directory.

How has it helped my organization?

We've set up a Spark cluster running in Azure to process real-time data. This setup involves connecting Azure applications to the Spark cluster via Azure Private Link, ensuring secure data flow. 

The architecture required detailed network design, including routing through Linux firewalls and ensuring data could be securely transmitted to and from on-premises servers. 

While I was heavily involved in the network design aspect, the Spark cluster was primarily used for processing and analyzing data streams for various applications.

Moreover, from my experience, I haven't encountered significant challenges with integrations involving Spark. The crucial factor is having established connectivity. 

Whether Spark is operating in Azure or on-premises doesn't significantly affect our operations, thanks to high-bandwidth solutions like ExpressRoute. The main consideration then becomes the cost. As long as we maintain performance standards, I don't see any issues, regardless of the deployment environment.

Ensuring the collection of relevant metrics and logs is critical for assessing performance improvements. The specifics of how these are collected or which tools are used might vary, but the goal is to gather comprehensive data for ongoing monitoring and improvement.

What is most valuable?

What I liked about the solution was its uniqueness. We provided the customer with a solution that hadn't been offered by anyone else before. 

It involved multiple components, such as a Spark cluster, CMAX, a backend VM, and a Linux VM for mapping the service processes to the backend, which runs on-premises where the Kafka service runs.

It was challenging for people to understand how to send traffic through the private link between all these services. Ensuring the traffic was sent to the correct destination with the correct source header without any operation issues was complex, but we achieved it.

We ran multiple instances for fault tolerance and scalability.

What needs improvement?

The setup I worked on was really complex.

For how long have I used the solution?

I have been using it for a year. 

What do I think about the stability of the solution?

The solution was definitely stable. There were no unstable services in it. Since most services were in Azure, everything worked better. 

Azure's networking products, like ExpressRoute and Private Link service, are very stable. We didn't encounter any issues with the solution. 

It took some time to complete, but after that, we haven't had a single support case.

What do I think about the scalability of the solution?

The solution is scalable. We used a load balancer at each tier, with multiple instances of the services running. 

It's all scalable and resilient. We didn't have a lot of issues and have been monitoring the traffic flow.

We even projected the requests for the next two to three years and created scalable instances accordingly.

There are many users of Spark in our organization. For example, many customers are using Spark, often in conjunction with requests from third-party vendors. They frequently use Spark plug-ins as well.

Which solution did I use previously and why did I switch?

I've been exploring its capabilities in the OpenAI context, rather than dealing with external databases. 

I've also started using Apache Kafka for messaging and event streaming, which is essential since our solutions often integrate with applications running in Azure, including event hubs and service bus for messaging. This experience includes interfacing with various technologies, not just within Microsoft's ecosystem but also with Amazon Web Services.

Learning new technologies is a continuous process, and I've never found it difficult to adapt, especially with something as foundational as Apache Kafka.

How was the initial setup?

The setup I worked on was really complex, not specifically because of Spark but due to the integration with multiple services. 

It took us about a week to finalize the solution, as understanding the entire workflow and brainstorming on how to maintain private traffic was intricate.

Regarding the deployment process, it involved thorough planning and testing to ensure minimal latency. We managed to achieve a latency of around 20 to 30 milliseconds, which was pretty good.

What about the implementation team?

For the deployment process, once we have a clear understanding of the workflow, the services to be included, how they should be integrated, the policies, and the configurations to be applied, it becomes easier to structure and incorporate it into the ops pipeline. 

We may need to standardize it a bit based on different customer requirements. This standardization allows customers to apply the necessary customizations once it's deployed.

It's a hybrid solution, with about 90% of the services running in the cloud and 10% on-premises.

What's my experience with pricing, setup cost, and licensing?

The licensing costs for Spark would depend on the specific packages and the needs of the project. Costs can vary based on requirements, affordability, and customer expectations. 

Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure. 

If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources. 

The licensing arrangements can differ based on the product and service. Some products might require a license purchase upfront, with subsequent charges based only on usage. 

The availability of hybrid benefits can also influence licensing costs, especially if you're using third-party services like Palo Alto in a VM from the marketplace. If you have an existing license, your costs could be reduced, but purchasing a new license would include licensing fees in the overall cost.

What other advice do I have?

My advice is to thoroughly understand your own needs and environment before making a decision. Recommendations should be based on product features, quality, accuracy, and stability. 

Cost is also a factor, but it should not be the only consideration. Depending on whether the priority is performance and scalability or cost-effectiveness, I would suggest a solution that best meets those needs, whether it's a managed service or a more cost-conscious option.

I would rate Spark as ten out of ten. I haven't had any issues with Spark in my experience.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Hamid M. Hamid
Data architect at a university with 5,001-10,000 employees
Real User
Top 5 Leaderboard
Feb 12, 2024
Along with the easy cluster deployment process, the tool also has the ability to process huge datasets
Pros and Cons
  • "The deployment of the product is easy."
  • "Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users."

What is our primary use case?

In my company, the solution is used for batch processing or real-time processing.

What needs improvement?

The product is mature at this point. Its interoperability, however, is an area of concern where improvements are required.

Apache Spark can be integrated with high-tech tools like Informatica. Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users.

For how long have I used the solution?

I have been using Apache Spark for three years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution a nine out of ten.

What do I think about the scalability of the solution?

It is a very scalable solution. Scalability-wise, I rate the solution a nine out of ten.

There is no specific number of users for Apache Spark in my company, since it is used as a processing engine.

How are customer service and support?

Apache Spark is an open-source tool, so the only support users can get for the tool is from different vendors like Cloudera or HPE.

Which solution did I use previously and why did I switch?

In the past, my company has used certain ETL tools, like Informatica, based on the performance levels offered.

How was the initial setup?

The deployment of the product is easy.

Apache Spark's cluster deployment process is very easy.

Only the application requires a deployment process to run on Apache Spark; Spark itself is set up once. Deploying an application on Apache Spark is easy as a user: you write the code in Scala, submit it to the cluster, and the deployment is done in one step.
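A hedged sketch of that one-step submission, shown here as a PySpark job rather than the reviewer's Scala; the paths and the YARN master are assumptions, not details from the review:

```python
# etl_job.py -- a minimal job to submit to a cluster.
# One-step deployment (host and paths are illustrative):
#   spark-submit --master yarn --deploy-mode cluster etl_job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-job").getOrCreate()

df = spark.read.csv("hdfs:///data/input", header=True, inferSchema=True)
df.groupBy("category").count().write.mode("overwrite").parquet("hdfs:///data/output")
spark.stop()
```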

The solution is deployed on an on-premises model.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source tool. It is not an expensive product.

What other advice do I have?

The tool is used for real-time data analytics as it is very powerful and reliable. The code that you write with Apache Spark is stable; the bugs that appear usually come from your own Java or Scala code rather than from Spark itself, which is impressive. Apache Spark is very reliable, powerful, and fast as an engine. Compared with a competitor like MapReduce, Apache Spark performs up to 100 times better.

The monitoring part of the product is good.

The product offers clusters that are resilient and can run on multiple nodes.

The tool can run with multiple clusters.

The product's integration capabilities with other platforms are good and improve our company's workflow.

In terms of the improvements in the product in the data analysis area, new libraries have been launched to support AI and machine learning.

My company is able to process huge datasets with Apache Spark. There is a huge value added to the organization because of the tool's ability to process huge datasets.

I rate the overall solution a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Sr Manager at a transportation company with 10,001+ employees
Real User
Dec 11, 2023
Offers real-time and near-real-time data processing
Pros and Cons
  • "We use it for ETL purposes as well as for implementing the full transformation pipelines."
  • "Apart from the restrictions that come with its in-memory implementation. It has been improved significantly up to version 3.0, which is currently in use."

What is our primary use case?

We use it for real-time and near-real-time data processing. We use it for ETL purposes as well as for implementing the full transformation pipelines.

What is most valuable?

There is no other platform that can challenge its features, apart from the restrictions that come with its in-memory implementation.

What needs improvement?

The restrictions that come with its in-memory implementation are the main drawback, though it has improved significantly up to version 3.0, which is currently in use.

Once I get those insights, I can let you know whether the restrictions have been overcome. For example, there was an issue with heap memory getting full in version 1.6; there are other improvements in 3.0, so I will check those.

In future releases, I would like to see the cost reduced.

For how long have I used the solution?

We have been using this solution for 11 to 12 years. It is now deployed on the cloud and on-premises; previously, when the version was below 1.6, it was on-premises only.

After version 1.6, it moved to the cloud. I have used it on all the major cloud providers: AWS, GCP, and Azure.

What do I think about the stability of the solution?

It is a stable solution, but patch updates and version upgrades can be a headache.

When an application is built on top of specific versions and one of those versions then changes, a lot of things need to be adjusted. This is something that definitely needs improvement.

How are customer service and support?

I contacted customer service and support six or seven years ago when Spark was still on version 1.6. 

We were struggling with memory limitations and the need for a lift and shift mechanism in a hybrid cloud mode. I contacted one or two people at that time.

How would you rate customer service and support?

Neutral

What's my experience with pricing, setup cost, and licensing?

It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project.

If I propose using Spark for a project, one of the first questions I get from management is about the cost of Databricks Spark on the cloud platform we're using, whether it's Azure, GCP, or AWS. If we could reduce the collection, system conversion, and transformation network costs by even just 2% to 3%, it would be a significant benefit for us.

What other advice do I have?

If your use case involves real-time applications frequently changing columns or data frames, then Spark is a fantastic option for you. 

However, if you have a batch process and don't need structured data analysis, I would suggest avoiding it. The high cost of cloud infrastructure combined with Apache Spark can be a significant burden in such scenarios.

Overall, I would rate the solution a nine out of ten. 

Which deployment model are you using for this solution?

Public Cloud
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
reviewer2534727
Manager Data Analytics at a consultancy with 10,001+ employees
Real User
Top 20
Aug 13, 2024
A flexible solution with real-time processing capabilities
Pros and Cons
  • "I like Apache Spark's flexibility the most. Before, we had one server that would choke up. With the solution, we can easily add more nodes when needed. The machine learning models are also really helpful. We use them to predict energy theft and find infrastructure problems."
  • "For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial."

What is our primary use case?

We use the solution to extract data from our sensors.  We have lots of data streaming into our system, which used to get overwhelmed. We use Apache Spark to handle real-time streaming and do machine learning to predict supply and demand in the market and adjust operations.

What is most valuable?

I like Apache Spark's flexibility the most. Before, we had one server that would choke up. With the solution, we can easily add more nodes when needed. The machine learning models are also really helpful. We use them to predict energy theft and find infrastructure problems.

The tool's real-time processing has had a big impact. We used to get data from sensors after a month; now we get it in less than 10 minutes, which helps us take quick action.

We use Apache Spark to map our data pipelines using MapReduce technology. We're also working on integrating tools like Hive with Apache Spark to distribute our data processing. We can also integrate other tools like Apache Kafka and Hadoop.
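A minimal sketch of the Kafka-to-Spark streaming integration mentioned above, assuming the Spark-Kafka connector package is on the cluster; the broker address and topic name are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "sensor-readings")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to a string.
parsed = events.select(col("value").cast("string").alias("reading"))

query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```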

We faced some challenges when integrating the solution into our existing system, but good documentation helped solve them.

What needs improvement?

For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial.

For how long have I used the solution?

I have been working with the product for five years. 

What do I think about the stability of the solution?

Apache Spark is stable. 

What do I think about the scalability of the solution?

We're a big company with about 4 million consumers. We handle huge amounts of data—around 30,000 sensors send data every 15 minutes, which adds up to 5-10 terabytes per day.

Which solution did I use previously and why did I switch?

Before Apache Spark, we had a different solution: a traditional system with one server handling everything, more like a data warehouse. We switched to Apache Spark because we needed real-time visibility into our operations.

How was the initial setup?

The initial setup process was challenging. We tried to do it ourselves at first, but we weren't used to distributed computing systems, creating nodes, and distributing data. Later, we engaged consulting groups that specialized in it. This is why there's a specific learning curve—it would be challenging for a company to start alone.

The initial deployment took us about six to eight months. We started with three people involved in the deployment process and later increased to five. From a maintenance point of view, it's pretty smooth now. It's not difficult to maintain and doesn't require much maintenance.

What was our ROI?

The tool has helped us reduce costs that run into billions of dollars yearly. The ROI is very significant for us.

Which other solutions did I evaluate?

We did evaluate other options. We started by looking at open-source Hadoop deployment, thinking we'd bring data into HDFS and do machine learning separately. But that would have been a hassle, so Apache Spark was a better fit.

What other advice do I have?

I rate the overall solution a seven out of ten. I would recommend Apache Spark to other users, but it depends on their use cases. I advise new users to get an expert involved from the start.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
UjjwalGupta
Module Lead at a tech vendor with 10,001+ employees
Real User
Top 5 Leaderboard
Mar 14, 2024
Helps to build ETL pipelines and load data to warehouses
Pros and Cons
  • "The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily."
  • "Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial."

What is our primary use case?

We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.
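A minimal sketch of such a pipeline, assuming a runtime with Delta Lake available (for example, Databricks); the paths, columns, and transformation are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("etl-to-delta").getOrCreate()

raw = spark.read.json("/landing/orders")              # extract
clean = (
    raw.where(col("amount") > 0)                      # transform
       .withColumn("order_date", to_date("created_at"))
)
clean.write.format("delta").mode("append").save("/warehouse/orders")  # load
spark.stop()
```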

What is most valuable?

The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily.

What needs improvement?

Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial.

For how long have I used the solution?

I have been using the product for six years. 

What do I think about the stability of the solution?

Apache Spark is generally considered a stable product, with rare instances of breaking down. Issues may arise with sudden increases in data volume, leading to memory errors, but these can typically be managed with autoscaling clusters. Additionally, schema changes or irregularities in streaming data may pose challenges, but these could be addressed in future versions.
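One common guard against streaming schema irregularities is to pin an explicit schema rather than relying on inference. A hedged sketch, with illustrative field names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-pinned-stream").getOrCreate()

# Declaring the schema up front means unexpected fields cannot silently
# change the pipeline; file-based streaming requires a schema anyway.
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
    StructField("ts", TimestampType()),
])

stream = spark.readStream.schema(schema).json("/landing/events")
stream.writeStream.format("console").start().awaitTermination()
```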

What do I think about the scalability of the solution?

About 70-80 percent of employees in my company use the product. 

How are customer service and support?

We haven't contacted Apache Spark support directly because it's an open-source tool. However, when using it as a product within Databricks, we've contacted Databricks support for assistance.

Which solution did I use previously and why did I switch?

The main reason our company opted for the product is its capability to process large volumes of data. While other options like Snowflake offer some advantages, they may have limitations regarding custom logic or modifications.

How was the initial setup?

The solution's setup and installation of Apache Spark can vary in complexity depending on whether it's done in a standalone or cluster environment. The process is generally more straightforward in a standalone setup, especially if you're familiar with the concepts involved. However, setting up in a cluster environment may require more knowledge about clusters and networking, making it potentially more complex.

What's my experience with pricing, setup cost, and licensing?

The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks.

What other advice do I have?

If you're new to Apache Spark, the best way to learn is by using the Databricks Community Edition. It provides a cluster for Apache Spark where you can learn and test. I rate the product an eight out of ten.

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Atal Upadhyay
AVP at a paper AND forest products with 1,001-5,000 employees
Real User
Top 5 Leaderboard
Apr 8, 2024
Allows us to consume data from any data source and has a remarkable processing power
Pros and Cons
  • "With Spark, we parallelize our operations, efficiently accessing both historical and real-time data."
  • "It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework."

What is our primary use case?

We pull data from various sources and process it for reporting purposes, utilizing a prominent visual analytics tool.

How has it helped my organization?

Our experience with using Spark for machine learning and big data analytics allows us to consume data from any data source, including freely available data. The processing power of Spark is remarkable, making it our top choice for file-processing tasks.

Utilizing Apache Spark's in-memory processing capabilities significantly enhances our computational efficiency. Unlike with Oracle, where customization is limited, we can tailor Spark to our needs. This allows us to pull data, perform tests, and save processing power. We maintain a historical record by loading intermediate results and retrieving data from previous iterations, ensuring our applications operate seamlessly. With Spark, we parallelize our operations, efficiently accessing both historical and real-time data.
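A minimal sketch of reusing an intermediate result via Spark's in-memory caching, which is the mechanism behind this kind of saving; the paths and columns are illustrative:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-intermediate").getOrCreate()

base = spark.read.parquet("/data/transactions")
enriched = base.where("amount > 0")

# Keep the intermediate DataFrame in memory (spilling to disk if needed)
# so each downstream query avoids recomputing it from scratch.
enriched.persist(StorageLevel.MEMORY_AND_DISK)

daily = enriched.groupBy("day").sum("amount")
by_user = enriched.groupBy("user_id").count()
daily.show()
by_user.show()

enriched.unpersist()
spark.stop()
```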

We utilize Apache Spark for our data analysis tasks. Our data processing pipeline starts with receiving data in raw format. We employ a data factory to create pipelines for data processing. This ensures that the data is prepared and made ready for various purposes, such as supporting applications or analysis.

There are instances where we perform data cleansing operations and manage the database, including indexing. We've implemented automated tasks to analyze data and optimize performance, focusing specifically on database operations. These efforts are independent of the Spark platform but contribute to enhancing overall performance.

What needs improvement?

It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework.

For how long have I used the solution?

I've been engaged with Apache Spark for about a year now, but my company has been utilizing it for over a decade.

What do I think about the stability of the solution?

It offers a high level of stability. I would rate it nine out of ten.

What do I think about the scalability of the solution?

It serves as a data node, making it highly scalable. It caters to a user base of around five thousand or so.

How was the initial setup?

The initial setup isn't complicated, but it varies from person to person. For me, it wasn't particularly complex; it was straightforward to use.

What about the implementation team?

Once the solution is prepared, we deploy it onto both the staging server and the production server. Previously, we had a dedicated individual responsible for deploying the solution across multiple machines. We manage three environments: development, staging, and production. The deployment process varies, sometimes utilizing a tenant model and other times employing blue-green deployment, depending on the situation. This ensures the seamless setup of servers and facilitates smooth operations.

What other advice do I have?

Given our extensive experience with it and its ability to meet all our requirements over time, I highly recommend it. Overall, I would rate it nine out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
Miodrag Milojevic
Senior Data Architect at a comms service provider with 1,001-5,000 employees
Real User
Top 20
Aug 18, 2023
Parallel computing helped create data lakes with near real-time loading
Pros and Cons
  • "It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
  • "If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of D allocation."

What is our primary use case?

I use the solution for data lakes and big data solutions, and I can combine it with other programming languages.

What is most valuable?

One of the reasons we use Spark is to exploit parallelism in our data lakes. In our case, we have many data nodes, and the main power of Hadoop and big data solutions is the number of nodes usable for different operations. It's easy to set up parallelism in Spark, run a job with specific parameters, and get good performance. Spark also has an option for near-real-time loading and processing; we use Spark micro-batches.
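A hedged sketch of both points, tuning one parallelism parameter and loading in micro-batches; the values, schema, and paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("micro-batch-load")
    # Shuffle parallelism for wide operations; tune to the cluster's node count.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

stream = spark.readStream.schema("sensor_id STRING, value DOUBLE").json("/landing")

query = (
    stream.writeStream
    .trigger(processingTime="1 minute")   # near-real-time micro-batches
    .format("parquet")
    .option("path", "/lake/sensor")
    .option("checkpointLocation", "/lake/_chk/sensor")
    .start()
)
query.awaitTermination()
```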

What needs improvement?

If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of dynamic allocation. In combination with other tools, many sessions remain even when you think they've stopped. This is the main problem with big data sessions: zombie sessions linger, and you have to take care of them; otherwise, they consume resources and cause problems.
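A small safeguard against such lingering sessions is to stop the session explicitly even when a job fails partway through; a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("well-behaved-job").getOrCreate()
try:
    spark.range(1_000_000).selectExpr("sum(id)").show()
finally:
    spark.stop()   # releases executors so no zombie session remains
```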

For how long have I used the solution?

I've been using Apache Spark for more than two years. I'm using the latest version.

What do I think about the stability of the solution?

The solution is stable, but not completely. For example, we use Exadata as an extremely stable data warehouse, but that level of stability is not possible with big data; there are things you have to fix sometimes. The stability is comparable to a cloud solution, but it depends on the solution you need.

What do I think about the scalability of the solution?

The solution is scalable, but adding new nodes is not easy; it takes some time. We have about 20 users of Apache Spark, and we use the solution regularly.

How are customer service and support?

We use the Cloudera distribution, so we ask Cloudera for support rather than relying on the open-source community.

How was the initial setup?

When you install the complete environment, Spark is installed as part of the solution. The setup can be tricky when introducing security, such as connecting to Spark using Kerberos, because the architecture is distributed across many servers and you have to prepare Kerberos on every server. It's not possible to do this in one place.

Deploying Apache Spark is pretty complex, but that is down to our security approach: our security team requested mandatory Kerberos authentication, which can be complex. We had five people for maintenance and deployment, not to mention other roles.

What about the implementation team?

We had an external integrator, but we also had in-house knowledge. Sometimes, we need to change or install something, and it's not good to ask the integrator for everything because of availability and planning. We had more freedom thanks to our internal knowledge.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is an open-source route without Cloudera, but in that case, you don't have any vendor support: if you face a problem, you might find something in the community, but you cannot ask Cloudera about it. With open source you have a community instead of support. Cloudera offers different packages, which are licensed versions of products like Apache Spark; with those, you can ask Cloudera for everything.

What other advice do I have?

Spark was written in Scala. Scala is a programming language built on the Java platform and is useful for data lakes.

We considered using Flink instead, but it wasn't useful for us and wouldn't have added any value. Besides, Spark's community is much wider, so more information is available than for Flink.

I rate Apache Spark an eight out of ten.

If you plan to implement Apache Spark in a large-scale system, you should learn to use parallelism, partitioning, and everything down to the physical level to get the best performance from Spark. It is also good to know Python, especially for data scientists using PySpark for analysis. Likewise, it's good to know Scala: since it is Spark's native language, you can be very efficient in preparing datasets with it.
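A hedged sketch of the partitioning advice, with an illustrative column and paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

df = spark.read.parquet("/lake/events")

(
    df.repartition("event_date")         # align in-memory partitions with the key
      .write.partitionBy("event_date")   # physical directory layout on disk
      .mode("overwrite")
      .parquet("/lake/events_partitioned")
)
spark.stop()
```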

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.