Sr Manager at a transportation company with 10,001+ employees
Real User
Dec 11, 2023
Offers real-time and near-real-time data processing
Pros and Cons
  • "We use it for ETL purposes as well as for implementing the full transformation pipelines."
  • "The restrictions that come with its in-memory implementation are the main drawback, though it has been improved significantly up to version 3.0, which is currently in use."

What is our primary use case?

We use it for real-time and near-real-time data processing. We use it for ETL purposes as well as for implementing the full transformation pipelines.

What is most valuable?

There is no other platform that can challenge its features, apart from the restrictions that come with its in-memory implementation.

What needs improvement?

The restrictions that come with its in-memory implementation are the main drawback. It has been improved significantly up to version 3.0, which is currently in use.

Once I get those insights, I can let you know if the restrictions have been overcome. For example, there is an issue with heap memory getting full in version 1.6. There are other improvements in 3.0, so I will check those. 
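For readers who hit similar heap pressure, the usual first lever is executor and driver memory sizing at submit time. The values and file name below are illustrative, not from this deployment; note that Spark 1.6 introduced the unified memory manager, which is part of why later versions handle this pattern better.

```
# Illustrative spark-submit memory settings (tune per cluster and workload)
spark-submit \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.executor.memoryOverhead=1g \
  my_job.py
```

Raising `spark.executor.memoryOverhead` in particular helps when containers are killed for exceeding memory limits rather than the JVM heap itself filling up.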

In future releases, I would like to reduce the cost.

For how long have I used the solution?

We have been using this solution for 11 to 12 years. It is now deployed on the cloud or on-premises. Previously, when the version was below 1.6, it was on-premises.

After version 1.6, it moved to the cloud. I have used it on all the major cloud providers: AWS, GCP, and Azure.

Buyer's Guide
Apache Spark
March 2026
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: March 2026.
884,706 professionals have used our research since 2012.

What do I think about the stability of the solution?

It is a stable solution, but when it comes to patch updates and different reports being updated, it can be a headache. 

When an application is built on top of certain reports with specific versions and the version changes, many things need to be adjusted. This is something that definitely needs improvement.

How are customer service and support?

I contacted customer service and support six or seven years ago when Spark was still on version 1.6. 

We were struggling with memory limitations and the need for a lift and shift mechanism in a hybrid cloud mode. I contacted one or two people at that time.

How would you rate customer service and support?

Neutral

What's my experience with pricing, setup cost, and licensing?

It is quite expensive. In fact, it accounts for almost 50% of the cost of our entire project.

If I propose using Spark for a project, one of the first questions I get from management is about the cost of Databricks Spark on the cloud platform we're using, whether it's Azure, GCP, or AWS. If we could reduce the collection, system conversion, and transformation network costs by even just 2% to 3%, it would be a significant benefit for us.

What other advice do I have?

If your use case involves real-time applications frequently changing columns or data frames, then Spark is a fantastic option for you. 

However, if you have a batch process and don't have structured data analysis, I would suggest avoiding it. The high cost of cloud infrastructure combined with Apache Spark can be a significant burden in such scenarios.

Overall, I would rate the solution a nine out of ten. 

Which deployment model are you using for this solution?

Public Cloud
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Bharghava Raghavendra Beesa - PeerSpot reviewer
Senior Developer at Infosys
MSP
Top 5 Leaderboard
Jan 22, 2025
Faster data transformations achieved but scheduling dependencies require external solutions
Pros and Cons
  • "Spark is used for transformations of large volumes of data, and its distributed nature is useful."
  • "The Spark solution could improve in scheduling tasks and managing dependencies."

What is our primary use case?

I have some hands-on experience with Spark, roughly six months to one year of working with it. We use it for faster processing, especially compute.

Spark is used for transformations of large volumes of data, and its distributed nature is useful. We receive data from various sources and need to transform it. The data is enormous, in terabytes, and often from specific databases. We perform transformations, aggregations, and deduplication.

We meet business requirements by computing data, minimizing it, aggregating it, or performing other operations. We typically write to Hive downstream.

What is most valuable?

Spark is faster and distributed. Previously, everything relied on MapReduce, which was slower. With Spark, multiple computations and transformations are held in memory for faster processing.

Real-time communication is possible, connecting with platforms like Kafka for real-time data import and compute. We implemented Spark and NiFi for integration. Spark replaced other costly products, reducing costs by thirty-eight percent.

What needs improvement?

The Spark solution could improve in scheduling tasks and managing dependencies. Spark alone cannot handle sequential tasks, requiring external tools like the Airflow scheduler or scripts. For instance, one task should trigger another on completion; however, Spark can't manage these dependent loads on its own. We focus on the specific compute tasks that we can deliver.
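As a minimal stopgap when a full scheduler like Airflow is not available, shell-level chaining can express the "run B only if A succeeded" dependency described above. The job file names here are hypothetical:

```
# Run the ingest job; trigger the aggregation job only if it exits successfully.
spark-submit ingest_job.py && spark-submit aggregate_job.py \
  || echo "pipeline failed" >&2
```

A real scheduler is still preferable for retries, backfills, and alerting; this only covers the simplest completion-based dependency.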

For how long have I used the solution?

I have six months to one year of experience working with Spark.

What do I think about the stability of the solution?

Spark is stable; however, efficient use is necessary for running jobs seamlessly.

What do I think about the scalability of the solution?

Spark is scalable.

Which solution did I use previously and why did I switch?

I didn't work on any AI-based projects with Spark; however, it supports external AI capabilities.

How was the initial setup?

The initial setup is complex. Logging requires configuration and has to match the cluster setup. Communication within the nodes and setting up the external logging supported by Spark are challenging.

Which other solutions did I evaluate?

On the compute side, I worked on Snowflake as well.

What other advice do I have?

I recommend Spark for working with large-scale big data. It is crucial to have skilled technicians. Overall product rating: seven out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Head of Data Science center of excellence at Ameriabank CJSC
Real User
Top 5 Leaderboard
Sep 25, 2024
Enhanced data processing with good support and helpful integration with Pandas syntax in distributed mode
Pros and Cons
  • "The most significant advantage of Spark 3.0 is its support for Pandas UDFs on DataFrames."
  • "The main concern is the overhead of Java when distributed processing is not necessary."

What is our primary use case?

The primary use case for Apache Spark is to process big data in memory using its distributed engine. It is used for various tasks, such as running the association rules algorithm in Spark ML, running XGBoost in parallel using the Spark engine, and preparing data for online machine learning using Spark Streaming mode.

How has it helped my organization?

The most significant cost savings come from the operational side because Spark is very common in operations. There are many experts available in the market to operate Spark, making it easier to find the right personnel. It is quite mature, which reduces operating costs.

What is most valuable?

The most significant advantage of Spark 3.0 is its support for Pandas UDFs on DataFrames. This allows running Pandas code distributed by the Spark engine, which is a crucial feature. The integration with Pandas syntax in distributed mode, along with user-defined functions in PySpark, is particularly valuable.
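To illustrate the Pandas UDF idea: the heavy lifting is written as an ordinary pandas function, and Spark 3.x can distribute it across DataFrame partitions. The function name `zscore` and the column `amount` are illustrative, not from this review; the sketch runs the pandas part standalone, with the PySpark wrapping shown in comments because it needs a live Spark session.

```python
import pandas as pd

def zscore(s: pd.Series) -> pd.Series:
    # Plain pandas logic: standardize a numeric column.
    return (s - s.mean()) / s.std()

# With PySpark >= 3.0 the same function can be distributed over a DataFrame
# (requires a running Spark session, so it is only sketched here):
#   from pyspark.sql.functions import pandas_udf
#   zscore_udf = pandas_udf(zscore, returnType="double")
#   df.select(zscore_udf("amount")).show()

print(zscore(pd.Series([1.0, 2.0, 3.0])).tolist())  # [-1.0, 0.0, 1.0]
```

Because Spark passes each partition to the function as a `pd.Series`, the same vectorized code runs locally for testing and distributed in production.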

What needs improvement?

The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary. Consequently, alternatives like Doc DB are preferable. Additionally, performance in some cases is slower, with alternatives being two to five times faster.

For how long have I used the solution?

I have more than ten years of experience using Spark, starting from when it was first introduced.

What do I think about the stability of the solution?

Spark is very stable for our needs. It offers amazing stability.

What do I think about the scalability of the solution?

Scalability depends on how infrastructure is organized. Better balance and network considerations are necessary. However, Spark is very stable when scaled appropriately.

How are customer service and support?

Customer support for Apache Spark is very good. There is a lot of documentation and forums available, making it easier to find solutions. The Databricks team also does a lot to support Spark.

How would you rate customer service and support?

Positive

How was the initial setup?

The initial setup of Spark can take about a week, assuming the right infrastructure is already in place.

What about the implementation team?

A few technicians are typically required for installation and configuration. SRE engineers or operational guys handle the setup, as they need to understand the details about installation and configuration. Maintenance usually requires just an SRE engineer or operational guy.

What was our ROI?

The main benefit in terms of ROI comes from the operation side. Spark’s operational costs are lower due to the availability of experts and its maturity. However, performance costs might be higher due to the need for more memory and infrastructure.

What's my experience with pricing, setup cost, and licensing?

Compared to other solutions like Doc DB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud solutions like Databricks can simplify the process, they may also be less cost-efficient.

What other advice do I have?

I'd rate the solution eight out of ten. 

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
SurjitChoudhury - PeerSpot reviewer
Data engineer at Cocos pt
Real User
Top 20 Leaderboard
Mar 16, 2024
Offers batch processing of data and in-memory processing in Spark greatly enhances performance
Pros and Cons
  • "Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data—customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark. Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more."
  • "There could be enhancements in optimization techniques, as there are some limitations in this area that could be addressed to further refine Spark's performance."

What is our primary use case?

Our main use cases for Spark are Apache Spark SQL and sometimes Spark Streaming to process streaming data.

Like most solutions, we got data from SAP or Azure Data Warehouse, supposing the client was using Azure Cloud technology. The data that comes from there is relational or sometimes semi-structured, like JSON files.

So, we process the data with Spark, writing the code with PySpark (essentially Python, which Spark allows) to create the data frames and load the data into Tableau format.

Then we load it into some database, like SQL Server or any other database, from where the business data scientists or analysts pick up the data. The data can come from any sort of source, such as e-commerce sites.

So, previously, we used mostly structured data, which was stored in SAP, mainframe Oracle, or any other system, provided in structured formats like CSV.

Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data—customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark.

Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more.

Before Spark, there was MapReduce, but it was much slower. Even running the same query a second time would be time-consuming due to the I/O operations with disk storage. Spark was introduced to address these issues, offering processing speeds a hundred times faster than MapReduce, an initiative that saw contributions from Adobe Systems among others.

So, in response to the evolving needs of the industry, Spark has proven to be the solution, efficiently handling the processing requirements we face today.

What is most valuable?

Spark supports real-time data processing through Spark Streaming. It allows for batch processing of data. If you have immediate data, like chat information, that needs to be processed in real-time, Spark Streaming is used. 

For data that can be evaluated later, batch processing with Apache Spark is suitable. Mostly, batch processing is utilized in our organization, but for streaming data processing, tools like Kafka are often integrated.

In-memory processing in Spark greatly enhances performance, making it a hundred times faster than the previous MapReduce methods. This improvement is achieved through optimization techniques like caching, broadcasting, and partitioning, which help in optimizing queries for faster processing.
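The optimization techniques mentioned above (caching, broadcasting, partitioning) are ultimately controlled through ordinary Spark configuration. The values below are illustrative defaults, not this organization's settings:

```
# spark-defaults.conf (illustrative values; tune per workload)
spark.sql.autoBroadcastJoinThreshold  10485760   # broadcast tables under ~10 MB in joins
spark.sql.shuffle.partitions          200        # partition count after shuffles
spark.serializer  org.apache.spark.serializer.KryoSerializer
```

In application code the same techniques appear as `df.cache()` for reuse and `broadcast(df)` join hints; the config layer just sets the thresholds at which Spark applies them automatically.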

What needs improvement?

There could be enhancements in optimization techniques, as there are some limitations in this area that could be addressed to further refine Spark's performance.

For how long have I used the solution?

I've used it for four years. 

How are customer service and support?

In the community forums, I asked questions a while back when I was new. However, the responses came from other users in the community, not the official Apache Spark organization. So, I am not sure about the proficiency. 

Since it's open-source, most questions happen in the community. For enterprise support, I imagine the response speed would be different. 

Which solution did I use previously and why did I switch?

I have also used Hadoop.

The main reason for choosing Apache Spark was for big data solutions. Hadoop was introduced earlier, and most organizations were using Hadoop or cloud data platforms. 

Then, Apache Spark came into the picture, and it was much faster. It's kind of taking the place of Hadoop. Organizations using Hadoop are now primarily focusing on Apache Spark for support.

So, for big data computing tasks, Hadoop is like a base layer, and Spark is another layer on top of that. Organizations using Hadoop technologies and big data technologies in general have adopted Spark.

There aren't really other comparable tools for big data computing tasks. But, resource managers like Kubernetes and YARN are used with Spark. YARN was used in Hadoop big data technology, but now Kubernetes is more commonly used for resource management.

How was the initial setup?

Resource allocation and optimization in the computing tasks are different for on-premise systems. 

In cloud environments, resource allocation is already handled by the cloud provider, so you don't need to worry about it. 

On-prem, if you're using Hadoop with Spark, resource allocation might be handled by Kubernetes or YARN. These tools provide feedback to the Spark driver about available resources, and the driver allocates tasks to worker nodes based on that information.

What other advice do I have?

Overall, I would rate the solution a nine out of ten. 

I would recommend this tool to someone considering it for scalable data processing.

Nowadays, Apache Spark is on the market, and most organizations are using it. There are people with more experience and knowledge than me, and they're confident about this tool. 

That's why it's become a solution for organizations. It's not a one-man decision but rather a group or community effort.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Atal Upadhyay - PeerSpot reviewer
AVP at MIDDAY INFOMEDIA LIMITED
Real User
Top 5 Leaderboard
Apr 8, 2024
Allows us to consume data from any data source and has a remarkable processing power
Pros and Cons
  • "With Spark, we parallelize our operations, efficiently accessing both historical and real-time data."
  • "It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework."

What is our primary use case?

We pull data from various sources and process it with Spark for reporting purposes, utilizing a prominent visual analytics tool.

How has it helped my organization?

Our experience with using Spark for machine learning and big data analytics allows us to consume data from any data source, including freely available data. The processing power of Spark is remarkable, making it our top choice for file-processing tasks.

Utilizing Apache Spark's in-memory processing capabilities significantly enhances our computational efficiency. Unlike with Oracle, where customization is limited, we can tailor Spark to our needs. This allows us to pull data, perform tests, and save processing power. We maintain a historical record by loading intermediate results and retrieving data from previous iterations, ensuring our applications operate seamlessly. With Spark, we parallelize our operations, efficiently accessing both historical and real-time data.

We utilize Apache Spark for our data analysis tasks. Our data processing pipeline starts with receiving data in raw format. We employ a data factory to create pipelines for data processing. This ensures that the data is prepared and made ready for various purposes, such as supporting applications or analysis.

There are instances where we perform data cleansing operations and manage the database, including indexing. We've implemented automated tasks to analyze data and optimize performance, focusing specifically on database operations. These efforts are independent of the Spark platform but contribute to enhancing overall performance.

What needs improvement?

It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework.

For how long have I used the solution?

I've been engaged with Apache Spark for about a year now, but my company has been utilizing it for over a decade.

What do I think about the stability of the solution?

It offers a high level of stability. I would rate it nine out of ten.

What do I think about the scalability of the solution?

It serves as a data node, making it highly scalable. It caters to a user base of around five thousand or so.

How was the initial setup?

The initial setup isn't complicated, but it varies from person to person. For me, it wasn't particularly complex; it was straightforward to use.

What about the implementation team?

Once the solution is prepared, we deploy it onto both the staging server and the production server. Previously, we had a dedicated individual responsible for deploying the solution across multiple machines. We manage three environments: development, staging, and production. The deployment process varies, sometimes utilizing a tenant model and other times employing blue-green deployment, depending on the situation. This ensures the seamless setup of servers and facilitates smooth operations.

What other advice do I have?

Given our extensive experience with it and its ability to meet all our requirements over time, I highly recommend it. Overall, I would rate it nine out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Miodrag Milojevic - PeerSpot reviewer
Senior Data Architect at Yettel
Real User
Top 20
Aug 18, 2023
Parallel computing helped create data lakes with near real-time loading
Pros and Cons
  • "It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
  • "If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of deallocation."

What is our primary use case?

I use the solution for data lakes and big data solutions. I can combine it with the other program languages.

What is most valuable?

One of the reasons we use Spark is so we can use parallelism in data lakes. In our case, we can get many data nodes, and the main power of Hadoop and big data solutions is the number of nodes usable for different operations. It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance. Also, Spark has an option for near real-time loading and processing; we use Spark's micro-batches.

What needs improvement?

If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of deallocation. In combination with other tools, many sessions remain even if you think they've stopped. This is the main problem with big data sessions: zombie sessions reside that you have to take care of. Otherwise, they consume resources and cause problems.
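When such zombie sessions run on YARN, they can be located and killed from the command line. The application ID below is made up for illustration:

```
# List running YARN applications, then kill the lingering Spark session.
yarn application -list -appStates RUNNING
yarn application -kill application_1692345600000_0042
```

Inside application code, calling `spark.stop()` explicitly on shutdown also helps avoid leaking sessions in the first place.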

For how long have I used the solution?

I've been using Apache Spark for more than two years. I'm using the latest version.

What do I think about the stability of the solution?

The solution is stable, but not completely. For example, we use Exadata as an extremely stable data warehouse, but that's not possible with big data. There are things that you have to fix sometimes. The stability is similar to the cloud solution, but that depends on the solution you need.

What do I think about the scalability of the solution?

The solution is scalable, but adding new nodes is not easy. It will take some time to do that, but it's scalable. We have about 20 users using Apache Spark. We regularly use the solution.

How are customer service and support?

We use the Cloudera distribution, which is not open-source, so we ask Cloudera for support.

How was the initial setup?

When you install the complete environment, you install Spark as a part of this solution. The setup can be tricky when introducing security, such as connecting Spark using Kerberos. It can be tricky because when you use it, you have to distribute your architecture with many servers, and even then, you have to prepare Kerberos on every server. It's not possible to do this in one place.

Deploying Apache Spark is pretty complex, but that is a problem with the security approach. Our security team requested this, so we use Kerberos authentication mandatorily, which can be complex. We had five people for maintenance and deployment, not to mention other roles.
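For context, a kerberized submission typically pairs a keytab and principal with spark-submit. The paths, hostnames, and realm below are placeholders:

```
# Obtain a ticket, then submit with a keytab so long-running jobs can renew it.
kinit -kt /etc/security/keytabs/spark.keytab spark/node1.example.com@EXAMPLE.COM
spark-submit \
  --master yarn \
  --principal spark/node1.example.com@EXAMPLE.COM \
  --keytab /etc/security/keytabs/spark.keytab \
  my_job.py
```

As the review notes, the keytab and Kerberos client configuration must be present on every node involved, which is where much of the operational complexity comes from.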

What about the implementation team?

We had an external integrator, but we also had in-house knowledge. Sometimes, we need to change or install something, and it's not good to ask the integrator for everything because of availability and planning. We had more freedom thanks to our internal knowledge.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera. But in that case, you don't have any support. If you face a problem, you might find something in the community, but you cannot ask Cloudera about it. If you have open source, you don't have support, but you have a community. Cloudera has different packages, which are licensed versions of products like Apache Spark. In this case, you can ask Cloudera for everything.

What other advice do I have?

Spark was written in Scala. Scala is a programming language fundamentally based on Java, and it is useful for data lakes.

We thought about using Flink instead, but it wasn't useful for us and wouldn't have gained us any additional value. Besides, Spark's community is much wider, so more information is available than for Flink.

I rate Apache Spark an eight out of ten.

If you plan to implement Apache Spark on a large-scale system, you should learn to use parallelism, partitioning, and everything from the physical level to get the best performance from Spark. And it will be good to know Python, especially for data scientists using PySpark for analysis. Likewise, it's good to know Scala because you can be very efficient in preparing some datasets since it is Spark's native language.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Hamid M. Hamid - PeerSpot reviewer
Data architect at Banking Sector
Real User
Feb 12, 2024
Along with the easy cluster deployment process, the tool also has the ability to process huge datasets
Pros and Cons
  • "The deployment of the product is easy."
  • "Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users."

What is our primary use case?

In my company, the solution is used for batch processing or real-time processing.

What needs improvement?

The product is mature at the moment. The product's interoperability is an area of concern where improvements are required.

Apache Spark can be integrated with high-tech tools like Informatica. Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users.

For how long have I used the solution?

I have been using Apache Spark for three years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution a nine out of ten.

What do I think about the scalability of the solution?

It is a very scalable solution. Scalability-wise, I rate the solution a nine out of ten.

There is no specific number of users for Apache Spark in my company since it is used as a processing engine.

How are customer service and support?

Apache Spark is an open-source tool, so the only support users can get for the tool is from different vendors like Cloudera or HPE.

Which solution did I use previously and why did I switch?

In the past, my company has used certain ETL tools, like Informatica, based on the performance levels offered.

How was the initial setup?

The deployment of the product is easy.

Apache Spark's cluster deployment process is very easy.

Once Apache Spark itself is set up, only a deployment process is required for an application to run on it. Deploying an application using Apache Spark is easy as a user: you just write the code in Scala, submit it to the cluster, and the deployment can be done in one step.
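That one-step submission looks roughly like this; the class name, jar, and master settings are placeholders, not from this deployment:

```
# Package the Scala code into a jar, then submit it to the cluster in one step.
spark-submit \
  --class com.example.etl.MainJob \
  --master yarn \
  --deploy-mode cluster \
  my-etl-assembly.jar
```

With `--deploy-mode cluster`, the driver itself runs on the cluster, so the submitting machine can disconnect after submission.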

The solution is deployed on an on-premises model.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source tool. It is not an expensive product.

What other advice do I have?

The tool is used for real-time data analytics as it is very powerful and reliable. The code that you write with Apache Spark provides stability; many of the bugs that appear come from your own code, whether Java or Scala, which is impressive. Apache Spark is very reliable, powerful, and fast as an engine. When compared with a competitor like MapReduce, Apache Spark performs 100 times better.

The monitoring part of the product is good.

The product offers clusters that are resilient and can run into multiple nodes.

The tool can run with multiple clusters.

The integration capabilities of the product with other platforms to improve our company's workflow are good.

In terms of the improvements in the product in the data analysis area, new libraries have been launched to support AI and machine learning.

My company is able to process huge datasets with Apache Spark. There is a huge value added to the organization because of the tool's ability to process huge datasets.

I rate the overall solution a nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Ilya Afanasyev - PeerSpot reviewer
Senior Software Development Engineer at Yahoo!
Real User
Aug 22, 2022
Reliable, able to expand, and handle large amounts of data well
Pros and Cons
  • "There's a lot of functionality."
  • "I know there is always discussion about which language to write applications in and some people do love Scala. However, I don't like it."

What is our primary use case?

It's a core product that we use in our pipeline.

We have some input data. For example, we have one system that supplies some data to MongoDB, for example, and we pull this data from MongoDB, enrich this data from other systems - with some additional fields - and write to S3 for other systems. Since we have a lot of data, we need a parallel process that runs hourly.

What is most valuable?

We use batch processing. It works well with our formats and file versions. There's a lot of functionality. 

In our pipeline, each hour we make a copy of the changes from MongoDB to a specific file. Previously, the pipeline copied all of the data each time, even without changes to the tables. The tables have a lot of data, and in the latest MongoDB version, there is a possibility to read only changed data. This reduced the cost and configuration of the cluster, and we saved about $150,000.

The solution is scalable.

It's a stable product.

What needs improvement?

The primary language for developers on Spark is Scala. Now it's also about Java. I prefer Java over Scala, and since both are supported, that is good. I know there is always discussion about which language to write applications in, and some people do love Scala. However, I don't like it.

The JDK version they currently support is a little bit old, and not all features are available on it. Maybe they should update the supported JDK version.

For how long have I used the solution?

I've used the solution for a year and a half. 

What do I think about the stability of the solution?

The solution is stable. There are no bugs or glitches. It doesn't crash or freeze. 

What do I think about the scalability of the solution?

The product scales well. It's fine to expand if needed. 

Many teams use Spark. For example, we have a few kinds of pipelines, huge pipelines. One of them processes 300 billion events each day. It's our core technology currently.

We do not plan to increase usage. We keep our legacy system on Spark, and we are now discussing Flink and Spark and what we would prefer. However, most of the people are already migrating new systems to Flink. We will keep Spark for a few more years still. 

How are customer service and support?

We have an internal team that participates in the development of Spark. They are Spark contributors, and if we have problems, we turn to them. They are our own people, and they work on Spark. Generally, if the problem is minor, we look at some sites or discuss it with internal people who have experience with Spark.

Which solution did I use previously and why did I switch?

We also use Flink.

Before Spark, at another company, I worked with different technologies, including Kafka, Radius, Postgres SQL, S3, and Spring.

How was the initial setup?

I didn't handle the initial setup. We were using this pipeline and clusters already. I just installed it on my local server. However, in terms of difficulty, I didn't see any problem. The deployment might only take a few hours. 

I found some documentation. I got the documentation from the site and downloaded the archive and unzipped it, and installed it. I can't say that I installed something from a special configuration. I just installed a few nodes for debugging and for running locally, and that's all. Also, in one case I used, for example, a Docker configuration with Spark. It all worked fine.

What's my experience with pricing, setup cost, and licensing?

It's an open-source product. I don't know much about the licensing aspect. 

Which other solutions did I evaluate?

We have compared Flink and Spark as two possible options. 

What other advice do I have?

I can recommend the product. It's a nice system for batch processing huge data.

I'd rate the solution eight out of ten. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
Updated: March 2026