reviewer2534727 - PeerSpot reviewer
Manager Data Analytics at a consultancy with 10,001+ employees
Real User
A flexible solution with real-time processing capabilities
Pros and Cons
  • "I like Apache Spark's flexibility the most. Before, we had one server that would choke up. With the solution, we can easily add more nodes when needed. The machine learning models are also really helpful. We use them to predict energy theft and find infrastructure problems."
  • "For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial."

What is our primary use case?

We use the solution to extract data from our sensors. We have lots of data streaming into our system, which used to get overwhelmed. We use Apache Spark to handle real-time streaming and do machine learning to predict supply and demand in the market and adjust operations.

What is most valuable?

I like Apache Spark's flexibility the most. Before, we had one server that would choke up. With the solution, we can easily add more nodes when needed. The machine learning models are also really helpful. We use them to predict energy theft and find infrastructure problems.

The tool's real-time processing has had a big impact. We used to get data from sensors after a month. Now we get it in less than 10 minutes, which helps us take quick action.

We use Apache Spark to map our data pipelines using MapReduce technology. We're also working on integrating tools like Hive with Apache Spark to distribute our data processing. We can also integrate other tools like Apache Kafka and Hadoop.

We faced some challenges when integrating the solution into our existing system, but good documentation helped solve them.

What needs improvement?

For improvement, I think the tool could make things easier for people who aren't very technical. There's a significant learning curve, and I've seen organizations give up because of it. Making it quicker or easier for non-technical people would be beneficial.

For how long have I used the solution?

I have been working with the product for five years. 

Buyer's Guide
Apache Spark
December 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: December 2024.
824,067 professionals have used our research since 2012.

What do I think about the stability of the solution?

Apache Spark is stable. 

What do I think about the scalability of the solution?

We're a big company with about 4 million consumers. We handle huge amounts of data—around 30,000 sensors send data every 15 minutes, which adds up to 5-10 terabytes per day.
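
As a rough sanity check on those figures, the sketch below works out how many readings arrive per day and what average payload per reading the quoted 5-10 terabytes would imply. It assumes each sensor sends exactly one message per 15-minute interval and that a terabyte is 10^12 bytes; neither detail is stated in the review.

```python
# Back-of-the-envelope check of the volumes quoted in the review.
# Assumptions (not from the review): one message per sensor per interval,
# and 1 TB = 10**12 bytes.

SENSORS = 30_000
INTERVAL_MINUTES = 15
TB = 10**12


def readings_per_day(sensors: int, interval_minutes: int) -> int:
    """Number of messages per day if every sensor reports once per interval."""
    intervals_per_day = (24 * 60) // interval_minutes
    return sensors * intervals_per_day


def implied_bytes_per_reading(daily_terabytes: float, readings: int) -> float:
    """Average payload size implied by a given daily volume."""
    return daily_terabytes * TB / readings


daily_readings = readings_per_day(SENSORS, INTERVAL_MINUTES)
low = implied_bytes_per_reading(5, daily_readings)
high = implied_bytes_per_reading(10, daily_readings)

print(f"{daily_readings:,} readings per day")
print(f"{low / 1e6:.2f} to {high / 1e6:.2f} MB per reading on average")
```

Under those assumptions the fleet produces about 2.9 million messages a day, so the quoted daily volume would correspond to a few megabytes per message on average.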

Which solution did I use previously and why did I switch?

Before Apache Spark, we had a different solution - a traditional system with one server handling everything, more like a data warehouse. We switched to Apache Spark because we needed real-time visibility in our operations.

How was the initial setup?

The initial setup process was challenging. We tried to do it ourselves at first, but we weren't used to distributed computing systems, creating nodes, and distributing data. Later, we engaged consulting groups that specialized in it. This is why there's a specific learning curve—it would be challenging for a company to start alone.

The initial deployment took us about six to eight months. We started with three people involved in the deployment process and later increased to five. From a maintenance point of view, it's pretty smooth now. It's not difficult to maintain and doesn't require much maintenance.

What was our ROI?

The tool has helped us reduce costs that run into billions of dollars yearly. The ROI is very significant for us.

Which other solutions did I evaluate?

We did evaluate other options. We started by looking at open-source Hadoop deployment, thinking we'd bring data into HDFS and do machine learning separately. But that would have been a hassle, so Apache Spark was a better fit.

What other advice do I have?

I rate the overall solution a seven out of ten. I would recommend Apache Spark to other users, but it depends on their use cases. I advise new users to get an expert involved from the start.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Suresh_Srinivasan - PeerSpot reviewer
Co-Founder at FORMCEPT Technologies
Real User
Top 10
Enables us to process data from different data sources
Pros and Cons
  • "We use Spark to process data from different data sources."
  • "In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."

What is our primary use case?

Our primary use case is interactively processing large volumes of data.

What is most valuable?

We use Spark to process data from different data sources. 

What needs improvement?

In data analysis, you need to take real-time data from different data sources. You need to process this in under a second and do the transformation in under a second.

For how long have I used the solution?

I have been using Apache Spark for eight to nine years. 

What do I think about the stability of the solution?

It is a stable solution. The solution is ten out of ten on stability. 

What do I think about the scalability of the solution?

The solution is highly scalable. All of the technical guys use Spark. Our product is used by many people within our customers' company.

How was the initial setup?

The initial setup is straightforward. 

What's my experience with pricing, setup cost, and licensing?

The solution is moderately priced. 

What other advice do I have?

I rate the overall solution a ten out of ten. 

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
CTO at Hammerknife
Real User
Top 5
Provides a valuable implementation of distributed data processing with a simple setup process
Pros and Cons
  • "Apache Spark provides a very high-quality implementation of distributed data processing."
  • "There were some problems related to the product's compatibility with a few Python libraries."

What is our primary use case?

We use the product for real-time data analysis.

What is most valuable?

Apache Spark provides a very high-quality implementation of distributed data processing. I rate it 20 on a scale of one to ten.

What needs improvement?

There were some problems related to the product's compatibility with a few Python libraries. But I suppose they are fixed.

For how long have I used the solution?

We have been using Apache Spark for the last two to three years.

What do I think about the stability of the solution?

I rate the product's stability a ten out of ten.

What do I think about the scalability of the solution?

The product is enormously scalable.

How was the initial setup?

The initial setup process is simple if you are a good professional. You have to select a few parameters and press enter. It is already integrated into the Databricks platform. One person is enough to manage small and medium implementations.

What's my experience with pricing, setup cost, and licensing?

It is an open-source platform. We do not pay for its subscription.

Which other solutions did I evaluate?

We are evaluating a few analytics engineering and DBT solutions. For now, Spark is in the secondary position.

What other advice do I have?

I recommend Apache Spark for batch analytics features.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
reviewer1759647 - PeerSpot reviewer
Information Technology Business Analyst at an aerospace/defense firm with 10,001+ employees
Real User
A highly scalable and affordable tool that can be used to gather information from different systems
Pros and Cons
  • "The product is useful for analytics."
  • "The product could improve the user interface and make it easier for new users."

What is most valuable?

We use it as an ETL tool to gather information from different systems. The product is useful for analytics.

What needs improvement?

The product could improve the user interface and make it easier for new users. It has a steep learning curve.

For how long have I used the solution?

I have been using the product for approximately three to four years. Currently, I am using the latest version.

What do I think about the stability of the solution?

The tool is stable. I rate the stability a ten out of ten.

What do I think about the scalability of the solution?

The tool is very scalable. I rate the scalability a ten out of ten. Approximately 30 users are using Apache Spark in our organization.

How are customer service and support?

We are using the free version of the product. So, we are not using any support.

How would you rate customer service and support?

Positive

How was the initial setup?

The basic installation is easy. However, we are working in the security business and need a very secure installation. It has been quite difficult. I rate the basic installation a ten out of ten. I rate the ease of setup a two or three out of ten for a more secure installation with all the security features. The solution is deployed on-premises in our organization. The deployment process requires a couple of weeks.

What's my experience with pricing, setup cost, and licensing?

We are using the free version of the solution.

What other advice do I have?

I would recommend the product. I think it's a good solution for analytics. Overall, I rate the product an eight out of ten.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
reviewer2150616 - PeerSpot reviewer
Lead Data Scientist at a transportation company with 51-200 employees
Real User
Top 5
Offers user-friendliness, clarity and flexibility
Pros and Cons
  • "The product's initial setup phase was easy."
  • "From my perspective, the only thing that needs improvement is the interface, as it was not easily understandable."

What needs improvement?

The only issue I faced with the tool was that I had to choose the compute device myself to support parallel processing, when it should be more a matter of scaling horizontally. The tool should be more scalable, not in terms of adding CPU to one machine, but in terms of adding units. If two units are not enough, a third or fourth unit should be able to come into the picture.

From my perspective, the only thing that needs improvement is the interface, as it is not easy to understand. Sometimes I get an error that says only that it is RDD-related, and it is difficult to work out where things went wrong. When I work with datasets using the Pandas library in Python, I can apply a function to each column and get a transformed column back. Doing the same thing in Apache Spark works, but it is not straightforward: I have to approach it a little differently, and even then it sometimes throws an error saying that it is looping back to the same thing. I never got that kind of error in Pandas.

In future updates, the tool should be made more user-friendly. I want to take fifty parallel processes rather than one, and I want to pick some particular columns to be split by partition, so if the tool is user-friendly and offers clarity and flexibility, then that will be good.
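
The wish to split work by chosen columns can be pictured with a toy, pure-Python illustration of hash partitioning. This is not Spark's implementation: the `partition_by` helper, the `unit_id` column, and the use of CRC32 as the hash are all assumptions made for the example. In PySpark, a request like this is normally expressed with `DataFrame.repartition`, which accepts a partition count and one or more columns.

```python
import zlib
from collections import defaultdict


def partition_by(rows, column, num_partitions):
    """Assign each row to a partition from a stable hash of one column.

    Hypothetical helper for illustration; rows with the same value in
    `column` always land in the same partition.
    """
    partitions = defaultdict(list)
    for row in rows:
        key = str(row[column]).encode("utf-8")
        index = zlib.crc32(key) % num_partitions
        partitions[index].append(row)
    return dict(partitions)


# Made-up sample data: three readings for each of four units.
rows = [
    {"unit_id": f"unit-{u}", "reading": u * 10 + r}
    for u in range(4)
    for r in range(3)
]

partitions = partition_by(rows, "unit_id", num_partitions=50)

for index, members in sorted(partitions.items()):
    units = sorted({row["unit_id"] for row in members})
    print(f"partition {index}: {len(members)} rows from {units}")
```

Because the partition index depends only on the chosen column, every row for a given unit ends up together, which is what makes per-unit processing possible without moving data again.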

For how long have I used the solution?

I have been using Apache Spark for four years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution a nine out of ten. The only issues with the tool revolve around user interaction and user flexibility.

What do I think about the scalability of the solution?

It is a scalable solution. Scalability-wise, I rate the solution an eight out of ten.

Around five people in my company use the tool.

How are customer service and support?

The solution's technical support is helpful, but most of the problems I faced were generic issues. If I face a non-generic issue, I get help from the tool's team. For generic issues, I get answers mainly from forums where the problem has already been resolved. When it comes to an unknown problem or one specific to my work, support takes time. I rate the technical support a seven out of ten.

How would you rate customer service and support?

Neutral

Which solution did I use previously and why did I switch?

I only work with Apache Spark.

How was the initial setup?

The product's initial setup phase was easy.

I managed the product's installation phase, both locally and on the cloud.

The solution is deployed on-premises.

The solution can be deployed in two to three hours.

What was our ROI?

Apache Spark has helped save 50 percent of the operational costs. Time was reduced with the use of the tool, but the computing part increased. Overall, I can see that the tool's use has led to a 50 percent reduction in costs.

What's my experience with pricing, setup cost, and licensing?

I did not pay anything when using the tool on cloud services, but I had to pay on the compute side. The tool is not expensive compared with the benefits it offers. I rate the price as an eight out of ten.

Which other solutions did I evaluate?

Previously, I was more of a Python full-stack developer, and I was happy dealing with PySpark libraries, which gave me an edge in continuing the work with Apache.

What other advice do I have?

As for Apache Spark's use in our company's data processing workflows: without Spark, working with a data frame holding one year of data used to take me 45 minutes to an hour, and I sometimes got out-of-memory errors. Those issues went away the moment I started using Apache Spark. I was able to get the whole job done in less than five minutes, and there were no memory issues.

For big data processing, the tool's parallel processing and the time it saves have been helpful. When I want to apply a function, I can write the code once. Basically, I used Apache Spark to forecast multiple units at the same time. Without Apache Spark, I would be doing that one by one, a serial process that used to take me around five hours. At the moment, we use Apache Spark for parallel processing, where the computations happen in parallel, and the time for all these computations is cut down by at least 90 percent. It significantly reduces the time needed for operations.
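
The move from forecasting units one by one to forecasting them together can be sketched without Spark. The example below uses a standard-library thread pool as a stand-in for a cluster; the `forecast_unit` function, the naive moving-average model, and the sample history are hypothetical, and on Spark the same fan-out would run across executors rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def forecast_unit(item):
    """Hypothetical per-unit forecast: the mean of the last three values."""
    unit_id, history = item
    return unit_id, mean(history[-3:])


# Made-up history for a handful of units.
histories = {
    "unit-a": [10, 12, 11, 13],
    "unit-b": [5, 5, 6, 7],
    "unit-c": [20, 18, 19, 21],
}

# Serial version: one unit after another.
serial = dict(forecast_unit(item) for item in histories.items())

# Parallel version: the same function mapped over all units at once.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = dict(pool.map(forecast_unit, histories.items()))

print(parallel)
```

The per-unit function stays the same in both versions; only the way it is scheduled changes. The speedups reported in the review depend on the workload and the cluster and are not reproduced by this toy example.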

The tool's real-time processing is an area that I have not tried to use much. When it comes to real-time processing of my data, I use Kafka.

I am handling data governance using Databricks Unity Catalog.

When I try to apply an ML model, I am unable to run it on a table partitioned by a particular column; I have to get the job done with a reduced number of partitions. If I go with five partitions, I get at least three to four times the benefit in less time.

Regular maintenance exists, but it is not like I have to sit week by week and upgrade a patch or something like that. The maintenance is done mostly in about six months to a year.

I take care of the tool's maintenance.

I recommend the tool to others.

I rate the tool an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Oscar Estorach - PeerSpot reviewer
Chief Data-strategist and Director at Theworkshop.es
Real User
Top 10
Scalable, open-source, and great for transforming data
Pros and Cons
  • "The solution has been very stable."
  • "It's not easy to install."

What is our primary use case?

You can do a lot of things in terms of the transformation of data. You can store and transform and stream data. It's very useful and has many use cases.

What is most valuable?

Overall, it's a very nice tool.

It is great for transforming data and doing micro-streaming or micro-batching.
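
Micro-batching means collecting incoming records into small groups and processing each group as a tiny batch job. The sketch below shows the idea in plain Python, grouping by record count for simplicity. The `micro_batches` helper and the sample events are hypothetical, and Spark's streaming engines typically form batches on a trigger or time interval rather than a fixed count.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to `batch_size` records from any iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Made-up event stream: ten numbered events.
events = ({"event_id": i, "value": i * i} for i in range(10))

totals = []
for batch in micro_batches(events, batch_size=4):
    # Each batch is processed as one small job; here we just sum a field.
    totals.append(sum(event["value"] for event in batch))

print(totals)
```

Each pass through the loop handles one small batch, so results appear shortly after the records arrive instead of after the whole stream has been collected.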

The product offers an open-source version.

The solution has been very stable.

The scalability is good.

Apache Spark is a huge tool. It has many use cases and is very flexible. You can use it with so many other platforms. 

Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java.

What needs improvement?

If you are developing projects and need to put them into production, you might need a cluster of servers, since it requires distributed computing.

It's not easy to install. You are typically dealing with a big data system.

It's not a simple, straightforward architecture. 

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The stability is very good. There are no bugs or glitches and it doesn't crash or freeze. It's a reliable solution. 

What do I think about the scalability of the solution?

We have found the scalability to be good. If your company needs to expand it, it can do so.

We have five people working on the solution currently.

How are customer service and technical support?

There isn't really technical support for open source. You need to do your own studying. There are lots of places to find information. You can find details online, or in books, et cetera. There are even courses you can take that can help you understand Spark.

Which solution did I use previously and why did I switch?

I also use Databricks, which I use in the cloud.

How was the initial setup?

When handling big data systems, the installation is a bit difficult. When you need to deploy the systems, it's better to use services like Databricks.

I am not a professional admin. I am a developer, and I design architecture.

You can use it on a standalone system; however, it's not the best way. It would be okay for small pieces of code, not for production.

What's my experience with pricing, setup cost, and licensing?

We use the open-source version. It is free to use. However, you do need to have servers. We have three or four. They can be on-premises or in the cloud.

What other advice do I have?

I have the solution installed on my computer and on our servers. You can use it on-premises or as a SaaS.

I'd rate the solution at a nine out of ten. I've been very pleased with its capabilities. 

I would recommend the solution for the people who need to deploy projects with streaming. If you have many different sources or different types of data, and you need to put everything in the same place - like a data lake - Spark, at this moment, has the right tools. It's an important solution for data science, for data detectors. You can put all of the information in one place with Spark.

Which deployment model are you using for this solution?

On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
Armando Becerril - PeerSpot reviewer
Partner / Head of Data & Analytics at Intelligence Software Consulting
Real User
Top 5
Great for machine learning applications; good documentation available
Pros and Cons
  • "Provides a lot of good documentation compared to other solutions."
  • "The migration of data between different versions could be improved."

What is our primary use case?

We use Spark for machine learning applications, clustering, and segmentation of customers.

What is most valuable?

Apache provides a lot of good documentation compared to other solutions. 

What needs improvement?

The migration of data between different versions could be improved. 

For how long have I used the solution?

I've been using this solution for four years. 

What do I think about the stability of the solution?

The solution is stable. 

What do I think about the scalability of the solution?

The solution is scalable. 

How are customer service and support?

If you pay for customer support then you get a quick and efficient response, otherwise the community support offers good help. 

How was the initial setup?

The initial setup has been simplified over the past few years and is now relatively straightforward. 

What's my experience with pricing, setup cost, and licensing?

Licensing costs depend on where you source the solution. 

What other advice do I have?

This is a good solution for big data use cases and I rate it eight out of 10. 

Disclosure: I am a real user, and this review is based on my own experience and opinions.
PeerSpot user
PLC Programmer at Alzero
Real User
Top 20
Highly-recommended robust solution for data processing
Pros and Cons
  • "I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten."
  • "The solution’s integration with other platforms should be improved."

What is our primary use case?

We are a software solutions company that serves a variety of industries, including banking, insurance, and industrial sectors. The product is specifically employed for managing data platforms for our customers.


What is most valuable?

The solution, as a package, excels across the board. I appreciate everything, not just one or two specific features.


What needs improvement?

The solution’s integration with other platforms should be improved.


For how long have I used the solution?

I have been using the solution for the past eight years. Currently, I’m using the latest version of the solution.


What do I think about the stability of the solution?

The solution is highly stable. I rate it a perfect ten.


What do I think about the scalability of the solution?

The solution is highly scalable. I rate it a perfect ten.


How was the initial setup?

The initial setup was straightforward and was conducted on the cloud. The entire deployment process took just 15 minutes. The deployment process involves provisioning the compute part of the tool using Terraform.


What's my experience with pricing, setup cost, and licensing?

The solution is affordable and there are no additional licensing costs.


What other advice do I have?

I recommend using the solution. Overall, I rate the solution a perfect ten.


Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
PeerSpot user