I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.
Director of Engineering at Sigmoid
Easy to code, fast, open-source, very scalable, and great for big data
Pros and Cons
- "Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
- "Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."
What is our primary use case?
How has it helped my organization?
Spark has been at the forefront of data processing engines. I have used Apache Spark for multiple projects for different clients. It is an excellent tool for processing massive amounts of data.
What is most valuable?
Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica.
Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark.
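As a minimal sketch of the point above, the same aggregation can be written once as SQL text and once with the DataFrame API. The table name `orders`, the columns `customer_id` and `amount`, and the helper names are hypothetical; `spark` is assumed to be an existing SparkSession and `orders_df` an existing DataFrame.

```python
# Hypothetical query: total order amount per customer.
TOTALS_SQL = """
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
"""


def totals_with_sql(spark, orders_df):
    """Register the DataFrame as a temporary view and run plain SQL against it."""
    orders_df.createOrReplaceTempView("orders")
    return spark.sql(TOTALS_SQL)


def totals_with_api(orders_df):
    """Express the same aggregation with DataFrame API calls."""
    return (
        orders_df.groupBy("customer_id")
        .agg({"amount": "sum"})
        # The dict form of agg names the result column "sum(amount)".
        .withColumnRenamed("sum(amount)", "total")
    )
```

Both functions are intended to return a DataFrame with the same two columns, so a team can choose whichever style its members already know.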
What needs improvement?
Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.
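For readers who have not set up the history server, the separate step described above usually comes down to a few properties plus a start script. The sketch below renders those properties in `spark-defaults.conf` form; the log directory is a hypothetical path and the helper names are illustrative, not part of Spark.

```python
def history_server_conf(log_dir):
    """Properties that make applications write event logs the history server can read."""
    return {
        "spark.eventLog.enabled": "true",         # applications record their events
        "spark.eventLog.dir": log_dir,            # where applications write event logs
        "spark.history.fs.logDirectory": log_dir, # where the history server reads them
    }


def render_spark_defaults(conf):
    """Format properties as 'key value' lines, the layout used by spark-defaults.conf."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# "hdfs:///spark-logs" is an assumed location; use a directory all nodes can reach.
print(render_spark_defaults(history_server_conf("hdfs:///spark-logs")))
```

With the properties in place, the server is started separately with the `sbin/start-history-server.sh` script from the Spark distribution, which is the extra step the review would like to see folded into the main setup.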
Buyer's Guide
Apache Spark
December 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: December 2024.
824,067 professionals have used our research since 2012.
For how long have I used the solution?
I have been using this solution for around 7 years.
What do I think about the stability of the solution?
There were bugs three to four years ago, which have been resolved. There were a couple of issues related to slowness when we did a lot of transformations using withColumn. I was writing a POC on ETL for moving from Informatica to Spark SQL for the ETL pipeline. It required hundreds of withColumn calls to change column names or add transformations, which made it slow. It happened in versions prior to 1.6, and the issue seems to have been fixed in later releases.
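One common way to avoid this kind of slowdown, assuming the column operations described above are chained `withColumn` calls, is to express all renames in a single projection. The sketch below is an illustration, not the reviewer's code; the helper names and column names are hypothetical, and `df` is assumed to be an existing Spark DataFrame.

```python
def rename_expressions(renames):
    """Build one SQL expression per renamed column, e.g. `cust_nm` AS `customer_name`."""
    return [f"`{old}` AS `{new}`" for old, new in renames.items()]


def rename_all(df, renames):
    """Apply every rename in one selectExpr call instead of one call per column."""
    # Columns that are not renamed pass through unchanged (listed first).
    untouched = [f"`{name}`" for name in df.columns if name not in renames]
    return df.selectExpr(*untouched, *rename_expressions(renames))
```

A call such as `rename_all(df, {"cust_nm": "customer_name"})` adds a single projection to the query plan, whereas each chained `withColumn` call adds its own.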
What do I think about the scalability of the solution?
It is very scalable. You can scale it a lot.
How are customer service and support?
I haven't contacted them.
How was the initial setup?
The initial setup was a little complex when I was using open-source Spark. I was doing a POC in the on-premise environment, and the initial setup was a little cumbersome. It required a lot of set up on Unix systems. We also had to do a lot of configurations and install a lot of things.
After I moved to the Cloudera CDH version, it was a little easier. It is a bundled product, so you just install whatever you want and use it.
What's my experience with pricing, setup cost, and licensing?
Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera.
What other advice do I have?
I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy.
I would rate Apache Spark an eight out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Senior Test Automation Consultant / Architect at a tech services company with 11-50 employees
Useful for big data and scientific purposes, but needs better query handling, stability, and scalability
Pros and Cons
- "It is useful for handling large amounts of data. It is very useful for scientific purposes."
- "We are building our own queries on Spark, and it can be improved in terms of query handling."
What is our primary use case?
We are using it for big data. We are using a small part of it, which is related to using data.
What is most valuable?
It is useful for handling large amounts of data. It is very useful for scientific purposes.
What needs improvement?
There are some difficulties that we are working on. It is useful for scientific purposes, but for commercial use of big data, it gives some trouble.
They should improve the stability of the product. We use Spark Executors and Spark Drivers to link to our own environment, and they are not the most stable products. Its scalability is also an issue.
We are building our own queries on Spark, and it can be improved in terms of query handling.
For how long have I used the solution?
In my company, it has been used for several years, but I have been using it for seven months.
What do I think about the scalability of the solution?
It is not scalable. Scalability is one of the issues.
How are customer service and support?
It is open source from my point of view. So, there is no support.
What other advice do I have?
I would advise not using it if you don't have experienced users inside your organization. If you have to figure it all out on your own, then you shouldn't start with it.
Overall, I would rate it a six out of 10. For a commercial use case, it is a six out of 10. For scientific purposes, it is an eight out of 10.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Principal Architect at a financial services firm with 1,001-5,000 employees
Fast performance and has an easy initial setup
Pros and Cons
- "I found the solution stable. We haven't had any problems with it."
- "It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster."
What is our primary use case?
We use the solution for analytics.
How has it helped my organization?
I'm not sure how it has improved my organization but I believe that it's a good product.
What is most valuable?
The fast performance is the most valuable aspect of the solution.
What needs improvement?
The search could be improved. Usually, we use other tools to search for specific things. We use it the way I use other tools, to get the details, but if there were a way to search for little things, that would be better.
It needs a new interface and a better way to get some data. In terms of writing our scripts, some processes could be faster.
In the next release, if they can add more analytics, that would be useful. For example, for data, built data, if there was one port where you put the high one then you can pull any other close to you, and then maybe a log for the right script.
For how long have I used the solution?
I've been using the solution for two years.
What do I think about the stability of the solution?
I found the solution stable. We haven't had any problems with it.
How are customer service and technical support?
Usually, we can fix any issues. If we have problems, we google a little bit to find the issue.
Which solution did I use previously and why did I switch?
I was using some other systems and we moved to Spark later. We faced performance and other issues with the other solution.
How was the initial setup?
The initial setup was easy. We keep on getting data from different sources so we will keep on porting in little bits. It's not done in a single sitting, so I can't really say how long it takes.
What other advice do I have?
I would recommend the solution. I would rate it an eight or nine out of 10.
For some areas, I would give it ten but I cannot use some parts. If you are going to use it for a consumer then I would be able to recommend it and you should go ahead. It doesn't work for me as I have different clients and different engagements.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Manager - Data Science Competency at a tech services company with 201-500 employees
Fast-performance, cost-effective, and runs in a cloud-agnostic environment
Pros and Cons
- "One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them."
- "When you are working with large, complex tasks, the garbage collection process is slow and affects performance."
What is our primary use case?
My main task is working on predictive analytics, and Apache Spark is one of the tools that I utilize in this role. Primarily, we work with the predictive analysis of very large amounts of data.
Apache Spark is also helpful for data pre-processing, including data cleaning.
This solution is cloud-agnostic. You can use it with an EC2 instance and you can even install it on-premises. Some environments have it installed in VMs.
What is most valuable?
One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them.
Another feature is memory-based computing. This is unlike Hadoop, which relies on storage. As it uses in-memory data processing, Spark is very fast.
What needs improvement?
When you are working with large, complex tasks, the garbage collection process is slow and affects performance. This is an area where they need to improve because your job may fail if it is stuck for a long time while memory garbage collection is happening. This is the main problem that we have.
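Garbage collection pressure of this kind is often approached through executor memory and JVM settings. The sketch below shows one possible starting point, not a recommended configuration: the memory size is an assumed value, the helper names are hypothetical, and `builder` is assumed to be `SparkSession.builder`.

```python
def gc_tuning_conf(executor_memory="8g"):
    """Settings that switch executors to the G1 collector and log GC activity."""
    return {
        "spark.executor.memory": executor_memory,  # assumed size; depends on the cluster
        "spark.executor.extraJavaOptions": "-XX:+UseG1GC -verbose:gc",
    }


def apply_conf(builder, conf):
    """Pass each setting to a SparkSession builder before the session is created."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

The GC log lines that `-verbose:gc` writes to the executor output help show whether long pauses line up with the tasks that stall or fail.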
For how long have I used the solution?
I have been working with Apache Spark for the past four years.
What do I think about the stability of the solution?
This product is pretty stable. Companies like Facebook, Uber, and Netflix are all using Apache Spark. It's stable enough to be used all over the world.
What do I think about the scalability of the solution?
In our team that works on this, we have approximately 10 people.
How are customer service and support?
There is no official support for this solution. Because it's open-source and there is no cost involved, there is nobody to contact for support. Our own internal team of experts, which work on different problems, both support and contribute to the platform.
Which solution did I use previously and why did I switch?
I work on several open-source frameworks including Python, Scikit-learn, TensorFlow, PyTorch, H2O.ai, and R. We don't endorse proprietary tools so we aren't working with them.
How was the initial setup?
With respect to the initial setup, it's neither easy nor very difficult. Our team has experience so it is not difficult for them. However, for a person that is new to using it, the setup might be very difficult.
What about the implementation team?
We have a team of experts in my company, and they handle it very well.
What's my experience with pricing, setup cost, and licensing?
This is an open-source tool, so it can be used free of charge. There is no cost involved.
What other advice do I have?
We are not using the current version of this platform, Spark 3. However, we do know that it is used in the market and it has new features. We will eventually move to it.
My advice for anybody who wants to use Apache Spark is that they have two options. The first is to go with Databricks, the company founded by the creators of Apache Spark, and use their proprietary version. If you choose this option then you will have to pay for the product.
If instead, you use Apache Spark, then you can rely on your own expert in-house team for support, maintenance, and deployment. In this option, you don't have to pay anything to anybody outside of your company.
I would rate this solution an eight out of ten.
Which deployment model are you using for this solution?
Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Big Data Consultant at a tech services company with 501-1,000 employees
We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.
Pros and Cons
- "The good performance. The nice graphical management console. The long list of ML algorithms."
- "Apache Spark provides very good performance. The tuning phase is still tricky."
What is most valuable?
The good performance. The nice graphical management console. The long list of ML algorithms.
How has it helped my organization?
We are able to solve problems, e.g., reporting on big data, that we were not able to tackle in the past.
What needs improvement?
Apache Spark provides very good performance. The tuning phase is still tricky.
For how long have I used the solution?
I've used it for 2 years.
What was my experience with deployment of the solution?
We didn't have an issue with the deployment.
What do I think about the stability of the solution?
In the past we deployed Spark 1.3 to use Spark SQL, but unfortunately one of our queries failed because of a bug that was fixed in later releases. Then we moved to Spark 1.6, but some queries were still failing when run against huge datasets. Now we are using version 2.1: it is more stable, it delivers better performance, and the SQL/ML parts are richer than before.
What do I think about the scalability of the solution?
I've had no issues with the scalability.
How is customer service and technical support?
Customer Service:
I've never had to use customer service.
Technical Support:
I've never had to use technical support.
How was the initial setup?
The initial setup is quite complex because you have to set up many different configuration parameters that are deployment-specific. It is not trivial to arrive at the correct configuration with so many variables involved.
What about the implementation team?
In-house team. The setup itself is not a problem when you just have to test the system. The challenging part is discovering the optimal configuration needed to obtain a production system providing good performance.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Engineer at a tech vendor with 10,001+ employees
Spark provides lots of high-level APIs, which reduces duplication of work.
Valuable Features
Streaming data processing
Improvements to My Organization
In the previous version, we used Storm to handle real-time data; however, its performance didn't meet the requirement. Spark Streaming's micro-batch mode helps improve performance. Also, Spark provides lots of high-level APIs, which reduces duplication of work.
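The micro-batch idea mentioned above can be illustrated in plain Python: instead of handling each record as it arrives, records are grouped by arrival time into small fixed-length batches, and each batch is processed as one unit. This is a conceptual sketch only, not Spark Streaming's API; the function name and the event format are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp_in_seconds, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event whose timestamp falls in the same interval joins the same batch.
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[index] for index in sorted(batches)]


# Three events with one-second batches: the first two share a batch.
print(micro_batches([(0.1, "a"), (0.9, "b"), (1.2, "c")], 1))
# prints [['a', 'b'], ['c']]
```

Processing a whole batch at once spreads scheduling and bookkeeping overhead over many records, which is one reason throughput can improve compared with per-record handling, at the cost of some added latency.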
Room for Improvement
Better monitoring ability. Especially monitoring integration with customer codes.
Use of Solution
I've used it for one year.
Stability Issues
We ran into some standalone deployment issues, which showed that its stability is not that good. So we plan to switch to YARN or Mesos mode.
Customer Service and Technical Support
I have to say it is bad. I can only ask for help in the Google group. However, it is run in a developer-for-developer style. There are almost no people from Databricks. I also use a Cassandra-Spark connector, and DataStax has at least one dedicated person to help the community.
Initial Setup
It is not that straightforward for a standalone deployment; there are some tricks that are not mentioned in the docs.
Implementation Team
We did it in-house.
Pricing, Setup Cost and Licensing
So far, we have no plans to switch to a commercial license.
Other Advice
I love Spark over other solutions.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Big Data and Cloud Solution Consultant at a financial services firm with 10,001+ employees
Provides flexibility for application creation with less coding effort
Pros and Cons
- "DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort."
- "Dynamic DataFrame options are not yet available."
What is most valuable?
DataFrame: Spark SQL gives the leverage to create applications more easily and with less coding effort.
How has it helped my organization?
We developed a tool for data ingestion from HDFS->Raw->L1 layer with data quality checks, putting data into Elasticsearch, and performing CDC.
What needs improvement?
Dynamic DataFrame options are not yet available.
For how long have I used the solution?
One and a half years.
What do I think about the stability of the solution?
No.
What do I think about the scalability of the solution?
No.
What other advice do I have?
Spark gives the flexibility for developing custom applications.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Lead Big Data Engineer at a non-profit with 51-200 employees
I use it to process large amounts of data in the energy industry.
What is most valuable?
Spark is relatively easy to deploy, with rich features in handling big data. Spark Core, Spark SQL, Spark MLlib are used mostly in our applications.
How has it helped my organization?
I use Spark to process large amounts of data in the energy industry.
What needs improvement?
It needs a good tool to analyse Spark application performance. Right now there are still many parameters to tune in order to get good performance from a Spark application. I would like to see auto-tuning of parameters.
For how long have I used the solution?
I've been using Spark for seven months.
What was my experience with deployment of the solution?
There were no issues with the deployment.
What do I think about the stability of the solution?
I ran into Spark application performance issues. For instance, Spark JDBC write performance needs to be improved.
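The review does not say which settings were involved, but two options that are commonly adjusted for JDBC writes are `batchsize` (rows sent per insert round trip) and `numPartitions` (the cap on parallel connections). The sketch below shows how they might be passed; the URL, table name, values, and helper names are all hypothetical, and `df` is assumed to be an existing Spark DataFrame.

```python
def jdbc_write_options(url, table, batch_size=10000, num_partitions=8):
    """Collect JDBC writer options; the default values here are illustrative only."""
    return {
        "url": url,
        "dbtable": table,
        "batchsize": str(batch_size),          # rows per batched insert
        "numPartitions": str(num_partitions),  # upper bound on parallel connections
    }


def write_over_jdbc(df, options):
    """Append the DataFrame to the target table using the given options."""
    df.write.format("jdbc").options(**options).mode("append").save()
```

For example, `write_over_jdbc(df, jdbc_write_options("jdbc:postgresql://db.example.com/energy", "readings"))` would append to a hypothetical `readings` table. Suitable values depend on what the target database can absorb.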
What do I think about the scalability of the solution?
There were no issues with the scalability.
How are customer service and technical support?
Customer Service:
I use Apache open source. Everything is on our own.
Technical Support:
I use Apache open source. Everything is on our own.
Which solution did I use previously and why did I switch?
I evaluated a Hadoop-based solution and chose Spark for its fast processing and ease of use.
How was the initial setup?
The initial setup is not complex. The online documents are pretty good.
What about the implementation team?
I implemented it in-house.
What other advice do I have?
Get to know how Spark works and what a job, stage, task, DAG, etc., are. It will help you write Spark applications.
Disclosure: I am a real user, and this review is based on my own experience and opinions.