We primarily use the solution to integrate very large data sets from another environment, such as our SQL environment, and draw purposeful data before checking it. We also use the solution for streaming from very large servers.
Director of BigData Offer at IVIDATA
Stable, fast, and easy to use
Pros and Cons
- "The solution is very stable."
- "The solution needs to optimize shuffling between workers."
What is our primary use case?
What is most valuable?
It is a very fast solution. It's very easy to use. There are many APIs for many languages like Scala, Java, R, and Python. The greatest advantage of Spark is that we can run many kinds of analytics, including SQL analytics, graph analytics, etc.
What needs improvement?
The solution needs to optimize shuffling between workers.
For how long have I used the solution?
I've been using the solution for four or five years.
Buyer's Guide
Apache Spark
November 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: November 2024.
816,406 professionals have used our research since 2012.
What do I think about the stability of the solution?
The solution is very stable.
What do I think about the scalability of the solution?
The solution is scalable. My understanding is that version 3.0 has improved scaling capabilities and will be able to scale automatically.
How are customer service and support?
Apache Spark is an open-source platform, so there is no technical support.
What other advice do I have?
We use both on-premises and public and private cloud deployment models. We're partners with Databricks.
I'm a consultant. Our company works for large enterprises such as banks and energy companies. 17 of our workers use Apache Spark.
With the cloud, there are many companies that integrate Spark. Most projects in big data around the world use Spark, indirectly or directly.
I'd rate the solution eight out of ten.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Works at a computer software company with 51-200 employees
Features include machine learning, real-time streaming, and data processing. It doesn't provide Spark job scheduling with monitoring capability.
Pros and Cons
- "Features include machine learning, real time streaming, and data processing."
- "The fault tolerant feature is provided."
- "It provides a scalable machine learning library."
- "It should support more programming languages."
- "Needs to provide an internal schedule to schedule spark jobs with monitoring capability."
What is our primary use case?
Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.
How has it helped my organization?
It provides a scalable machine learning library so that we can train and predict user behavior for promotion purposes.
What is most valuable?
Machine learning, real time streaming, and data processing are fantastic, as well as the resilient or fault tolerant feature.
What needs improvement?
I would suggest that it support more programming languages and provide an internal scheduler to schedule Spark jobs with monitoring capability.
For how long have I used the solution?
Trial/evaluations only.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Sr. Software Engineer at a tech vendor with 1-10 employees
Helped us reduce 3TB Google Ngrams in hours instead of days
Pros and Cons
- "The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics."
- "More ML based algorithms should be added to it, to make it algorithmic-rich for developers."
What is most valuable?
The most valuable feature is the Fault Tolerance and easy binding with other processes like Machine Learning, graph analytics. The community is growing and hence executing ML in a distributed fashion is quite good.
How has it helped my organization?
Previously we were using Hadoop MapReduce to reduce the Google Ngrams (3TB), which took us approximately five days on our cluster. After using Spark, we were able to accomplish this task within hours.
What needs improvement?
This product is already improving as the community is developing it rapidly. More ML based algorithms should be added to it, to make it algorithmic-rich for developers.
For how long have I used the solution?
Two and a half years.
What do I think about the stability of the solution?
No, I did not encounter any problems with the stability. It is also quite backwards compatible.
What do I think about the scalability of the solution?
No, I have not encountered any scalability problems so far; it is quite scalable. Using simple scripts, you can add as many workers as you want.
What other advice do I have?
This is a very good product for big data analytics, and it integrates well with other parts like machine learning and graph analytics.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Data Scientist at a tech vendor with 10,001+ employees
It allows the loading and investigation of very large data sets, has MLlib for machine learning, Spark Streaming, and both the new and old DataFrame APIs.
What is most valuable?
It allows the loading and investigation of very large data sets, has MLlib for machine learning, Spark Streaming, and both the new and old DataFrame APIs.
How has it helped my organization?
We're able to perform data discovery on large datasets without too much difficulty.
What needs improvement?
It needs better documentation as well as examples for all the Spark libraries. That would be very helpful in maximizing its capabilities and results.
For how long have I used the solution?
I've used it for over nine months now.
What was my experience with deployment of the solution?
I haven't encountered any issues with deployment.
What do I think about the stability of the solution?
There have been no stability issues.
What do I think about the scalability of the solution?
I haven't had any scalability issues. It scales better than Python and R.
How are customer service and technical support?
Customer Service:
I haven't had to use customer service.
Technical Support:
I haven't had to use technical support.
Which solution did I use previously and why did I switch?
I previously used Python and R, but neither of these scaled particularly well.
How was the initial setup?
The initial setup was complex. It was not easy getting the correct version and dependencies set up.
What about the implementation team?
I implemented it in-house on my own!
What was our ROI?
It's open-source, so ROI is inapplicable.
What other advice do I have?
Learn Scala as this will greatly reduce the pain in starting off with Spark.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Software Developer (Product Engineering) at a computer software company with 501-1,000 employees
We have been using Spark to do a lot of batch and stream processing of inbound data from Apache Kafka. Scaling Spark on YARN is still an issue but we are getting acceptable performance.
Valuable Features:
Spark Streaming, Spark SQL, and MLlib, in that order.
Improvements to My Organization:
We have been using Spark to do a lot of batch and stream processing of inbound data from Apache Kafka. Scaling Spark on YARN is still an issue but we are getting acceptable performance.
Room for Improvement:
As I said, scalability is still an issue, as is stability. Spark on YARN still doesn't seem to have a programmatic submission API, so we have to rely on the spark-submit script to run jobs on YARN. The Scala and Java APIs have performance differences, which sometimes makes it necessary to code in Scala.
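For readers in the same situation, one common workaround is to wrap the spark-submit script in a small helper so jobs can be launched from code. The following is only a minimal sketch in Python, not the reviewer's setup: the helper name `build_spark_submit_command`, the jar path, the class name, and the application arguments are all hypothetical. It relies only on standard spark-submit options (`--master`, `--deploy-mode`, `--class`, `--num-executors`, `--conf`).

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit_command(
    app_jar: str,
    main_class: str,
    app_args: Optional[List[str]] = None,
    num_executors: int = 4,
    conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build (but do not run) a spark-submit argument list for a YARN cluster job."""
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--num-executors", str(num_executors),
    ]
    # Each extra Spark property becomes a "--conf key=value" pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # The application jar comes after all options, followed by its own arguments.
    cmd.append(app_jar)
    cmd += list(app_args or [])
    return cmd


if __name__ == "__main__":
    command = build_spark_submit_command(
        app_jar="/path/to/kafka-ingest.jar",   # hypothetical jar path
        main_class="com.example.KafkaIngest",  # hypothetical main class
        app_args=["--topic", "events"],        # hypothetical application arguments
        conf={"spark.executor.memory": "4g"},
    )
    # Print the command for inspection; to actually launch it, pass the list
    # to subprocess.run(command, check=True) on a machine with Spark installed.
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems when it is later handed to `subprocess.run`.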
Other Advice:
Have Scala developers at hand. Base Java competency will not be enough during optimization rounds.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Chief Technology Officer at a tech services company with 11-50 employees
Helpful support, easy to use, and high availability
Pros and Cons
- "The most valuable feature of Apache Spark is its ease of use."
- "Apache Spark can improve the use case scenarios from the website. There is not any information on how you can use the solution across the relational databases toward multiple databases."
What is our primary use case?
I am using Apache Spark for the data transition from databases. We have customers who have one database as a data lake.
What is most valuable?
The most valuable feature of Apache Spark is its ease of use.
What needs improvement?
Apache Spark could improve the use case scenarios documented on its website. There is no information on how to use the solution across relational databases or with multiple databases.
For how long have I used the solution?
I have been using Apache Spark for approximately 18 months.
What do I think about the stability of the solution?
Apache Spark is stable.
What do I think about the scalability of the solution?
We are using Apache Spark across multiple nodes and it is scalable.
We have approximately five people using this solution.
How are customer service and support?
The technical support from Apache Spark is very good.
What other advice do I have?
I rate Apache Spark an eight out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Snr Security Engineer at a tech vendor with 201-500 employees
Provides security analytics and has good scalability
Pros and Cons
- "The scalability has been the most valuable aspect of the solution."
- "The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive."
What is our primary use case?
We primarily use the solution for security analytics.
What is most valuable?
The scalability has been the most valuable aspect of the solution.
What needs improvement?
The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive.
For how long have I used the solution?
I've been using the solution for three years.
What do I think about the stability of the solution?
The 2.3 version is quite stable. All of our customers use it; there are more than 100,000 users, and it runs 24/7.
What do I think about the scalability of the solution?
The scalability is very good.
How are customer service and technical support?
You actually buy Cloudera along with it. You don't really get any support, except you need support.
Which solution did I use previously and why did I switch?
In previous companies, we used the MySQL platform and solutions like ArcSight and Splunk. We switched for scalability. MySQL wasn't going to scale, and we don't use Splunk at this company.
How was the initial setup?
The initial setup was complex. It is a complex tool, and a lot depends on how you will use it. There is a lot to set up, and it needs a lot of scripts: there are nearly 60 to set up. In the cloud, setup takes about a day. On-premises, on your own hardware, it takes about a week.
What other advice do I have?
I would rate this solution eight out of 10.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Lead Consultant at a tech services company with 51-200 employees
The data storage capacity means we can inject somewhere in the user database in more efficient ways
Pros and Cons
- "The main feature that we find valuable is that it is very fast."
- "We use big data manager but we cannot use it as conditional data so whenever we're trying to fetch the data, it takes a bit of time."
What is most valuable?
The main feature that we find valuable is that it is very fast. In terms of big data, the main feature is that the data is in so many different nodes. It goes through many data nodes so whenever we use the data, it enables us to parse the data from different data nodes.
What needs improvement?
We use big data manager but we cannot use it as conditional data, so whenever we're trying to fetch the data, it takes a bit of time. There is some latency in the system and latency in the data caching. The main issue is that we need to design it in a way that data will be available to us very quickly. It takes a long time, and the latest data should be available to us much quicker.
What do I think about the stability of the solution?
We don't have any problems with stability.
How are customer service and technical support?
I'm not the one who would contact their support if we needed it.
How was the initial setup?
The initial setup is straightforward.
What other advice do I have?
The advice that I would give to someone considering this solution is that the quality of data has key streaming capabilities like velocity. This means how quickly you are going to refer to the data. These things matter by designing the solution. We need to take these things out.
I would rate Apache Spark an eight out of ten.
To make it a ten they should improve the speed. The data storage capacity means we can inject somewhere in the user database in more efficient ways.
Disclosure: I am a real user, and this review is based on my own experience and opinions.