The most important feature of Apache Spark is that it provides large-scale data processing with negligible latency on low-cost commodity hardware. The Spark framework is a blessing compared with Hadoop: the latter does not allow fast processing of data, which Spark achieves through in-memory data processing.
Software Consultant at a tech services company with 10,001+ employees
It provides large-scale data processing with negligible latency on low-cost commodity hardware.
What is most valuable?
How has it helped my organization?
Apache Spark is a framework that lets an organization perform business and data analytics at a very low cost compared with Ab Initio or Informatica. By using Apache Spark in place of those tools, an organization can greatly reduce costs without compromising data security or other data-related concerns, provided it is managed by an expert Scala programmer. Apache Spark also does not carry Hadoop's overhead of high latency. My organization benefits from all of these points.
What needs improvement?
The question of improvement is always on developers' minds. As with most developer tools, a user-friendly GUI with a drag-and-drop feature would make this framework easier to use.
Another point is that there is always room for improvement in memory usage. If Spark could use less memory for processing in the future, that would obviously be better.
What was my experience with deployment of the solution?
We've had no issues with deployment.
Buyer's Guide
Apache Spark
November 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: November 2024.
816,406 professionals have used our research since 2012.
What do I think about the stability of the solution?
See above regarding memory usage.
What do I think about the scalability of the solution?
We've had no issues with scalability.
What other advice do I have?
My advice to others is to use Apache Spark for large-scale data processing, as it provides good performance at low cost, unlike Ab Initio or Informatica. The main problem is that there are currently not many people in the market certified in Apache Spark.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Chief System Architect at a marketing services firm with 501-1,000 employees
Spark gives us the ability to run queries on our MySQL data without putting pressure on the database
What is most valuable?
With Spark SQL we now have the capability to analyse very large quantities of data located in Amazon S3 at very low cost compared with the other solutions we checked.
We also use our own Spark cluster to aggregate data in near real time and save the results to a MySQL database.
We've started new projects using the machine learning library ML.
How has it helped my organization?
Until Spark, we didn't have the ability to analyse this quantity of data; we're talking about two TB per hour. We're now able to produce a lot of reports and to develop machine-learning-based analysis to optimize our business.
We have central access to every piece of data in the company, including finance, business, debug, etc., and the ability to join all this data together.
What needs improvement?
Spark is very good for batch analysis, much better than Hadoop: it's much simpler, much quicker, etc. However, it lacks the ability to perform real-time querying like Vertica or Redshift.
Also, it is more difficult for an end user to work with Spark than with a normal database, even compared with an analytic database like Vertica or Redshift.
For how long have I used the solution?
We've been using Spark Streaming and Spark SQL for almost two years.
What was my experience with deployment of the solution?
We're working on AWS, so we need a managed environment. We chose a solution based on Chef to deploy and configure the Spark clusters. Tip: if you don't have any DevOps resources, you can use the EC2 script (provided with the Spark distribution) to deploy a cluster on Amazon. We've tested it and it works perfectly.
What do I think about the stability of the solution?
Spark Streaming is difficult to stabilize, as you're always dependent on your stream flow. If the consumer starts to fall behind, you have a serious problem. We encountered a lot of stability issues while getting it configured as expected.
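The risk of a consumer falling behind can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Spark code: the function name `simulate_backlog` and the figures are hypothetical, and it only assumes that batches arrive at a fixed interval and each takes a fixed time to process.

```python
def simulate_backlog(batch_interval_s, processing_time_s, num_batches):
    """Return the scheduling delay (in seconds) seen by each batch.

    Illustrative assumption: a batch arrives every `batch_interval_s`
    seconds and takes `processing_time_s` seconds to process, one at a time.
    """
    delays = []
    free_at = 0  # time at which the consumer finishes its current batch
    for i in range(num_batches):
        arrival = i * batch_interval_s
        start = max(arrival, free_at)   # wait if the previous batch is still running
        delays.append(start - arrival)  # how late this batch starts
        free_at = start + processing_time_s
    return delays


# Processing faster than the interval: no batch has to wait.
print(simulate_backlog(10, 8, 5))
# Processing slower than the interval: each batch starts later than the one before.
print(simulate_backlog(10, 12, 5))
```

In the second call the delay keeps growing for as long as data keeps arriving, which is why monitoring processing time against the batch interval matters when sizing a cluster.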
What do I think about the scalability of the solution?
It's linked to stability. In our case, it takes time to evaluate the correct size of the cluster you need. It's very important to always add monitoring to your jobs so you can understand what the problem is. We use Datadog as our monitoring platform.
Which solution did I use previously and why did I switch?
Yes, for this job we used a MySQL database. We switched because MySQL is not a scalable solution and we had reached its limits.
How was the initial setup?
Setting up a Spark cluster can be difficult; it depends on your clustering strategy. There are at least four options:
- EC2 script: works only on Amazon AWS
- Standalone: manual configuration (hard)
- YARN: leverages your existing Hadoop environment
- Mesos: for use alongside your other Mesos-ready applications
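As a rough illustration of how the choice of cluster manager shows up in practice, the sketch below builds the `--master` argument that `spark-submit` expects for the standalone, YARN and Mesos options. The helper names `master_url` and `submit_command`, the host name, the ports and the application file are all hypothetical placeholders. The EC2 script is left out because it provisions a cluster rather than naming a separate master URL scheme.

```python
def master_url(manager, host=None, port=None):
    """Return the --master value for a given cluster manager (illustrative helper)."""
    if manager == "standalone":
        return f"spark://{host}:{port}"
    if manager == "yarn":
        # With YARN, the cluster location is read from the Hadoop configuration.
        return "yarn"
    if manager == "mesos":
        return f"mesos://{host}:{port}"
    raise ValueError(f"unknown cluster manager: {manager}")


def submit_command(manager, app, host=None, port=None):
    """Build a spark-submit argument list; it is only constructed here, not executed."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


# Placeholder host, port and application file, for illustration only.
print(submit_command("standalone", "job.py", "master-host", 7077))
print(submit_command("yarn", "job.py"))
```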
What about the implementation team?
We use Databricks as an online tool for ad hoc database queries. It works on AWS as a managed service and handles cluster creation, configuration and monitoring for you.
It provides a notebook-oriented user interface to query any data source using Spark: databases, Parquet, CSV, Avro, etc.
Which other solutions did I evaluate?
Yes, we started to evaluate analytics databases such as Vertica, Exasol and others. For all of them, the price was an issue given the quantity of data we want to manipulate.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Lecturer at Amirkabir University of Technology
A scalable solution that can grow with the needs of a business, and provides excellent functionality for analytical tasks
Pros and Cons
- "This solution provides a clear and convenient syntax for our analytical tasks."
- "This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed."
What is our primary use case?
We use this solution for its anti-money laundering and direct marketing features within a banking environment.
What is most valuable?
This solution provides a clear and convenient syntax for our analytical tasks.
What needs improvement?
This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed.
There is also limited Python compatibility, which should be improved.
For how long have I used the solution?
We have used this solution for around seven years, through several versions.
What do I think about the stability of the solution?
We have found this solution to be stable during our time using it.
What do I think about the scalability of the solution?
This is a very scalable solution from our experience.
What about the implementation team?
We implemented the solution using our in-house team, but the UI was developed using a third party contractor.
What's my experience with pricing, setup cost, and licensing?
The deployment time of this solution depends on the requirements of the organization and the compatibility of the systems it will be using alongside this solution. We would recommend that these are clearly defined when designing the product for the business's needs.
What other advice do I have?
I would rate this solution a nine out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Big Data Engineer Consultant at Collective[i]
Scala-based solution with good data evaluation functions and distribution
Pros and Cons
- "Spark can handle small to huge data and is suitable for any size of company."
- "Spark could be improved by adding support for other open-source storage layers than Delta Lake."
What is our primary use case?
I mainly use Spark to prepare data for processing because it has APIs for data evaluation.
What is most valuable?
The most valuable feature is that Spark uses Scala, which has good data evaluation functions. Spark also supports good distribution on the clusters and provides optimization on the APIs.
What needs improvement?
Spark could be improved by adding support for other open-source storage layers than Delta Lake. The UI could also be enhanced to give more data on resource management.
For how long have I used the solution?
I've been using Spark for six years.
What do I think about the stability of the solution?
Generally, Spark works correctly without any errors. It may give out some errors if your data changes, but in that case, it's a problem with the configuration, not with Spark.
What do I think about the scalability of the solution?
The cloud version of Spark is very easy to scale.
How was the initial setup?
The initial setup is not complex, but it depends on the other components in the architecture. For example, if you use Hadoop, setup may not be easy. Deployment takes about a week, but the Spark cluster can be installed in the virtual architecture in a day.
What other advice do I have?
Spark can handle small to huge data and is suitable for any size of company. I would rate Spark as eight out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
Director at Nihil Solutions
Stable and easy to set up with a very good memory processing engine
Pros and Cons
- "The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly."
- "The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate."
What is our primary use case?
When we receive data from the messaging queue, we process everything using Apache Spark. Databricks does the processing and sends everything back to the Apache file in the data lake. The machine learning program then does some analysis using the ML prediction algorithm.
What is most valuable?
The memory processing engine is the solution's most valuable aspect. It processes everything extremely fast, and it's in the cluster itself. It acts as a memory engine and is very effective in processing data correctly.
What needs improvement?
There are lots of items coming down the pipeline in the future. I don't know what features are missing. From my point of view, everything looks good.
The graphical user interface (UI) could be a bit more clear. It's very hard to figure out the execution logs and understand how long it takes to send everything. If an execution is lost, it's not so easy to understand why or where it went. I have to manually drill down on the data processes which takes a lot of time. Maybe there could be like a metrics monitor, or maybe the whole log analysis could be improved to make it easier to understand and navigate.
There should be more information shared to the user. The solution already has all the information tracked in the cluster. It just needs to be accessible or searchable.
For how long have I used the solution?
I started using the solution about four years ago. However, it's been on and off since then. I would estimate in total I have about a year and a half of experience using the solution.
What do I think about the stability of the solution?
The stability of the solution is very, very good. It doesn't crash or have glitches. It's quite reliable for us.
What do I think about the scalability of the solution?
The scalability of the solution is very good. If a company has to expand it, they can do so.
Right now, we have about six or seven users that are directly on the product. We're encouraging them to use more data. We do plan to increase usage in the future.
How are customer service and technical support?
I'm a developer, so I don't interact directly with technical support. I can't speak to the quality of their service as I've never directly dealt with them.
Which solution did I use previously and why did I switch?
We did previously use a lot of different mechanisms, however, we needed something that was good at processing data for analytical purposes, and this solution fit the bill. It's a very powerful tool. I haven't seen other tools that could do precisely what this one does.
How was the initial setup?
The initial setup isn't too complex. It's quite straightforward.
We use CI/CD DevOps for deployment. We only use Spark for processing, with the Databricks cluster spinning up to do the job. It's continuously running in the background.
There isn't really any maintenance required per se. We just click the button and it comes up automatically, with the whole cluster and the Spark and everything ready to go.
What's my experience with pricing, setup cost, and licensing?
I'm unsure as to how much the licensing is for the solution. It's not an aspect of the product I deal with directly.
What other advice do I have?
We're customers and also partners with Apache.
While we are on version 2.6, we are considering upgrading to version 3.0.
I'd rate the solution nine out of ten. It works very well for us and suits our purposes almost perfectly.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
Technical Consultant at a tech services company with 1-10 employees
Good streaming features enable data entry and analysis within Spark Streaming
Pros and Cons
- "I feel the streaming is its best feature."
- "When you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources."
What is our primary use case?
We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.
What is most valuable?
I have worked with Hadoop a lot in my career, and you need to do a lot of things to get it to Hello World. In Spark it is easy. You could say it's an umbrella that brings everything together in one place. It also has Spark Streaming. I feel the streaming is its best feature because I have been able to extract and enter data and run analysis within Spark Streaming.
What needs improvement?
I think for IT people it is good. The whole idea is that Spark works pretty easily, but a lot of people, including me, struggle to set things up properly. I like contributions, and if you want to connect Spark with Hadoop it's not a big thing, but for other things, such as using Sqoop with Spark, you need to do the configuration by hand. I wish there were a solution that does all these configurations, like in Windows, where you have the whole solution and it handles the back end. I think that kind of solution would help. Still, it can do everything for a data scientist.
Spark's main objective is to manipulate and calculate. It is playing with the data. So it has to keep doing what it does best and let the visualization tool do what it does best.
Overall, it offers everything that I can imagine right now.
For how long have I used the solution?
I have been using Apache Spark for a couple of months.
What do I think about the stability of the solution?
In terms of stability, I have not seen any bugs, glitches or crashes. Even if there is, that's fine, because I would probably take care of it and then I'd have progressed further in the process.
What do I think about the scalability of the solution?
I have not tested the scalability yet.
In my company, there are two or three people that are using it for different products. But right now, the client I'm engaged with doesn't know anything about Spark or Hadoop. They are a typical financial company so they do what they do, and they ask us to do everything. They have pretty much outsourced their whole big data initiative to us.
Which solution did I use previously and why did I switch?
I have used MapReduce from Hadoop previously. Otherwise, I haven't used any other big data infrastructure.
In my previous work, not at this company, I was working with some big data, but I was extracting it using a single core on my PC. I realized over time that my system had eight cores, so I used all of those cores for multi-core programming instead. Then I realized that Hadoop and Spark do the same thing, but across different PCs. That was when I used multi-core programming, and that's the point: Spark needs to go and search Hadoop and other things.
How was the initial setup?
The initial setup to get it to Hello World is pretty easy, you just have to install it. But when you want to extract data from your HDFS and other sources then it is kind of tricky because you have to connect with those sources. But you can get a lot of help from different sources on the internet. So it's great. A lot of people are doing it.
I work with a startup company. You know that in startups you do not have the luxury of different people doing different things, you have to do everything on your own, and it's an opportunity to learn everything. In a typical corporate or big organization you only have restricted SOPs, you have to work within the boundaries. In my organization, I have to set up all the things, configure it, and work on it myself.
What's my experience with pricing, setup cost, and licensing?
I would suggest not to try to do everything at once. Identify the area where you want to solve the problem, start small and expand it incrementally, slowly expand your vision. For example, if I have a problem where I need to do streaming, just focus on the streaming and not on the machine learning that Spark offers. It offers a lot of things but you need to focus on one thing so that you can learn. That is what I have learned from the little experience I have with Spark. You need to focus on your objective and let the tools help you rather than the tools drive the work. That is my advice.
What other advice do I have?
On a scale of 1 to 10, I'd put it at an eight.
To make it a perfect 10, I'd like to see an improved configuration bot. Sometimes it is a nightmare on Linux trying to figure out what happened with the configuration and the back end. So I think installation and configuration with some other tools could be improved. We are technical people and could figure it out, but if aspects like that were improved, less technical people would use it and it would be more adaptable to the end user.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Manager | Data Science Enthusiast | Management Consultant at a consultancy with 5,001-10,000 employees
We can now harness richer data sets and benefit from use cases
Pros and Cons
- "With Hadoop-related technologies, we can distribute the workload with multiple commodity hardware."
- "Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing."
How has it helped my organization?
Organisations can now harness richer data sets and benefit from use cases, which add value to their business functions.
What is most valuable?
Distributed in-memory processing. Some of the algorithms are resource-heavy, and executing them requires a lot of RAM and CPU. With Hadoop-related technologies, we can distribute the workload across multiple commodity hardware nodes.
What needs improvement?
Include more machine learning algorithms and the ability to handle streaming of data versus micro batch processing.
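The difference between micro-batch processing and record-at-a-time streaming can be sketched in a few lines. The example below is plain Python, not Spark code, and the function name `micro_batches` and the sample events are hypothetical. It groups timestamped events into fixed-length intervals, which is the basic idea behind micro-batching: an event is only handled once the interval it falls into has closed, whereas a record-at-a-time engine would handle each event as it arrives.

```python
from collections import defaultdict


def micro_batches(events, interval_s):
    """Group (timestamp_s, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for ts, value in events:
        # Events whose timestamps fall in the same interval share a batch.
        batches[int(ts // interval_s)].append(value)
    return [batches[k] for k in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
print(micro_batches(events, 1.0))
```

With a one-second interval, event "a" arriving at 0.2 s waits until the first interval ends before it is processed, which is the latency cost the reviewer is pointing at.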
For how long have I used the solution?
Three to five years.
What do I think about the stability of the solution?
At times, when users do not know how to use Spark and request a lot of resources, the underlying JVMs can crash, which is a big cause for worry.
What do I think about the scalability of the solution?
No issues.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
CEO & Founder at Xautomata
Reduces startup time and gives excellent ROI
Pros and Cons
- "Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term."
- "The initial setup was not easy."
What is our primary use case?
I use Spark to run automation processes driven by data.
How has it helped my organization?
Apache Spark helped us with horizontal scalability and cost optimizations.
What is most valuable?
The most valuable feature is the grid computing.
What needs improvement?
An area for improvement is that when we start the solution and declare the maximum number of nodes, the process is shared, which is a problem in some cases. It would be useful to be able to change this parameter in real-time rather than having to stop the solution and restart with a higher number of nodes.
For how long have I used the solution?
I've been using Spark for around four years.
How was the initial setup?
The initial setup was not easy, but we created a means of asking the user about their needs, making the setup much easier. We can now deploy the platform in thirty minutes using the public cloud or Kubernetes space.
What was our ROI?
Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term.
What's my experience with pricing, setup cost, and licensing?
Spark is an open-source solution, so there are no licensing costs.
What other advice do I have?
I would rate Apache Spark eight out of ten.
Which deployment model are you using for this solution?
Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
The drag and drop GUI comment is very true. We developed such a GUI for spatial and time series data in Spark. But there are other tools out there. Maybe you should do a review of such tools.