You can do a lot of things in terms of the transformation of data. You can store and transform and stream data. It's very useful and has many use cases.
Chief Data-strategist and Director at Theworkshop.es
Scalable, open-source, and great for transforming data
Pros and Cons
- "The solution has been very stable."
- "It's not easy to install."
What is our primary use case?
What is most valuable?
Overall, it's a very nice tool.
It is great for transforming data and for micro-streaming or micro-batching.
The product offers an open-source version.
The solution has been very stable.
The scalability is good.
Apache Spark is a huge tool. It has many use cases and is very flexible. You can use it with so many other platforms.
Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java.
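For readers unfamiliar with the micro-batching mentioned above, here is a minimal sketch in plain Python. It is not Spark code: the `micro_batches` and `transform` helpers are hypothetical names, a finite `range` stands in for an unbounded source, and batches are cut by record count, whereas streaming engines commonly cut them by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of up to batch_size records from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


def transform(batch: List[int]) -> List[int]:
    """Hypothetical per-batch transformation: double every record."""
    return [record * 2 for record in batch]


# A finite range stands in for an unbounded source such as a message queue.
results = [transform(batch) for batch in micro_batches(range(7), batch_size=3)]
print(results)  # [[0, 2, 4], [6, 8, 10], [12]]
```

Each small batch is processed as a unit, which is the trade-off micro-batching makes: slightly higher latency than record-at-a-time streaming in exchange for batch-style processing.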
What needs improvement?
If you are developing projects and need to put them into a production scenario, you will likely need a cluster of servers, as it requires distributed computing.
It's not easy to install. You are typically dealing with a big data system.
It's not a simple, straightforward architecture.
For how long have I used the solution?
I've been using the solution for three years.
Buyer's Guide
Apache Spark
November 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: November 2024.
816,406 professionals have used our research since 2012.
What do I think about the stability of the solution?
The stability is very good. There are no bugs or glitches and it doesn't crash or freeze. It's a reliable solution.
What do I think about the scalability of the solution?
We have found the scalability to be good. If your company needs to expand it, it can do so.
We have five people working on the solution currently.
How are customer service and support?
There isn't really technical support for open source. You need to do your own studying. There are lots of places to find information. You can find details online, or in books, et cetera. There are even courses you can take that can help you understand Spark.
Which solution did I use previously and why did I switch?
I also use Databricks, which I use in the cloud.
How was the initial setup?
When handling big data systems, the installation is a bit difficult. When you need to deploy the systems, it's better to use services like Databricks.
I am not a professional admin. I am a developer and I design architecture.
You can use it on a standalone system; however, that is not the best way. It would be okay for small pieces of code, not for production.
What's my experience with pricing, setup cost, and licensing?
We use the open-source version. It is free to use. However, you do need to have servers. We have three or four. They can be on-premises or in the cloud.
What other advice do I have?
I have the solution installed on my computer and on our servers. You can use it on-premises or as a SaaS.
I'd rate the solution at a nine out of ten. I've been very pleased with its capabilities.
I would recommend the solution for the people who need to deploy projects with streaming. If you have many different sources or different types of data, and you need to put everything in the same place - like a data lake - Spark, at this moment, has the right tools. It's an important solution for data science, for data detectors. You can put all of the information in one place with Spark.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Partner / Head of Data & Analytics at Intelligence Software Consulting
Great for machine learning applications; good documentation available
Pros and Cons
- "Provides a lot of good documentation compared to other solutions."
- "The migration of data between different versions could be improved."
What is our primary use case?
We use Spark for machine learning applications, clustering, and segmentation of customers.
What is most valuable?
Apache provides a lot of good documentation compared to other solutions.
What needs improvement?
The migration of data between different versions could be improved.
For how long have I used the solution?
I've been using this solution for four years.
What do I think about the stability of the solution?
The solution is stable.
What do I think about the scalability of the solution?
The solution is scalable.
How are customer service and support?
If you pay for customer support then you get a quick and efficient response, otherwise the community support offers good help.
How was the initial setup?
The initial setup has been simplified over the past few years and is now relatively straightforward.
What's my experience with pricing, setup cost, and licensing?
Licensing costs depend on where you source the solution.
What other advice do I have?
This is a good solution for big data use cases and I rate it eight out of 10.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Co-Founder at FORMCEPT Technologies
Handles large volume data, cloud and on-premise deployments, but difficult to use
Pros and Cons
- "Apache Spark can do large volume interactive data analysis."
- "Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn."
What is our primary use case?
The solution can be deployed on the cloud or on-premise.
How has it helped my organization?
We are using Apache Spark for large volume interactive data analysis.
MechBot is an enterprise, one-click installation, trusted data excellence platform. Underneath, I am using Apache Spark, Kafka, Hadoop HDFS, and Elasticsearch.
What is most valuable?
Apache Spark can do large volume interactive data analysis.
What needs improvement?
Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn.
For how long have I used the solution?
I have been using Apache Spark for approximately 11 years.
What do I think about the stability of the solution?
The solution is stable.
What do I think about the scalability of the solution?
Apache Spark is scalable. However, it needs enormous technical skills to make it scalable. It is not a simple task.
We have approximately 20 people using this solution.
How was the initial setup?
If you want to distribute Apache Spark in a certain way, it is simple. However, not every engineer can do it; it requires specialized DevOps skills for Spark.
If we are going to deploy the solution in a one-layer laptop installation, it is very straightforward, but this is not what someone is going to deploy in the production site.
What's my experience with pricing, setup cost, and licensing?
Since we are using the Apache-licensed version of Spark, not the Databricks version, support and bug resolution are often late or delayed. The Apache license is free.
What other advice do I have?
We are well versed in Spark, the version, the internal structure of Spark, and we know what exactly Spark is doing.
The solution cannot be made easier. Not everything can be simplified, because it involves core data, computer science, and pro-engineering concepts that not many people are actually aware of.
I rate Apache Spark a six out of ten.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Co-Founder at FORMCEPT Technologies
Offers good machine learning, data learning, and Spark Analytics features
Pros and Cons
- "The features we find most valuable are the machine learning, data learning, and Spark Analytics."
- "We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data."
What is our primary use case?
We have built a product called "NetBot." We take any form of data (large email data, images, videos, or transactional data), transform unstructured text and video data into a structured form, and create an enterprise-wide smart data grid. That smart data grid is used by downstream analytics tools. We also provide machine-building for people to get faster insight into their data.
What is most valuable?
We use all the features. We use it for end-to-end. All of our data analysis and execution happens through Spark.
The features we find most valuable are the:
- Machine learning
- Data learning
- Spark Analytics.
What needs improvement?
We've had problems using a Python process to try to access something in a large volume of data. It crashes if somebody gives me the wrong code because it cannot handle a large volume of data.
For how long have I used the solution?
I have been using Apache Spark for more than five years.
What do I think about the stability of the solution?
We haven't had any issues with stability so far.
What do I think about the scalability of the solution?
As long as you do it correctly, it is scalable.
Our users mostly consist of data analysts, engineers, data scientists, and DB admins.
Which solution did I use previously and why did I switch?
Before using this solution we used Apache Storm.
How was the initial setup?
The initial setup is complex.
What about the implementation team?
We installed it ourselves.
What other advice do I have?
I would rate it a nine out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: My company has a business relationship with this vendor other than being a customer: Partner
CEO International Business at a tech services company with 1,001-5,000 employees
A powerful open-source framework for fast, flexible, and versatile big data processing, with a strong learning curve and resource demands
Pros and Cons
- "The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations."
- "It requires overcoming a significant learning curve due to its robust and feature-rich nature."
What is our primary use case?
In AI deployment, a key step is aggregating data from various sources, such as customer websites, debt records, and asset information. Apache Spark plays a vital role in this process, efficiently handling continuous streams of data. Its capability enables seamless gathering and feeding of diverse data into the system, facilitating effective processing and analysis for generating alerts and insights, particularly in scenarios like banking.
What is most valuable?
The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations.
What needs improvement?
It requires overcoming a significant learning curve due to its robust and feature-rich nature.
For how long have I used the solution?
We have been using it for two years now.
What do I think about the stability of the solution?
It provides excellent stability. We never faced any issues with it.
What do I think about the scalability of the solution?
It ensures outstanding scalability capabilities.
Which solution did I use previously and why did I switch?
Opting for Apache Spark, an open-source solution, provides a distinct advantage by offering control over the code. This means you can identify issues, make necessary fixes, and determine what aspects to accept as they are. In contrast, dealing with a vendor may limit control, requiring you to submit requests and advocate for changes based on your business volume with them. This dependency on volume can potentially compromise control. To safeguard both your customers and your business, the choice of an open-source solution like Apache Spark allows for more autonomy and control over the technology stack.
What about the implementation team?
The system's smooth operation relies on deploying a comprehensive container with Kubernetes clusters, configured with essential toolsets. Instrumentation data from the backend is fed back to a central framework equipped with specific tools for driving various processes. In a case involving a customer with Red Hat and Postini clusters, the OpenShift Container Platform, comprising Kubernetes clusters, is used. The tools manage onboarding, infrastructure provisioning, certificate management, authorization control, etc. The deployment spans multiple independent data centers, like telecom circles in India, requiring unique approaches for various tasks, including disaster recovery planning and central alerting, facilitated through SaaS. The deployment process typically takes approximately forty to forty-five days for six thousand servers.
What was our ROI?
It provides a dual advantage by saving both time and money while enhancing performance, particularly by leveraging my skill sets.
What's my experience with pricing, setup cost, and licensing?
It is an open-source solution and is free of charge.
What other advice do I have?
I would give it a rating of seven out of ten, which, by my standards, is quite high.
Which deployment model are you using for this solution?
Private Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Other
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Data Engineer at Berief Food GmbH
A useful and easy-to-deploy product that has an excellent data processing framework
Pros and Cons
- "The data processing framework is good."
- "The solution must improve its performance."
What is our primary use case?
Our customers configure their software applications, and I use Apache to check them. We use it for data processing.
What is most valuable?
The data processing framework is good. The product is very useful.
What needs improvement?
The solution must improve its performance.
For how long have I used the solution?
I have been using the solution for four to five years.
What do I think about the stability of the solution?
The tool is stable. I rate the stability more than nine out of ten.
What do I think about the scalability of the solution?
We have a small business. Around four people in my organization use the solution.
How was the initial setup?
The deployment was easy.
What about the implementation team?
The solution was deployed with the help of third-party consultants.
What other advice do I have?
Overall, I rate the product more than eight out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Director - Data Management, Governance and Quality at Hilton Worldwide
Powerful language but complicated coding
What is our primary use case?
Ingesting billions of rows of data all day.
How has it helped my organization?
Spark on AWS is not that cost-effective as memory is expensive and you cannot customize hardware in AWS. If you want more memory, you have to pay for more CPUs too in AWS.
What is most valuable?
Powerful language.
What needs improvement?
It is like going back to the '80s for the complicated coding that is required to write efficient programs.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Quantitative Developer at a marketing services firm with 11-50 employees
Seamless in distributing tasks, including its impressive map-reduce functionality
Pros and Cons
- "The distribution of tasks, like the seamless map-reduce functionality, is quite impressive."
- "When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."
What is our primary use case?
Predominantly, I use Spark for data analysis on top of datasets containing tens of millions of records.
How has it helped my organization?
I have an example. We had a single-threaded application that used to run for about four to five hours, but with Spark, it got reduced to under one hour.
What is most valuable?
The distribution of tasks, like the seamless map-reduce functionality, is quite impressive. For the user, it appears as simple single-line data manipulations, but behind the scenes, the executor pool intelligently distributes the map and reduce functions.
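To illustrate the pattern described above, here is a rough sketch in plain Python rather than Spark itself. A thread pool stands in for the executor pool, and `map_reduce`, `square`, and the partitioning scheme are hypothetical choices made for this example. In PySpark, the user-facing equivalent would be roughly a one-liner along the lines of `rdd.map(square).reduce(add)`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def square(x: int) -> int:
    """Map function applied to every record."""
    return x * x


def map_reduce(records, map_fn, reduce_fn, workers=4):
    """Apply map_fn per partition in parallel, then combine with reduce_fn.

    Assumes records is a non-empty list and reduce_fn is associative and
    commutative, so partial results can be combined in any order.
    """
    # Split the records into one partition per worker.
    partitions = [records[i::workers] for i in range(workers)]

    def process(partition):
        mapped = [map_fn(record) for record in partition]
        return reduce(reduce_fn, mapped) if mapped else None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = [p for p in pool.map(process, partitions) if p is not None]

    # Final reduce over the per-partition results.
    return reduce(reduce_fn, partials)


total = map_reduce(list(range(1, 11)), square, add)
print(total)  # 385, the sum of squares from 1 to 10
```

The caller only supplies the two functions; the partitioning and the combining of partial results stay hidden, which is the convenience the review praises.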
What needs improvement?
The visualization could be improved.
For how long have I used the solution?
I have been working with Apache Spark for only a few months, not too long.
What do I think about the stability of the solution?
I haven't faced any stability issues. It has been stable in my experience.
What do I think about the scalability of the solution?
When it comes to the scalability of Spark, it's primarily a processing engine, not a database engine. I haven't tested it extensively with large record sizes.
In my organization, quite a few people are using Spark. In my smaller team, there are only two users.
What about the implementation team?
In terms of maintenance, when the load hits around 95%, we need to prioritize scripts and analysis within the team.
We coordinate and prioritize based on the available resources. If there were self-service tools or better hand-holding for such situations, it would make things easier.
Which other solutions did I evaluate?
Currently, we extensively use pandas and Polars. We are leveraging Docker and Kubernetes as a framework, along with AWS Batch for distribution. This is the closest substitute we have for Spark's distribution.
Both Docker and Kubernetes are more general-purpose solutions. If someone is already using Kubernetes and it's provided as a service, it can be used for special-purpose utilization, similar to Docker and Kubernetes.
In such cases, users may need to write the parallelization logic themselves, but it's relatively easy to onboard and start with a distributed load. Spark, on the other hand, is primarily used for special-purpose utilization. Users typically choose Spark when they have data-intensive tasks.
Another significant issue with Spark is its syntax. For instance, if we have libraries like pandas or Polars, we can run them single-threaded on a single core, or we can distribute them leveraging Kubernetes.
We don't need to rewrite that code base for Spark. However, if we are writing code specifically for Spark executors, it will not be amenable to running locally.
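The portability point can be sketched as follows, under loud assumptions: `summarize` is a hypothetical unit of work, the chunks are made-up sample data, and a local thread pool merely stands in for whatever scheduler (for example a Kubernetes job runner or AWS Batch) would fan the work out in practice.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Dict, List


def summarize(chunk: List[float]) -> Dict[str, float]:
    """Hypothetical unit of work written as an ordinary Python function."""
    return {"count": len(chunk), "mean": mean(chunk)}


# Made-up sample data, already split into independent chunks.
chunks = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]

# Option 1: run single-threaded on one core.
local_results = [summarize(chunk) for chunk in chunks]

# Option 2: hand the same function, unchanged, to a pool of workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    distributed_results = list(pool.map(summarize, chunks))

print(local_results == distributed_results)  # True
```

Because the unit of work is a plain function, the choice between local and distributed execution is made by the caller, not baked into the code.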
What other advice do I have?
I would recommend understanding the use case better. Only if it fits your use case, then go for it. But it is a great tool.
Overall, I would rate Apache Spark an eight out of ten.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Popular Comparisons
Amazon EMR
Cloudera Distribution for Hadoop
Spark SQL
IBM Spectrum Computing
Hortonworks Data Platform
Informatica Big Data Parser
IBM Db2 Big SQL
Quick Links
Learn More: Questions:
- Which is the best RDBMS solution for big data?
- Apache Spark without Hadoop -- Is this recommended?
- Which solution has better performance: Spring Boot or Apache Spark?
- AWS EMR vs Hadoop
- Handling real and fast data - how do BigInsight and other solutions perform?
- When evaluating Hadoop, what aspect do you think is the most important to look for?
- Should we choose InfoSphere BigInsights or Cloudera?