I use the solution for data lakes and big data projects. I can combine it with other programming languages.
Senior Data Architect at Yettel
Parallel computing helped create data lakes with near real-time loading
Pros and Cons
- "It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."
- "If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of dynamic allocation."
What is our primary use case?
What is most valuable?
One of the reasons we use Spark is so we can use parallelism in data lakes. In our case, we can get many data nodes, and the main power of Hadoop and big data solutions is the number of nodes usable for different operations. It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance. Spark also has an option for near real-time loading and processing; we use Spark's micro-batches.
What needs improvement?
If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of dynamic allocation. In combination with other tools, many sessions remain even if you think they've stopped. This is the main problem with big data sessions: zombie sessions linger, and you have to take care of them. Otherwise, they consume resources and cause problems.
For how long have I used the solution?
I've been using Apache Spark for more than two years. I'm using the latest version.
Buyer's Guide
Apache Spark
December 2024
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: December 2024.
824,067 professionals have used our research since 2012.
What do I think about the stability of the solution?
The solution is stable, but not completely. For example, we use Exadata as an extremely stable data warehouse; that level of stability isn't possible with big data. There are things you have to fix sometimes. The stability is comparable to a cloud solution, but it depends on the solution you need.
What do I think about the scalability of the solution?
The solution is scalable, but adding new nodes is not easy. It will take some time to do that, but it's scalable. We have about 20 users using Apache Spark. We regularly use the solution.
How are customer service and support?
We use the Cloudera distribution, which is not open-source, so we ask Cloudera for support.
How was the initial setup?
When you install the complete environment, you install Spark as a part of this solution. The setup can be tricky when introducing security, such as connecting Spark using Kerberos. It can be tricky because when you use it, you have to distribute your architecture with many servers, and even then, you have to prepare Kerberos on every server. It's not possible to do this in one place.
Deploying Apache Spark is pretty complex, but that is down to the security approach. Our security team mandated Kerberos authentication, which can be complex. We had five people for maintenance and deployment alone, not counting other roles.
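For reference, the Kerberos settings mentioned above are typically wired in through Spark configuration. This is a hedged config sketch only — the keytab path and principal are placeholders, and on Spark 2.x with YARN the equivalent keys are `spark.yarn.keytab` / `spark.yarn.principal`.

```python
from pyspark.sql import SparkSession

# Standard Spark 3.x Kerberos options; paths and principal are placeholders.
builder = (SparkSession.builder
           .appName("secure-job")
           .config("spark.kerberos.keytab", "/etc/security/keytabs/etl.keytab")
           .config("spark.kerberos.principal", "etl@EXAMPLE.COM"))

# builder.getOrCreate() would authenticate via the keytab, which is why the
# keytab must be present on every server in the cluster, as the review notes.
```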
What about the implementation team?
We had an external integrator, but we also had in-house knowledge. Sometimes, we need to change or install something, and it's not good to ask the integrator for everything because of availability and planning. We had more freedom thanks to our internal knowledge.
What's my experience with pricing, setup cost, and licensing?
Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera. But in that case, you don't have any support. If you face a problem, you might find something in the community, but you cannot ask Cloudera about it. If you have open source, you don't have support, but you have a community. Cloudera has different packages, which are licensed versions of products like Apache Spark. In this case, you can ask Cloudera for everything.
What other advice do I have?
Spark was written in Scala. Scala is a programming language that runs on the Java platform and is useful for data lakes.
We thought about using Flink instead, but it wasn't useful for us and wouldn't have gained us any additional value. Besides, Spark's community is much wider, so more information is available than for Flink.
I rate Apache Spark an eight out of ten.
If you plan to implement Apache Spark on a large-scale system, you should learn to use parallelism, partitioning, and everything from the physical level to get the best performance from Spark. And it will be good to know Python, especially for data scientists using PySpark for analysis. Likewise, it's good to know Scala because you can be very efficient in preparing some datasets since it is Spark's native language.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Analyst at Deloitte
Processes a larger volume of data efficiently and integrates with different platforms
Pros and Cons
- "The product’s most valuable features are lazy evaluation and workload distribution."
- "They could improve the issues related to programming language for the platform."
What is our primary use case?
We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.
What is most valuable?
The product’s most valuable features are lazy evaluation and workload distribution.
What needs improvement?
They could improve the issues related to programming language for the platform.
For how long have I used the solution?
We have been using Apache Spark for around two and a half years.
What do I think about the stability of the solution?
The platform’s stability depends on how effectively we write the code. We encountered a few issues related to programming languages.
What do I think about the scalability of the solution?
We have more than 100 Apache Spark users in our organization.
Which solution did I use previously and why did I switch?
Before choosing Apache Spark for processing big data, we evaluated another option, Hadoop. However, Spark emerged as a superior choice comparatively.
How was the initial setup?
The initial setup complexity depends on whether it's on the cloud or on-premise. For cloud deployments, especially using platforms like Databricks, the process is straightforward and can be configured with ease. However, if the deployment is on-premise, the setup tends to be more time-consuming, although not overly complex.
What's my experience with pricing, setup cost, and licensing?
They provide an open-source license for the on-premise version. However, we have to pay for the cloud version including data centers and virtual machines.
What other advice do I have?
Apache Spark is a good product for processing large volumes of data compared to other distributed systems. It provides efficient integration with Hadoop and other platforms.
I rate it a ten out of ten.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Data Engineer at BBD
A reliable and scalable open-source framework for big data processing that excels in speed, fault tolerance, and support for various data sources
Pros and Cons
- "It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."
- "One limitation is that not all machine learning libraries and models support it."
What is our primary use case?
We use it for data engineering and analytics to process and examine extensive datasets.
What is most valuable?
It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained.
What needs improvement?
One limitation is that not all machine learning libraries and models support it. While libraries like Scikit-learn may work with some Spark-compatible models, not all machine-learning tools are compatible with Spark. In such cases, you may need to extract data from Spark and train your models on smaller datasets instead of directly using Spark for training.
For how long have I used the solution?
I have been using it for four years.
What do I think about the stability of the solution?
I have not encountered any significant stability issues and it has proven to be a robust and reliable platform without major crashes. However, there have been instances where I needed to address query optimization and similar tasks to ensure optimal performance. I would rate it nine out of ten.
How are customer service and support?
To rate my overall experience, I would give it an eight out of ten, leaving room for potential improvements in terms of technical support.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
We used Pandas data frames and SQL-type queries for smaller datasets, but we haven't worked with anything on the scale of Spark SQL.
How was the initial setup?
I haven't handled the deployment process, but setting it up on the cloud seems relatively straightforward.
What about the implementation team?
Setting it up on-premises might take longer, potentially a couple of days. However, when deploying it on the cloud, the process can be significantly quicker, possibly taking only a few hours.
What's my experience with pricing, setup cost, and licensing?
The solution can be expensive, as it requires substantial resources for implementation: hardware, memory, and licensing for on-premises deployments. Managing costs in a cloud environment can be challenging due to the cumulative expenses associated with running and maintaining Spark. Licensing costs may not be the primary concern, but operational costs in the cloud can add up. For on-premises deployments, maintenance costs include cluster management, job optimization, and upgrades. In the cloud, maintenance costs are relatively lower, especially with managed database clusters, but they still exist and primarily revolve around cluster upkeep.
Which other solutions did I evaluate?
We evaluated Microsoft Synapse, which offers similar analytics functionality but not quite at the same scale as Apache Spark and Spark as a whole. While some tasks can be accomplished with Synapse on AWS, there are certain features and capabilities, such as micro-batching and scalability, that Spark excels at and remains unmatched.
What other advice do I have?
Additional skill requirements are crucial to use the solution and its related features effectively. Training costs and efforts may be necessary to ensure individuals are proficient in using these technologies. Overall, I would rate it nine out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Head of Data Science center of excellence at Ameriabank CJSC
Enhanced data processing with good support and helpful integration with Pandas syntax in distributed mode
Pros and Cons
- "The most significant advantage of Spark 3.0 is its support for Pandas UDFs in the DataFrame API."
- "The main concern is the overhead of Java when distributed processing is not necessary."
What is our primary use case?
The primary use case for Apache Spark is to process data in memory, using big data, and distributing the engine to process said data. It is used for various tasks such as running the association-rules algorithm in Spark ML, running XGBoost in parallel using the Spark engine, and preparing data for online machine learning using Spark Streaming mode.
How has it helped my organization?
The most significant cost savings come from the operational side. Spark is very common in operations, so there are many experts available in the market, making it easier to find the right personnel. It is quite mature, which reduces operating costs.
What is most valuable?
The most significant advantage of Spark 3.0 is its support for Pandas UDFs in the DataFrame API. This allows running Pandas code distributed by the Spark engine, which is a crucial feature. The integration with Pandas syntax in distributed mode, along with the user-defined functions in PySpark, is particularly valuable.
What needs improvement?
The main concern is the overhead of Java when distributed processing is not necessary. In such cases, operations can often be done on one node, making Spark's distributed mode unnecessary; consequently, alternatives like Doc DB are preferable. Additionally, performance is slower in some cases, making alternatives two to five times faster.
For how long have I used the solution?
I have more than ten years of experience using Spark, starting from when it was first introduced.
What do I think about the stability of the solution?
Spark is very stable for our needs. It offers amazing stability.
What do I think about the scalability of the solution?
Scalability depends on how infrastructure is organized. Better balance and network considerations are necessary. However, Spark is very stable when scaled appropriately.
How are customer service and support?
Customer support for Apache Spark is very good. There is a lot of documentation and forums available, making it easier to find solutions. The Databricks team also does a lot to support Spark.
How would you rate customer service and support?
Positive
How was the initial setup?
The initial setup of Spark can take about a week, assuming the right infrastructure is already in place.
What about the implementation team?
A few technicians are typically required for installation and configuration. SRE engineers or operational guys handle the setup, as they need to understand the details about installation and configuration. Maintenance usually requires just an SRE engineer or operational guy.
What was our ROI?
The main benefit in terms of ROI comes from the operation side. Spark’s operational costs are lower due to the availability of experts and its maturity. However, performance costs might be higher due to the need for more memory and infrastructure.
What's my experience with pricing, setup cost, and licensing?
Compared to other solutions like Doc DB, Spark is more costly due to the need for extensive infrastructure. It requires significant investment in infrastructure, which can be expensive. While cloud solutions like Databricks can simplify the process, they may also be less cost-efficient.
What other advice do I have?
I'd rate the solution eight out of ten.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Other
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Last updated: Sep 25, 2024
Module Lead at Mphasis
Helps to build ETL pipelines and load data to warehouses
Pros and Cons
- "The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily."
- "Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial."
What is our primary use case?
We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.
What is most valuable?
The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily.
What needs improvement?
Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial.
For how long have I used the solution?
I have been using the product for six years.
What do I think about the stability of the solution?
Apache Spark is generally considered a stable product, with rare instances of breaking down. Issues may arise in sudden increases in data volume, leading to memory errors, but these can typically be managed with autoscaling clusters. Additionally, schema changes or irregularities in streaming data may pose challenges, but these could be addressed in future software versions.
What do I think about the scalability of the solution?
About 70-80 percent of employees in my company use the product.
How are customer service and support?
We haven't contacted Apache Spark support directly because it's an open-source tool. However, when using it as a product within Databricks, we've contacted Databricks support for assistance.
Which solution did I use previously and why did I switch?
The main reason our company opted for the product is its capability to process large volumes of data. While other options like Snowflake offer some advantages, they may have limitations regarding custom logic or modifications.
How was the initial setup?
The solution's setup and installation of Apache Spark can vary in complexity depending on whether it's done in a standalone or cluster environment. The process is generally more straightforward in a standalone setup, especially if you're familiar with the concepts involved. However, setting up in a cluster environment may require more knowledge about clusters and networking, making it potentially more complex.
What's my experience with pricing, setup cost, and licensing?
The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks.
What other advice do I have?
If you're new to Apache Spark, the best way to learn is by using the Databricks Community Edition. It provides a cluster for Apache Spark where you can learn and test. I rate the product an eight out of ten.
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Associate Director at a consultancy with 10,001+ employees
High performance, beneficial in-memory support, and useful online community support
Pros and Cons
- "One of Apache Spark's most valuable features is that it supports in-memory processing; the execution of jobs is very fast compared to traditional tools."
- "Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors."
What is our primary use case?
Apache Spark is a processing framework that you program in languages such as Java or Python. In my most recent deployment, we used Apache Spark to build engineering pipelines to move data from sources into the data lake.
What is most valuable?
One of Apache Spark's most valuable features is that it supports in-memory processing; the execution of jobs is very fast compared to traditional tools.
What needs improvement?
Apache Spark could improve the connectors that it supports. There are a lot of open-source databases in the market. For example, cloud databases, such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have inbuilt connectors.
For how long have I used the solution?
I have been using Apache Spark for approximately five years.
What do I think about the stability of the solution?
Apache Spark is stable.
What do I think about the scalability of the solution?
I have found Apache Spark to be scalable.
How are customer service and support?
Apache Spark is open-source, there is no team that will give you dedicated support, but you can post your queries on the community forums, and usually, you will receive a good response. Since it's open-source, you depend on freelance developers to respond to you, you cannot put a time limit there, but the response, on average, is pretty good.
How was the initial setup?
If Apache Spark is in the cloud, setting it up will require only minutes. If it's on Amazon, GCP, or Microsoft cloud, it'll take minutes to set everything up. However, if you are using the on-premise version, then it might take some time to set up the environment.
What other advice do I have?
I rate Apache Spark an eight out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Cloud and Big Data Engineer | Developer at Huawei Cloud Middle East
A scalable solution that can be used for data computation and building data pipelines
Pros and Cons
- "The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast."
- "Apache Spark should add some resource management improvements to the algorithms."
What is our primary use case?
Apache Spark is used for data computation, building data pipelines, or building analytics on top of batch data. Apache Spark is used to handle big data efficiently.
What is most valuable?
The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast.
What needs improvement?
Apache Spark should add some resource management improvements to the algorithms. That way, the solution could manage data skew more efficiently in the physical and logical plans when you are joining different data sets.
For how long have I used the solution?
I have been working with Apache Spark for six to seven years.
What do I think about the stability of the solution?
Apache Spark is a very stable solution. The community is still working on other parts, like performance and removing bottlenecks. However, from a stability point of view, the solution is very good.
I rate Apache Spark a nine out of ten for stability.
What do I think about the scalability of the solution?
Apache Spark is a scalable solution. More than 50 to 100 users are using the solution in our organization.
How are customer service and support?
Apache Spark's technical support team responds on time.
How would you rate customer service and support?
Positive
How was the initial setup?
The solution’s initial setup is very easy.
What's my experience with pricing, setup cost, and licensing?
Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises.
What other advice do I have?
I would recommend Apache Spark to users doing analytics, data computation, or pipelines.
Overall, I rate Apache Spark ten out of ten.
Which deployment model are you using for this solution?
Hybrid Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Software Architect at Akbank
Provides fast aggregations, AI libraries, and a lot of connectors
Pros and Cons
- "AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
- "Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."
What is our primary use case?
We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase.
Spark is mainly used for aggregations and AI (for future usage). It gathers stuff from Couchbase and does the calculations. We are not actively using Spark AI libraries at this time, but we are going to use them.
This project is for classifying the transactions and finding suspicious activities, especially those suspicious activities that come from internet channels such as internet banking and mobile banking. It tries to find out suspicious activities and executes rules that are being developed or written by our business team. An example of a rule is that if the transaction count or transaction amount is greater than 10 million Turkish Liras and the user device is new, then raise an exception. The system sends an SMS to the user, and the user can choose to continue or not continue with the transaction.
How has it helped my organization?
Aggregations have been very fast in our project since we started to use Spark. We can deliver results in around 300 milliseconds. Before using Spark, the time was around 700 milliseconds.
Before using Spark, we only used Couchbase. We needed fast results for this project because transactions come from various channels, and we need to decide and resolve them at the earliest because users are performing the transaction. If our result or process takes longer, users might stop or cancel their transactions, which means losing money. Therefore, fast results time is very important for us.
What is most valuable?
AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI.
What needs improvement?
Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.
For how long have I used the solution?
I am a Java developer. I have been interested in Spark for around five years. We have been actively using it in our organization for almost a year.
What do I think about the stability of the solution?
It is the most stable platform. Compared to Flink, Spark is good, especially in terms of clusters and architecture. My colleagues who set up these clusters say that Spark is the easiest.
What do I think about the scalability of the solution?
It is scalable, but we don't have the need to scale it.
It is mainly used for reporting big data in our organization. All teams, especially the VR team, are using Spark for job execution and remote execution. I can say that 70% of users use Spark for reporting, calculations, and real-time operations. We are a very big company, and we have around a thousand people in IT.
We will continue its usage and develop more. We have kind of just started using it. We finished this project just three months ago. We are now trying to find out bottlenecks in our systems, and then we are ready to go.
How are customer service and technical support?
We have not used Apache support. We have only used Cloudera support for this project, and they helped us a lot during the development cycle of this project.
How was the initial setup?
I don't have any idea about it. We are a big company, and we have another group for setting up Spark.
What other advice do I have?
I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult.
If you are clear about your needs, it is easier to set it up. If you know how Spark is used in your project, you have to define firewall rules and cluster needs. When you set up Spark, it should be ready for people's usage, especially for remote job execution.
I would rate Apache Spark a nine out of ten.
Which deployment model are you using for this solution?
On-premises
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros
sharing their opinions.
Updated: December 2024
Popular Comparisons
Amazon EMR
Cloudera Distribution for Hadoop
Spark SQL
IBM Spectrum Computing
Hortonworks Data Platform
Informatica Big Data Parser
IBM Db2 Big SQL