The primary use case for Apache Spark is in-memory processing of big data, with the engine distributing that processing across a cluster. It is used for various tasks such as running the association rules algorithm in Spark ML, running XGBoost in parallel on the Spark engine, and preparing data for online machine learning using Spark Streaming.
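As a concrete illustration of the Spark ML association-rules workflow mentioned above, here is a minimal PySpark sketch using FPGrowth; the basket data and the support/confidence thresholds are invented for the example, not taken from the reviewer's workload.

    # Minimal FPGrowth example: frequent itemsets and association rules.
    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth

    spark = SparkSession.builder.appName("assoc-rules").getOrCreate()

    # Each row is one basket of items (invented data).
    baskets = spark.createDataFrame(
        [(0, ["bread", "milk"]),
         (1, ["bread", "butter", "milk"]),
         (2, ["butter", "jam"])],
        ["id", "items"],
    )

    fp = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.6)
    model = fp.fit(baskets)

    model.freqItemsets.show()        # frequent itemsets found by FP-Growth
    model.associationRules.show()    # rules derived from those itemsets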
Data Scientist at a financial services firm with 10,001+ employees
Real User
Top 10
Jul 10, 2024
Most of my use cases involve data processing. For example, someone tried to run sentiment analysis on Databricks using Apache Spark. They had to handle data from many countries and languages, which presented some challenges. Besides that, I primarily use Apache Spark for data processing tasks. I work with mobile phone datasets, around one terabyte in size. This involves extracting and analyzing data before building any models.
CEO International Business at a tech services company with 1,001-5,000 employees
MSP
Top 5
Nov 10, 2023
In AI deployment, a key step is aggregating data from various sources, such as customer websites, debt records, and asset information. Apache Spark plays a vital role in this process, efficiently handling continuous streams of data. Its capability enables seamless gathering and feeding of diverse data into the system, facilitating effective processing and analysis for generating alerts and insights, particularly in scenarios like banking.
Apache Spark can be used for multiple use cases in big data and data engineering tasks. We are using Apache Spark for ETL, integration with streaming data, and real-time predictions such as anomaly detection and price prediction, as well as data exploration on large volumes of data.
It's a core product in our pipeline. For example, one system supplies data to MongoDB; we pull this data from MongoDB, enrich it with additional fields from other systems, and write it to S3 for other systems to consume. Since we have a lot of data, we need a parallel process that runs hourly.
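A hedged sketch of an hourly enrich-and-land job like the one described above, assuming the MongoDB Spark Connector 10.x is on the classpath (option names differ between connector versions) and that S3 credentials are configured for the s3a:// filesystem; the host, database, paths, and join key are placeholders, not the reviewer's actual setup.

    # Hourly batch: read from MongoDB, enrich, land in S3 (all names are placeholders).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mongo-to-s3").getOrCreate()

    # Pull the raw documents out of MongoDB (connector 10.x option names).
    raw = (spark.read.format("mongodb")
           .option("connection.uri", "mongodb://mongo-host:27017")   # hypothetical host
           .option("database", "source_db")                          # hypothetical database
           .option("collection", "events")                           # hypothetical collection
           .load())

    # Reference data from another system, already available in the lake.
    ref = spark.read.parquet("s3a://my-bucket/reference/accounts/")  # hypothetical path

    # Enrich the raw records with additional fields.
    enriched = raw.join(ref, on="account_id", how="left")            # hypothetical join key

    # Write the enriched data for downstream systems, partitioned per hourly run.
    (enriched.write.mode("append")
     .partitionBy("ingest_hour")                                     # hypothetical column
     .parquet("s3a://my-bucket/enriched/events/"))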
Apache Spark is a processing framework that you can program in languages such as Java or Python. In my most recent deployment, we used Apache Spark to build data engineering pipelines that move data from source systems into the data lake.
Chief Data-strategist and Director at Theworkshop.es
Real User
Top 10
Aug 18, 2021
You can do a lot in terms of data transformation. You can store, transform, and stream data. It's very useful and has many use cases.
Senior Solutions Architect at a retailer with 10,001+ employees
Real User
Mar 27, 2021
We use Apache Spark to prepare data for transformation and encryption, depending on the columns. We use AES-256 encryption. We're building a proof of concept at the moment. We prepare batches on Spark for Kubernetes, both on-premises and on Google Cloud Platform.
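For column-level encryption during data preparation, one option is Spark's built-in aes_encrypt function (available from Spark 3.3). The sketch below is illustrative only and is not necessarily how this team implemented it; the data is made up, and in practice the key would come from a secret manager rather than a literal.

    # Column-level AES-256 encryption with Spark's built-in aes_encrypt (Spark 3.3+).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("column-encryption").getOrCreate()

    df = spark.createDataFrame(
        [("alice", "1234-5678-9012-3456"), ("bob", "9876-5432-1098-7654")],
        ["customer", "card_number"],
    )

    # Placeholder 32-byte key => AES-256; a real key would come from a secret manager.
    key = "0" * 32

    encrypted = (df.withColumn(
        "card_number_enc",
        expr(f"base64(aes_encrypt(card_number, '{key}', 'GCM'))"))
        .drop("card_number"))

    encrypted.show(truncate=False)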
We just finished a central project, called MFY, for our in-house fraud team. In this project, we are using Spark along with Cloudera, with Couchbase in front of Spark. Spark is mainly used for aggregations and, in the future, AI: it gathers data from Couchbase and does the calculations. We are not actively using the Spark AI libraries at this time, but we are going to use them.

The project classifies transactions and finds suspicious activities, especially those that come through internet channels such as internet banking and mobile banking. It executes rules developed by our business team. An example of a rule is: if the transaction count or transaction amount is greater than 10 million Turkish Liras and the user device is new, then raise an exception. The system sends an SMS to the user, and the user can choose whether or not to continue with the transaction.
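A rule like the example above could be expressed as a simple Spark aggregation and filter. The sketch below is illustrative; the column names, input path, and "new device" flag are assumptions, not the MFY project's actual schema.

    # Express the example rule as an aggregation + filter (schema is assumed).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("fraud-rules").getOrCreate()

    txns = spark.read.parquet("s3a://bucket/transactions/")  # hypothetical input

    # Per-user totals, split by whether the device is new.
    per_user = (txns.groupBy("user_id", "device_is_new")
                .agg(F.count("*").alias("txn_count"),
                     F.sum("amount_try").alias("txn_amount_try")))

    # Rule: count or amount above the threshold AND the device is new.
    threshold = 10_000_000  # 10 million TRY, as in the example rule
    suspicious = per_user.filter(
        ((F.col("txn_count") > threshold) | (F.col("txn_amount_try") > threshold))
        & F.col("device_is_new"))

    # Downstream, these rows would feed the SMS confirmation flow.
    suspicious.show()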
When we receive data from the messaging queue, we process everything using Apache Spark. Databricks does the processing and sends everything back to files in the data lake. The machine learning program then performs its analysis using an ML prediction algorithm.
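A hedged sketch of a queue-to-lake flow like the one described, using Spark Structured Streaming with Kafka standing in for the (unnamed) messaging queue; it assumes the spark-sql-kafka package is on the classpath, and the broker address, topic, and output paths are placeholders.

    # Structured Streaming: messaging queue (Kafka here) -> files in the data lake.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("queue-to-lake").getOrCreate()

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
              .option("subscribe", "events")                      # hypothetical topic
              .load())

    # Kafka delivers the payload as binary; cast it to string for downstream parsing.
    parsed = stream.select(F.col("value").cast("string").alias("payload"),
                           F.col("timestamp"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "s3a://bucket/lake/events/")                 # hypothetical path
             .option("checkpointLocation", "s3a://bucket/checkpoints/")   # hypothetical path
             .start())

    query.awaitTermination()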
Managing Consultant at a computer software company with 501-1,000 employees
Real User
Feb 2, 2020
Our use case for Apache Spark was a retail price prediction project. We used retail pricing data to build predictive models. To start, the prices were analyzed and we created the dataset to be visualized in Tableau, where we built dashboards and graphical reports to showcase the predictive modeling results. Apache Spark hosted the entire project.
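As a rough illustration of a price-prediction pipeline of this kind, the following PySpark sketch trains a linear regression with Spark MLlib; the input path, feature columns, and label column are assumptions, not the project's actual data.

    # Linear regression on retail pricing data with Spark MLlib (schema is assumed).
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("price-prediction").getOrCreate()

    pricing = spark.read.parquet("s3a://bucket/retail/pricing/")  # hypothetical input

    # Assemble assumed numeric columns into a single feature vector.
    features = VectorAssembler(
        inputCols=["cost", "competitor_price", "units_sold"],
        outputCol="features",
    ).transform(pricing)

    train, test = features.randomSplit([0.8, 0.2], seed=42)

    model = LinearRegression(featuresCol="features", labelCol="price").fit(train)

    # Score the hold-out set; the predictions could then be exported to Tableau.
    model.transform(test).select("price", "prediction").show()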
We have built a product called "NetBot." We take any form of data, such as large volumes of email, images, videos, or transactional data, transform the unstructured text and video into a structured, transactional form, and create an enterprise-wide smart data grid. That smart data grid is then used by downstream analytics tools. We also provide model-building so people can get faster insight into their data.
Technical Consultant at a tech services company with 1-10 employees
Consultant
Dec 23, 2019
We are working with a client that has a wide variety of data residing in other structured databases, as well. The idea is to make a database in Hadoop first, which we are in the process of building right now. One place for all kinds of data. Then we are going to use Spark.
We primarily use the solution to integrate very large data sets from other environments, such as our SQL environment, and to extract purposeful data before checking it. We also use the solution for streaming from very large servers.
Senior Consultant & Training at a tech services company with 51-200 employees
Consultant
Oct 13, 2019
We use this solution for information gathering and processing. I use it myself when I am developing on my laptop. I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function...
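A small example of the RDD programming model described in that excerpt: a distributed collection transformed lazily with map and filter, then materialized with an action.

    # The RDD model in brief: lazy transformations over a distributed collection,
    # materialized by an action; lineage lets lost partitions be recomputed.
    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-example")

    # Distribute a local collection across the cluster as an RDD.
    numbers = sc.parallelize(range(1, 11))

    # Transformations (filter, map) are recorded, not executed immediately.
    squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

    # The action triggers the distributed computation.
    print(squares_of_evens.collect())   # [4, 16, 36, 64, 100]

    sc.stop()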
We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.
In my company, the solution is used for batch processing or real-time processing.
Our primary use case is interactively processing large volumes of data.
We use it for real-time and near-real-time data processing. We use it for ETL purposes as well as for implementing the full transformation pipelines.
We use Apache Spark for storage and processing.
Our customers configure their software applications, and I use Apache Spark to check them. We use it for data processing.
Predominantly, I use Spark for data analysis on top of datasets containing tens of millions of records.
We use Spark for machine learning applications, clustering, and segmentation of customers.
I am using Apache Spark for the data transition from databases. We have customers who have one database as a data lake.
I use Spark to run automation processes driven by data.
I mainly use Spark to prepare data for processing because it has APIs for data evaluation.
The solution can be deployed on the cloud or on-premise.
I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.
We primarily use the solution for security analytics.
We use the solution for analytics.
Streaming telematics data.
Ingesting billions of rows of data all day.
Used for building big data platforms for processing huge volumes of data. Additionally, streaming data is critical.