There are several valuable features.
- Interactive data access (low latency)
- Batch ETL-style processing
- Schema-free data models
- Algorithms
We have seen a 1000x performance improvement over other techniques. It has enabled interactive, self-service access to data.
Better integration with BI tools would be a much appreciated improvement.
I've used it for about 14 months.
I haven't had any issues with deployment.
It's been stable for us.
It's scaled without issue.
Customer service is excellent.
Technical support is excellent.
Yes, we previously used Oracle, from which we ported our data.
The initial setup was simple.
We implemented it with our in-house team.
Be sure to use the Apache versions and avoid vendor-specific extensions.
We use Apache Spark to prepare data for transformation and encryption, depending on the columns; we use AES-256 encryption. We're building a proof of concept at the moment, running Spark on Kubernetes both on-premises and on Google Cloud Platform.
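To illustrate that kind of column-level encryption, here is a minimal sketch, assuming Spark 3.3+ (which ships the built-in `aes_encrypt` function); the key, columns, and data are placeholders, not the reviewer's actual setup:

```python
# Minimal sketch: column-level AES-256 encryption in PySpark.
# Assumes Spark >= 3.3 (built-in aes_encrypt); key/columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("encryption-poc").getOrCreate()

key = "0123456789abcdef0123456789abcdef"  # 32 bytes -> AES-256 (demo key only)

df = spark.createDataFrame(
    [(1, "alice@example.com"), (2, "bob@example.com")],
    ["id", "email"],
)

# Encrypt only the sensitive column, keep the rest as-is.
encrypted = df.withColumn(
    "email_enc",
    F.expr(f"base64(aes_encrypt(email, '{key}', 'GCM'))"),
).drop("email")

encrypted.show(truncate=False)
```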
I like that it can handle multiple tasks in parallel. I also like the automation feature. JavaScript also helps with the library's parallel streaming.
The logging for the observability platform could be better.
I have known about this technology for a long time, maybe about three years.
Because my area is data analytics and analytics solutions, I use BigQuery, SQL, and ETL. I also use Dataproc and DataFlow.
We sometimes use an integrator, but recently we put together a team to support the infrastructure requirements, because the proof of concept is self-administered.
I would recommend Apache Spark to new users, but it depends on the use case. Sometimes, it's not the best solution.
On a scale from one to ten, I would give Apache Spark a ten.
We use this solution for information gathering and processing.
I use it myself when I am developing on my laptop.
I am currently using an on-premises deployment model. However, in a few weeks, I will be using the EMR version on the cloud.
The most valuable feature of this solution is its capacity for processing large amounts of data.
This solution makes it easy to do a lot of things. It's easy to read data, process it, save it, etc.
When you first start using this solution, it is common to run into memory errors when you are dealing with large amounts of data. Once you are experienced, it is easier and more stable.
When you are trying to do something outside of the normal requirements in a typical project, it is difficult to find somebody with experience.
This solution is difficult for users who are just beginning, and they experience out-of-memory errors when dealing with large amounts of data.
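For beginners hitting those memory errors, a minimal sketch of the settings that usually come up first; the values are illustrative and depend entirely on the cluster:

```python
# Minimal sketch: memory settings commonly tuned to avoid OOM errors
# on large datasets. Values are illustrative; size them to your cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-data-job")
    .config("spark.executor.memory", "8g")          # JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap headroom
    .config("spark.sql.shuffle.partitions", "400")  # smaller shuffle blocks
    .getOrCreate()
)
```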
I have not been in contact with technical support. I find all of the answers that I need in the forums.
The work that we are doing with this solution is quite common and is very easy to do.
My advice for anybody who is implementing this solution is to look at their needs and then look at the community. Normally, there are a lot of people who have already done what you need. So, even without experience, it is quite simple to do a lot of things.
I would rate this solution a nine out of ten.
We use it for building big data platforms that process huge volumes of data. Streaming data is also critical for us.
It provides a scalable machine learning library so that we can train and predict user behavior for promotion purposes.
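As a rough sketch of what that looks like with MLlib; the schema, features, and label here are invented for illustration, not the reviewer's actual model:

```python
# Minimal sketch: training a behavior-prediction model with Spark MLlib.
# The data and columns are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("promo-model").getOrCreate()

df = spark.createDataFrame(
    [(3, 10.0, 1), (1, 2.5, 0), (7, 30.0, 1), (0, 0.0, 0)],
    ["visits", "spend", "responded"],
)

# Assemble raw columns into the feature vector MLlib expects.
features = VectorAssembler(
    inputCols=["visits", "spend"], outputCol="features"
).transform(df)

model = LogisticRegression(labelCol="responded").fit(features)
model.transform(features).select("responded", "prediction").show()
```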
Machine learning, real-time streaming, and data processing are fantastic, as is the resilient, fault-tolerant design.
I would suggest supporting more programming languages, and also providing an internal scheduler to schedule Spark jobs, with monitoring capability.
Spark is relatively easy to deploy, with rich features for handling big data. Spark Core, Spark SQL, and Spark MLlib are used most in our applications.
I use Spark to process large amounts of data in the energy industry.
A good tool to analyse Spark application performance would be welcome. Right now there are still many parameters to tune to get good performance out of a Spark application; I would like to see auto-tuning of parameters.
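Newer Spark releases do cover part of this wish: adaptive query execution (Spark 3.0+) re-optimizes shuffle partitioning and join strategies at runtime. A minimal sketch of enabling it:

```python
# Minimal sketch: enabling adaptive query execution (Spark >= 3.0),
# which auto-tunes shuffle partitions and join strategies at runtime.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)
```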
I've been using Spark for seven months.
There were no issues with the deployment.
I ran into Spark application performance issues. For instance, Spark JDBC write performance needs to be improved.
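The two knobs that usually matter most for JDBC write throughput are the number of partitions (parallel connections) and the batch size. A minimal sketch, with placeholder connection details and assuming the JDBC driver is on the classpath:

```python
# Minimal sketch: JDBC write with tuned parallelism and batch size.
# URL and credentials are placeholders; the driver must be on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write").getOrCreate()
df = spark.range(100000).withColumnRenamed("id", "value")

(df.repartition(8)                       # 8 parallel JDBC connections
   .write
   .format("jdbc")
   .option("url", "jdbc:postgresql://host:5432/db")
   .option("dbtable", "target_table")
   .option("user", "user")
   .option("password", "password")
   .option("batchsize", "10000")         # rows per batch insert
   .mode("append")
   .save())
```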
There were no issues with the scalability.
I use the Apache open-source version; we handle everything on our own.
Technical Support: We use the Apache open-source version, so we handle support on our own.
I evaluated a Hadoop-based solution and chose Spark due to its fast processing and ease of use.
The initial setup is not complex. The online documents are pretty good.
I implemented it in-house.
Get to know how Spark works and what jobs, stages, tasks, and the DAG are; it will help you write Spark applications, as the sketch below shows.
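A minimal, invented example of those concepts in action: transformations are lazy, an action triggers a job, and Spark splits the job into stages at shuffle boundaries:

```python
# Minimal sketch: jobs, stages, and the DAG. Transformations are lazy;
# an action triggers a job, split into stages at shuffle boundaries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000))               # lazy: no job yet
pairs = rdd.map(lambda x: (x % 10, x))          # still lazy
summed = pairs.reduceByKey(lambda a, b: a + b)  # shuffle -> stage boundary

print(summed.count())  # action: runs one job with two stages
```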
Our use case for Apache Spark was a retail price prediction project. We were using retail pricing data to build predictive models. To start, the prices were analyzed and we created the dataset to be visualized using Tableau. We then used a visualization tool to create dashboards and graphical reports to showcase the predictive modeling data.
Apache Spark was used to host this entire project.
The processing time is very much improved over the data warehouse solution that we were using.
The most valuable features are the storage engine, the memory engine, and the processing engine.
I would like to see integration with data science platforms to optimize the processing capability for these tasks.
I have been using Apache Spark for the past year.
We have not been in contact with technical support.
The initial setup is straightforward. It took us around one week to set it up, and then the requirements and creation of the project flow and design needed to be done. The design stage took three to four weeks, so in total, it required between four and five weeks to set up.
I would rate this solution an eight out of ten.
Organisations can now harness richer data sets and benefit from use cases that add value to their business functions.
Distributed in-memory processing. Some of the algorithms are resource-heavy, and executing them requires a lot of RAM and CPU. With Hadoop-related technologies, we can distribute the workload across multiple commodity machines.
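A minimal sketch of what that distributed in-memory processing looks like in practice (sizes are illustrative): caching keeps partitions in executor memory across the cluster, so repeated passes avoid recomputing from source:

```python
# Minimal sketch: distributed in-memory processing via cache().
# Partitions are held in executor memory across the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(10_000_000).selectExpr("id", "id % 100 AS bucket")
df.cache()                                    # materialized on first action

print(df.count())                             # computes and caches
print(df.groupBy("bucket").count().count())   # reuses cached partitions
```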
Include more machine learning algorithms and the ability to handle true streaming of data, as opposed to micro-batch processing.
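For context, a minimal sketch of the micro-batch model the review refers to, using Structured Streaming's built-in test source (the sink and trigger interval are illustrative):

```python
# Minimal sketch: Structured Streaming processes data in micro-batches.
# The "rate" source just generates test rows; sink/trigger are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", "10")
          .load())

query = (stream.writeStream
         .format("console")
         .trigger(processingTime="5 seconds")  # micro-batch interval
         .start())

query.awaitTermination(30)  # run for ~30 seconds, then exit
```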
At times, when users do not know how to use Spark and request a lot of resources, the underlying JVMs can crash, which is a big worry.
No issues.
We primarily use the solution for security analytics.
The scalability has been the most valuable aspect of the solution.
The management tools could use improvement. Some of the debugging tools need some work as well. They need to be more descriptive.
The 2.3 version is quite stable. All of our customers use it; there are more than 100,000 users, and it runs 24/7.
The scalability is very good.
You actually buy Cloudera along with it; you don't really get any support otherwise, and you do need support.
In previous companies, we used the MySQL platform and solutions like ArcSight and Splunk. We switched for scalability; MySQL wasn't going to scale, and we don't use Splunk at this company.
The initial setup was complex. It is a complex tool, and a lot depends on how you will use it. There is a lot to set up; you need to put together a lot of scripts, nearly 60. In the cloud, setup takes about a day. On-premises, on your own hardware, it takes about a week.
I would rate this solution eight out of ten.