What is our primary use case?
We are using this solution to run large analytics queries and to prepare datasets for machine learning with SparkML, working in PySpark.
We ran multiple clusters configured for a minimum of three and a maximum of nine nodes, each with 16 GB of RAM.
For one ad hoc requirement, a 32-node cluster was required.
Databricks clusters were set to autoscale and to time out after forty minutes of inactivity. Multiple users attached their notebooks to a shared cluster; when a workload required different libraries, a dedicated cluster was spun up for that user.
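For reference, the autoscaling and idle-timeout behaviour described above maps onto a Databricks cluster specification along these lines (a sketch in the Clusters API payload format; the cluster name, node type, and Spark version are placeholders, not our actual configuration):

```json
{
  "cluster_name": "analytics-shared",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 3,
    "max_workers": 9
  },
  "autotermination_minutes": 40
}
```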
How has it helped my organization?
Databricks took care of all the underlying cluster management seamlessly. We could configure our clusters to run and deliver results without any delays due to hardware configuration or installation issues.
Databricks allowed us to go from non-existent insights (because the datasets were just too large) to immediate and rich insights once the datasets were ingested into our PySpark notebooks.
What is most valuable?
The most valuable aspect is the immense ease of running very large-scale analytics, through a convenient and slick UI. This saved us from tweaking and tuning clusters, diving into lower-level abstractions, getting involved in hardware procurement, and waiting for other workloads to finish.
The built-in optimization recommendations halved the run time of our queries and allowed us to reach decision points and deliver insights very quickly.
The Delta data format proved excellent. Databricks had already done the heavy lifting of optimizing the format for large-scale interactive querying, which saved us a lot of time.
What needs improvement?
The product could be improved by expanding its visualization capabilities, which currently only assist development inside the notebook environment. Perhaps a few connectors that auto-deploy to a reporting server?
More parallelized Machine Learning libraries would be excellent for predictive analytics algorithms.
For how long have I used the solution?
I have been using this solution for three years.
What do I think about the stability of the solution?
This solution is stable and proved very robust. The clusters only required restarting when obvious programming recommendations were not followed, causing memory overruns on the driver.
What do I think about the scalability of the solution?
It is seamlessly and massively scalable, limited only by budget. The product itself also offers real-time efficiency and optimization recommendations.
How are customer service and support?
So brilliant, it was never required. Their documentation is comprehensive, clear, and simple.
Which solution did I use previously and why did I switch?
Previously, I used Hive and Livy in Zeppelin on an in-house Hadoop installation. Queries constantly threw exceptions and timed out, and the configuration changes needed to address this proved time-consuming and problematic. Databricks, on the other hand, simply made all those problems vanish.
How was the initial setup?
Setup and support are effectively single-click.
What about the implementation team?
We used an in-house team for implementation.
What was our ROI?
Our ROI was on the order of USD 75k per year for one deployment. We were able to switch our workloads from an on-site Hadoop cluster, billed to our department at more than USD 100k per year, to a Databricks workspace in the cloud at a quarter of that expenditure.
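The arithmetic behind that figure is straightforward (using the on-site bill of USD 100k per year as a lower bound):

```python
# Annual cost comparison from the figures above (USD).
onsite_hadoop_cost = 100_000                # billed to the department per year
databricks_cost = onsite_hadoop_cost / 4    # "a quarter of that expenditure"
annual_savings = onsite_hadoop_cost - databricks_cost
print(annual_savings)  # 75000.0
```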
Further, we were able to scale our queries transparently and efficiently, bringing each major analytics use case under fifteen minutes, where the in-house Hadoop cluster had subjected us to unstable queries and highly brittle data flows.
We are further reducing spending on our traditional RDBMS solution by offloading reporting workloads to Databricks PySpark notebooks, which cuts expensive datacenter resources and frees the RDBMS for OLTP loads.
What's my experience with pricing, setup cost, and licensing?
You can set up a cluster in your cloud of choice, but Databricks' own managed service may also prove very competitive, as its pricing units are built in.
I would counsel against on-site licensing, as on-site hardware issues tend to seriously delay and slow down delivery.
Which other solutions did I evaluate?
I evaluated Hortonworks, Livy, and Zeppelin. These were unsuitable due to the unavailability of sufficiently skilled personnel.
What other advice do I have?
Invest in people skilled in data querying, Python coding, and even basic data science, and a Databricks setup will reward the business.
Once the Databricks data flows are established, it is a matter of a few incremental steps to open up streaming and run up-to-the-minute queries, allowing the business to build out its data-driven processes.
Databricks continues to advance the state-of-the-art and will be my go-to choice for mission-critical PySpark and ML workflows.
Which deployment model are you using for this solution?
Public Cloud
Disclosure: I am a real user, and this review is based on my own experience and opinions.