Senior Data Engineer at a computer software company with 1,001-5,000 employees
Real User
Top 10
2024-11-06T10:14:00Z
Nov 6, 2024
Performance could be improved. It is crucial to check coding, configure Spark correctly, implement caching, and monitor performance metrics to enhance performance.
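As an illustration of the steps this review lists (configuring Spark, caching, and monitoring), here is a minimal sketch in Python. The helper names (`build_spark_conf`, `apply_conf`, `cache_and_count`, `timed`) are hypothetical and the values are assumptions, not recommendations from the reviewer. The two configuration keys are standard Spark SQL settings, and `cache()` and `count()` are standard DataFrame methods.

```python
import time
from typing import Any, Callable, Dict, Tuple


def build_spark_conf(shuffle_partitions: int = 200, adaptive: bool = True) -> Dict[str, str]:
    """Return Spark settings as strings (values here are illustrative assumptions)."""
    if shuffle_partitions <= 0:
        raise ValueError("shuffle_partitions must be positive")
    return {
        # Number of partitions used when shuffling data for joins and aggregations.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        # Adaptive query execution re-optimizes plans using runtime statistics.
        "spark.sql.adaptive.enabled": str(adaptive).lower(),
    }


def apply_conf(builder: Any, conf: Dict[str, str]) -> Any:
    """Apply each setting through the builder's config(key, value) method."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def cache_and_count(df: Any) -> int:
    """Mark a DataFrame for caching, then run count() so the cache is materialized."""
    return df.cache().count()


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed seconds) as a crude performance metric."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start
```

With PySpark installed, `apply_conf(SparkSession.builder, build_spark_conf(64)).getOrCreate()` would start a session with these settings, and `timed(lambda: cache_and_count(df))` would report how long the first cached pass takes. On a managed platform, some settings may instead need to be set at the cluster level.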
Data Platform Architect at a tech services company with 51-200 employees
Real User
Top 20
2024-07-12T13:55:30Z
Jul 12, 2024
The product could be improved regarding the delay when switching to higher-performing virtual machines compared to other platforms like Snowflake. The ease and speed of managing clusters can also be enhanced, especially when scaling up resources. They could add more advanced data storage solutions like Iceberg and Delta files.
Financial Analyst 4 (Supply Chain & Financial Analytics) at Juniper Networks
MSP
Top 5
2024-03-28T09:56:00Z
Mar 28, 2024
Databricks should have more collaborative features than it has. It should have some more customization for the jobs. Also, it has an average dashboarding tool. They could bring in advanced features so we don't depend on other BI tools to build a dashboard. We are using Tableau to create dashboards. If Databricks had more advanced features, we could use Databricks entirely.
There is room for improvement in the documentation of processes and how it works. I was trying to get one of the certifications, so I saw an area of improvement there.
Databricks has added some alerts and query functionality into their SQL persona, but the whole SQL persona, which is like a role, needs a lot of development. The alerts are not very flexible, and the query interface itself is not as polished as the notebook interface that is used through the data science and machine learning persona. It is clunky at present.
Databricks' performance when serving the data to an analytics tool isn't as good as Snowflake's. In the next release, Databricks should include a better data-sharing platform to facilitate data sharing between companies.
Microsoft Azure has its learning environment on the Microsoft website. We can complete certifications, but the Databricks certification is more expensive than Microsoft's. It costs between $2,000 and $2,500, and the knowledge is linked. Users are also charged even when they don't want to analyze large amounts of data. Hence, we would like free capacity for student users so that people can learn and build their professional skills.
Principal at a computer software company with 5,001-10,000 employees
Real User
2022-12-16T18:28:24Z
Dec 16, 2022
I have had some issues with some of the Spark clusters running on Databricks, where the Spark runtime and clusters go up and down, which is an area for improvement. Still, I am generally unaware of any super-critical issues.
Tech Lead Consultant | Manager Data Engineering at Ekimetrics
Real User
2022-11-07T12:27:39Z
Nov 7, 2022
I would love an integration in my desktop IDE. For now, I have to code on their webpage. They provide a web interface for writing my code. However, I have my local software for coding on other projects, yet I cannot use it for Databricks, and I lose all my shortcuts. I lose all the benefits of my local IDE. If one day they would provide some integration with VS Code, for example, that would be game-changing. Having Databricks in my VS Code would be the most amazing feature.
Head of Business Integration and Architecture at Jakala
Real User
2022-10-21T13:43:56Z
Oct 21, 2022
The data visualization for this solution could be improved. They have started to roll out a data visualization tool inside Databricks but it is in the early stages. It's not comparable to a solution like Power BI, Luca, or Tableau. In a future release, we would like to have a better ETL designer tool to assist in the way we move data from one place to another.
CI/CD needs additional leverage and support. Community forums are helpful for gaining knowledge but the solution should provide specific documentation. Streaming services such as Flink should be amplified and better supported. There are not many connectors to MapReduce.
Vice President at a tech services company with 51-200 employees
Real User
2022-09-06T08:03:58Z
Sep 6, 2022
I'm struggling a little because I wanted to do some POC solutions. I present a lot of projects in various forums and seminars and there aren't a lot of credits and trial options with Databricks. Even if we want to explore, we're not able to and that's a challenge. The solution is quite expensive.
Associate Principal - Data Engineering at LTI - Larsen & Toubro Infotech
Real User
2022-07-17T09:50:00Z
Jul 17, 2022
Every tool has room for improvement. Normally what happens, a solution will claim it can do ETL and everything else, but you encounter some limitations when you actually start. Then you keep on interacting with the vendor, and they continue to upgrade it. For example, we haven't fully implemented Databricks Unity Catalog, a newly introduced feature. We need to check how it works and then accordingly, there can be improvements in that also. Databricks may not be as easy to use as other tools, but if you simplify a tool too much, it won't have the flexibility to go in-depth. Databricks is completely in the programmer's hands. I prefer flexibility rather than simplicity.
Support for Microsoft technology and compatibility with the .NET framework are somewhat missing. There should be reliability between these two. Databricks is based on open source. If it's more synchronous between the Microsoft technology and the programming languages, it'll be better. Python has better languages, but compatibility would be a great help. I would like to have better support for Microsoft technology and better language components. With Azure or Cosmos DB, I can store other data links or time series data tables. That would be a great help for analytics in real time.
Manager, Customer Journey at a retailer with 10,001+ employees
Real User
2022-05-18T14:11:55Z
May 18, 2022
I would like it if Databricks adopted an interface more like R Studio. When I create a data frame or a table, R Studio provides a preview of the data. In R Studio, I can see that it created a table with so many columns or rows. Then I can click on it and open a preview of that data. Because I work in analytics and not data engineering, I think that's probably the biggest one. There are better graphical tools, so I don't think Databricks can compete. You can do a simple graph, and it's not that great. However, I don't think they can ever stack up to Tableau, so it's probably not worth it to improve upon that.
Director - Data Engineering expert at Sankir Technologies
Real User
2022-03-18T16:14:27Z
Mar 18, 2022
If I want to create a Databricks account, I need to have a prior cloud account such as an AWS account or an Azure account. Only then can I create a Databricks account on the cloud. However, if they can make it so that I can still try Databricks even if I don't have a cloud account on AWS and Azure, it would be great. That is, it would be nice if it were possible to create a pseudo account and be provided with a free trial. It is very essential to creating a workforce on Databricks. For example, students or corporate staff can then explore and learn Databricks. It's a big ask to have people jump through a lot of hoops to get approval to create a Databricks cluster just to explore it, but if they can try it on their own with a free trial without an underlying cloud account it would be more convenient. Documentation can be improved as well. There are so many versions of documents. For example, when I tried to create a DBU vault and secrets file, I had to go through multiple versions of documents. This could be improved so that the documentation is easy to use.
Data governance should be addressed. We have some trouble connecting all the governance solutions with Databricks. This means the integrative capabilities are problematic. The initial setup is difficult.
Machine Learning Engineer at a mining and metals company with 10,001+ employees
Real User
2021-11-03T23:41:00Z
Nov 3, 2021
The interface of Databricks could be easier to use when compared to other solutions. It is not easy for non-data scientists. The user interface is important: before, we had to write code manually, and as solutions move to "No code AI" it is critical that the interface is very good.
Technical Architect at a tech services company with 10,001+ employees
Real User
2021-11-01T19:58:00Z
Nov 1, 2021
One area for improvement would be that anyone who doesn't know SQL may find the product difficult to work with. It would also be useful to have a remote support team inside Databricks, which would collect and analyze user feedback.
Practice Head, Data & Analytics at Tech Mahindra Limited
Real User
Top 10
2021-08-20T11:25:20Z
Aug 20, 2021
In my view, the fundamental approach of implementing Databricks is still very code heavy, more than you find in Azure Data Factory and other technologies like Informatica or SQL Server Integration Service. From my perspective, that could be improved. I'd also like to have the ability to facilitate predictive analytics within the solution.
Advanced Analytics Lead at a pharma/biotech company with 1,001-5,000 employees
Real User
2021-07-28T11:58:58Z
Jul 28, 2021
The solution could improve by providing better automation capabilities. For example, working together with more of a DevOps approach, such as continuous integration. There is a lot of code from places, such as GitHub, but it is not tailored for Databricks. It requires a lot of effort to bring the code to a level where it can be used with Databricks capabilities.
Lead Data Architect at a government with 1,001-5,000 employees
Real User
2021-04-21T14:10:02Z
Apr 21, 2021
The product is quite ambitious. It's trying to become a centralized platform for all data ingestion, transformation, and analytics needs. It may encounter stiff competition from best-of-breed solutions powered by open-source software. Overall it's a good product; however, it might get challenged over time by individual best-of-breed products. For example, in the area of data science, RStudio seems to be the industry standard at the moment. The RStudio IDE is richer, and there are more out-of-the-box functionalities like push-button publishing. It's more difficult to run R within Databricks. Especially when it comes to synchronizing the R packages, it lags behind. It's not even supporting the latest version of R 1.3. I believe eventually all analytics will converge into data science. The analytics of the future will be data science, because predicting the future will be one of the most prevalent use cases. The stuff we used to do before (slicing and dicing, drilling through, trend analysis, etc.) will become redundant operations after the analytics toolsets become powered by AI/ML and fully automated. Unless organisations acquire platforms that can cater for machine learning and artificial intelligence, including natural language processing, they will have a hard time surviving. With Databricks I would like to see more integration with and accommodation of open-source products. This could be controversial, as it could question the whole configuration and the purpose of the product. I'm pretty sure Microsoft is trying to position it in a monopoly market as they did with Windows and MS Office so that they could charge a premium. We are beginning to see a similar product strategy behind Databricks.
Chief Data-strategist and Director at Theworkshop.es
Real User
Top 10
2021-04-16T14:25:06Z
Apr 16, 2021
The solution works very well for us. I can't recall any missing features or anything the solution really lacks. It's very complete. It would help if there were different versions of the solution on offer. The integration of data could be a bit better.
Head of Data & Analytics at a tech services company with 11-50 employees
Real User
2020-12-08T10:26:21Z
Dec 8, 2020
There is definitely room for improvement. This is the type of solution where you need to have people with technical expertise to use it. Other products are self-service and can be employed by end-users. Databricks is not geared towards the end-user, but rather it is for data engineers or data scientists. I'm not sure whether Databricks is working towards it, or not. It would be nice if it were more user-friendly, where you don't have to rely on Power BI or a visualization tool. I know that there is integration in the notebook where you can do it, but still, the relationships and semantics make it more difficult. It would be better to do it right in Databricks. You could put them within the portal and I don't have to log out and bring that into Power BI and then visualize.
Data Scientist at a retailer with 5,001-10,000 employees
Real User
2020-11-02T23:28:50Z
Nov 2, 2020
Since the Databricks community is not that old, there is not a lot of information about some of the issues that we face. We have to go back to the Databricks stream to get some of the issue resolutions from there. As time passes, and more people start putting more information out there about this technology, it will be helpful. I think even with the features that we currently have, they're still optimizing some of the clusters and trying to parallelize to better read from other types of data. So, that's going really well in terms of one of the features that they recently came up with to include the data format for data, which was really good, and that speeds up a lot of the processes. I would like to see more documentation in terms of how an end-user could use it, so that users like me can easily try it and implement use cases.
Data Architect at a tech services company with 201-500 employees
Real User
2020-09-27T04:10:00Z
Sep 27, 2020
Sometimes we experience issues connecting our database to Databricks. There are no direct connectors — they are very limited. This should be addressed and corrected in the next release. Reading past data can also be tricky as there is no data spectrum like you would find with Snowflake and other solutions.
Chief Research Officer at a consumer goods company with 1,001-5,000 employees
Real User
2020-08-02T08:16:42Z
Aug 2, 2020
I'd like to see more licensing options for the solution, the availability of additional pricing tiers. I understand it's not easy to achieve because it's a kind of platform-as-a-service type of solution. If you wanted to be more specific about the parts, and what you might or might not need, then you could save some money, and go for a lower level. Of course, that would then mean you'd have to manage more configurations which, as a user, would make things more complex but it would be good to have that option. The pricing is not the cheapest but it's understandable because it's a very high-end solution and easy to use, there's a lot of complexity masked away. I would like to see additional monitoring tools and, in general, anything that can improve visualization of data. I know it's not the main point of Databricks and there are other tools that can be used, but anything that facilitates the integration of Databricks with visualization tools could be really useful. Increasing data scalability would also be great.
Instead of relying on a massive instance, the solution should offer micro partition levels. They're working on it, however, they need to implement it to help the solution run more effectively. They're currently coming out with a new feature, which is Date Lake. It will come with a new layer of data compliance.
Data Scientist at a energy/utilities company with 10,001+ employees
Real User
2020-02-09T08:17:00Z
Feb 9, 2020
I think the automatic categorization of variables needs to be improved. The current functionality is not always efficiently identifying the features of the data that is collected. Probably that is the only thing I can think of. Apart from that, I have not explored the product enough yet to go into more depth because there is only one asset project that I have taken on right now. Because I own this company, I have been doing more to run it than to explore this product very deeply. But when you get any form of data inside there, if it could understand what type of variables there are and what features the data has, it would help massively in taking processing to the next step. If it does not exactly identify the variables you may have to modify them a little. Apart from working with Databricks to understand its capabilities, I am also trying to learn Apache Spark right now. Some members of my team want to work with Apache Spark as a solution and at this point, we are evaluating both and we are planning to use Spark or Databricks. As far as what might be added, some custom algorithm samples would be useful. All of the other products of this type — Azure, AWS, SageMaker — they all have customizable algorithms. You have the capability to implement a sort of workflow from that by modifying things in the sample and changing it to fit your purposes. Probably that is something that might help in doing some small NDP (Near-Data Processing) development. It might not help in the project directly, but it will help while we work on some NDP development of our own so that we can quickly evaluate how something is going to work. Templates or other samples could make working on things easier. That would also help massively in getting people to understand the potential of what the product can actually do. But I also think not many people would strongly agree with this. 
Many people go to the first solution they can think of that they know very well already in the IT field even if they could imagine that something could be better. To get the value out of this technology, people will need to come to accept it. Technical people will accept Databricks more if they understand that this is something that they can use and start working on without a lot of experience. Adopting it will take time for new users who have no experience. But to feel like they can have success with a product, they have to execute something in a very short time and see how it can work. When you talk about AI — or really when you talk about anything new — people do not initially want to invest the time in discovery. These processes do take time to learn, but with templates or samples, you get to see immediately what the possibilities are and what you might get out of it. Then when they try something of their own and are able to get it working in less than a week's time, they will be encouraged to look into the product and the technology some more.
Vice President, Business Intelligence and Analytics at NTT Data India Enterprise Application Services Pri
Real User
2020-02-05T08:05:00Z
Feb 5, 2020
Pricing is one of the things that could be improved. Also, there could be improvement in the visual analytics space there and on the machine learning functions. I haven't explored so I don't know about the functions and features that are there. If it is not there, then I think that's something which they should consider including.
Engineer at a tech services company with 10,001+ employees
Real User
2020-02-04T09:59:56Z
Feb 4, 2020
The management of the solution needs to be modernized. Managing the radius data is hard. The solution requires modern scoring. There's not a good way of knowing how the models are performing from a data science perspective. The solution needs more model scoring abilities. It doesn't necessarily need more model monitoring, but more model scoring and performance from a data science perspective. Databricks is an analytics platform. It should offer more data science. It should have more features for data scientists to work with.
It would be very helpful if Databricks could integrate with platforms in addition to Azure. Having an open-source version or having the option to get a trial version of Databricks would be very helpful. It would be very useful for beginners if there were tutorials and examples on how to write code for PySpark, R, or Scala. Having examples would give people something to refer to and play with.
Machine Learning Engineer at a tech vendor with 51-200 employees
Real User
2019-12-25T08:21:00Z
Dec 25, 2019
The solution could be improved by integrating it with data packets. Right now, the load tables provide a function, like team collaboration. Still, it's unclear as to if there's a function to create different branches and/or more branches. Our team had used data packets before, however, I feel it's difficult to integrate the current with the previous data packets. The support could be improved a bit around the database. When we stream it to Data Lake, some data cannot be loaded. It should be a priority to fix this.
Data Science Developer at a tech services company with 501-1,000 employees
Real User
2019-12-11T05:40:00Z
Dec 11, 2019
Databricks should have more libraries for predictive analysis and machine learning. It should have more compatible and more advanced visualization and machine learning libraries. As it is now, I have to try a custom algorithm in order for things to be compatible. I would like to see more deep learning analytics.
Business Intelligence and Analytics Consultant at a tech services company with 201-500 employees
Consultant
2019-12-09T10:58:00Z
Dec 9, 2019
Some of the error messages that we receive are too vague, saying things like "unknown exception", and these should be improved to make it easier for developers to debug problems. As it is now, we have to go into the driver logs to identify the error messages properly. There is not much information about Databricks available online, such as cost. Whenever we want to find the actual costing, we have to send an email to Databricks, so having the information available on the internet would be helpful. I would like to see integration with Power BI or Tableau for the business users. They may use Databricks to check on things, but it will be a little bit complicated for them. The GUI interfaces for Tableau and Power BI are ones that they are used to, so the integration would help.
Improvements could include the pricing; the product is a little expensive, although I think it is comparable to other similar options. The integration features could be more interesting, more involved. For example, we use the Databricks Notebook, which is not as great as Jupyter Notebook for providing a great user experience. The look and feel are not the same and we've had complaints from some of our users. They say that it's easier and more productive for them to use Jupyter Notebook. And then there is the integration feature for connecting to data sources, for example, Jupyter Notebook through publishes connect. The problem is that when you do that, you don't get all the Jupyter features, which is a shame for us. For additional features, having some PyTorch or TensorFlow type features inside would definitely be great. For now, my users are developing for themselves by importing their libraries into their Notebook and then creating models based on TensorFlow or PyTorch. That requires a lot of imports, particularly library imports, something that is now available in the new version of Machine Learning services. These things are very important because the self appliance community has shifted from the traditional way of preparing models to a deeper learning system. It's now more common to have those features.
Data Scientist at a computer software company with 501-1,000 employees
Real User
Top 10
2019-10-14T12:39:00Z
Oct 14, 2019
The product could be improved by offering an expansion of their visualization capabilities, which currently assists in development in their notebook environment. Perhaps a few connectors that auto-deploy to a reporting server? More parallelized Machine Learning libraries would be excellent for predictive analytics algorithms.
Databricks is utilized for advanced analytics, big data processing, machine learning models, ETL operations, data engineering, streaming analytics, and integrating multiple data sources.
Organizations leverage Databricks for predictive analysis, data pipelines, data science, and unifying data architectures. It is also used for consulting projects, financial reporting, and creating APIs. Industries like insurance, retail, manufacturing, and pharmaceuticals use Databricks for data management...
The biggest problem associated with the product is that it is quite pricey. We cannot find a better solution than Databricks in the market currently.
The product should provide more advanced features in future releases.
The product should incorporate more learning aspects. It needs to have a free trial version that the team can practice.
There is room for improvement in visualization.
Scalability is an area with certain shortcomings. The solution's scalability needs improvement.
The tool should improve its integration with other products.
The solution can be improved by expanding its integration capabilities and providing the ability to query external vendors directly.
The area in which this product can be improved is optimization. In the next release, I would like to see more optimization features.
Databricks can improve by making the documentation better.
Databricks could improve in some of its functionality.
The query plan is not easy to work with at the Databricks job level. If I want to tune any of the code, guidance is not easily available in the blogs either.
Support for Microsoft technology and the compatibility with the .NET framework is somewhat missing. There should be reliability between these two. Databricks is based on open sources. If it's more synchronous between the Microsoft technology and the programming languages, it'll be better. Python has better languages, but compatibility would be a great help. I would like to have better support for Microsoft technology and better language components. With Azure or Cosmo DB, I can store other data links or time series data tables. That would be a great help for analytics in real time.
I would like it if Databricks adopted an interface more like R Studio. When I create a data frame or a table, R Studio provides a preview of the data. In R Studio, I can see that it created a table with so many columns or rows. Then I can click on it and open a preview of that data. Because I work in analytics and not data engineering, I think that's probably the biggest one. There are better graphical tools, so I don't think Databricks can compete. You can do a simple graph, and it's not that great. However, I don't think they can ever stack up to Tableau, so it's probably not worth it to improve upon that.
If I want to create a Databricks account, I need to have a prior cloud account such as an AWS account or an Azure account. Only then can I create a Databricks account on the cloud. However, if they can make it so that I can still try Databricks even if I don't have a cloud account on AWS and Azure, it would be great. That is, it would be nice if it were possible to create a pseudo account and be provided with a free trial. It is very essential to creating a workforce on Databricks. For example, students or corporate staff can then explore and learn Databricks. It's a big ask to have people jump through a lot of hoops to get approval to create a Databricks cluster just to explore it, but if they can try it on their own with a free trial without an underlying cloud account it would be more convenient. Documentation can be improved as well. There are so many versions of documents. For example, when I tried to create a DBU vault and secrets file, I had to go through multiple versions of documents. This could be improved so that the documentation is easy to use.
Data governance should be addressed. We have some trouble connecting all the governance solutions with Databricks. This means the integration capabilities are problematic. The initial setup is difficult.
The interface of Databricks could be easier to use when compared to other solutions. It is not easy for non-data scientists. The user interface is important: before, we had to write code manually, and as solutions move to "no-code AI" it is critical that the interface is very good.
One area for improvement would be that anyone who doesn't know SQL may find the product difficult to work with. It would also be useful to have a remote support team inside Databricks, which would collect and analyze user feedback.
In my view, the fundamental approach of implementing Databricks is still very code-heavy, more than you find in Azure Data Factory and other technologies like Informatica or SQL Server Integration Services. From my perspective, that could be improved. I'd also like to have the ability to facilitate predictive analytics within the solution.
The solution could improve by providing better automation capabilities, for example, by working with more of a DevOps approach, such as continuous integration. There is a lot of code available from places such as GitHub, but it is not tailored for Databricks. It requires a lot of effort to bring the code to a level where it can be used with Databricks capabilities.
There should be better integration with other platforms.
The product is quite ambitious. It's trying to become a centralized platform for all data ingestion, transformation, and analytics needs. It may encounter stiff competition from best-of-breed solutions powered by open-source software. Overall it's a good product; however, it might get challenged over time by individual best-of-breed products. For example, in the area of data science, RStudio seems to be the industry standard at the moment. The RStudio IDE is richer, and there is more out-of-the-box functionality, like push-button publishing. It's more difficult to run R within Databricks. Especially when it comes to synchronizing the R packages, it lags behind. It's not even supporting the latest version of R 1.3. I believe eventually all analytics will converge into data science. The analytics of the future will be data science, because predicting the future will be one of the most prevalent use cases. The stuff we used to do before, slicing and dicing, drilling through, trend analysis, etc., will become redundant once the analytics toolsets are powered by AI/ML and fully automated. Unless organisations acquire platforms that can cater for machine learning and artificial intelligence, including natural language processing, they will have a hard time surviving. With Databricks I would like to see more integration with and accommodation of open-source products. This could be controversial, as it could question the whole configuration and the purpose of the product. I'm pretty sure Microsoft is trying to position it in a monopoly market, as they did with Windows and MS Office, so that they can charge a premium. We are beginning to see a similar product strategy behind Databricks.
The solution works very well for us. I can't recall any missing features or anything the solution really lacks. It's very complete. It would help if there were different versions of the solution on offer. The integration of data could be a bit better.
The user experience can be improved. It's not easy to use, and they need a better UI.
Databricks requires writing code in Python or SQL, so if you're a good programmer then you can use Databricks.
Costs can quickly add up if you don't plan for them.
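As a rough illustration of how usage drives the bill, the sketch below estimates a monthly cost from cluster hours. It assumes the usual split between a Databricks charge per DBU (Databricks Unit) and a separate cloud charge for the underlying virtual machines. The function name and every rate in it are made-up placeholders, not real prices.

```python
def estimate_monthly_cost(dbu_per_hour, usd_per_dbu, vm_usd_per_hour,
                          hours_per_day, days=30):
    """Rough monthly estimate: DBU charge plus cloud VM charge.

    All inputs are hypothetical placeholders; check the actual rates
    for your cloud, tier, and workload type.
    """
    hours = hours_per_day * days
    dbu_cost = dbu_per_hour * usd_per_dbu * hours  # Databricks charge
    vm_cost = vm_usd_per_hour * hours              # cloud provider charge
    return dbu_cost + vm_cost


# Placeholder rates: a cluster consuming 4 DBU/hour at $0.50 per DBU,
# running on VMs that cost $2.00/hour in total.
business_hours = estimate_monthly_cost(4.0, 0.5, 2.0, hours_per_day=8)
always_on = estimate_monthly_cost(4.0, 0.5, 2.0, hours_per_day=24)

print(f"8 h/day:  ${business_hours:,.2f}")
print(f"24 h/day: ${always_on:,.2f}")
```

With these placeholder numbers, leaving the same cluster running around the clock triples the estimate, which is why settings such as auto-termination are worth planning up front.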
There is definitely room for improvement. This is the type of solution where you need to have people with technical expertise to use it. Other products are self-service and can be employed by end-users. Databricks is not geared towards the end-user, but rather it is for data engineers or data scientists. I'm not sure whether Databricks is working towards it or not. It would be nice if it were more user-friendly, so that you don't have to rely on Power BI or another visualization tool. I know that there is integration in the notebook where you can do it, but still, the relationships and semantics make it more difficult. It would be better to do it right in Databricks. They could put it within the portal, so I wouldn't have to log out, bring the data into Power BI, and then visualize it.
Since the Databricks community is not that old, there is not a lot of information about some of the issues that we face. We have to go back to the Databricks stream to get some of the issue resolutions from there. As time passes, and more people start putting more information out there about this technology, it will be helpful. I think even with the features that we currently have, they're still optimizing some of the clusters and trying to parallelize to better read from other types of data. So, that's going really well in terms of one of the features that they recently came up with to include the data format for data, which was really good, and that speeds up a lot of the processes. I would like to see more documentation in terms of how an end-user could use it, so that users like me can easily try it and implement use cases.
I think we are using a lot of people to manage this solution. I'd like to see the people using this solution sharing their knowledge.
Sometimes we experience issues connecting our database to Databricks. Direct connectors are either missing or very limited. This should be addressed and corrected in the next release. Reading past data can also be tricky as there is no data spectrum like you would find with Snowflake and other solutions.
I'd like to see more licensing options for the solution, such as additional pricing tiers. I understand it's not easy to achieve because it's a kind of platform-as-a-service type of solution. If you could be more specific about the parts, and what you might or might not need, then you could save some money and go for a lower level. Of course, that would then mean you'd have to manage more configurations which, as a user, would make things more complex, but it would be good to have that option. The pricing is not the cheapest, but it's understandable because it's a very high-end solution that is easy to use, with a lot of complexity masked away. I would like to see additional monitoring tools and, in general, anything that can improve visualization of data. I know it's not the main point of Databricks and there are other tools that can be used, but anything that facilitates the integration of Databricks with visualization tools could be really useful. Increasing data scalability would also be great.
Instead of relying on a massive instance, the solution should offer micro partition levels. They're working on it; however, they need to implement it to help the solution run more effectively. They're currently coming out with a new feature, which is Delta Lake. It will come with a new layer of data compliance.
I have seen better user interfaces, so that is something that can be improved. It was quite hard to deploy.
I think the automatic categorization of variables needs to be improved. The current functionality is not always efficiently identifying the features of the data that is collected. Probably that is the only thing I can think of. Apart from that, I have not explored the product enough yet to go into more depth because there is only one asset project that I have taken on right now. Because I own this company, I have been doing more to run it than to explore this product very deeply. But when you get any form of data inside there, if it could understand what type of variables there are and what features the data has, it would help massively in taking processing to the next step. If it does not exactly identify the variables you may have to modify them a little. Apart from working with Databricks to understand its capabilities, I am also trying to learn Apache Spark right now. Some members of my team want to work with Apache Spark as a solution and at this point, we are evaluating both and we are planning to use Spark or Databricks. As far as what might be added, some custom algorithm samples would be useful. All of the other products of this type — Azure, AWS, SageMaker — they all have customizable algorithms. You have the capability to implement a sort of workflow from that by modifying things in the sample and changing it to fit your purposes. Probably that is something that might help in doing some small NDP (Near-Data Processing) development. It might not help in the project directly, but it will help while we work on some NDP development of our own so that we can quickly evaluate how something is going to work. Templates or other samples could make working on things easier. That would also help massively in getting people to understand the potential of what the product can actually do. But I also think not many people would strongly agree with this. 
Many people go to the first solution they can think of that they know very well already in the IT field even if they could imagine that something could be better. To get the value out of this technology, people will need to come to accept it. Technical people will accept Databricks more if they understand that this is something that they can use and start working on without a lot of experience. Adopting it will take time for new users who have no experience. But to feel like they can have success with a product, they have to execute something in a very short time and see how it can work. When you talk about AI — or really when you talk about anything new — people do not initially want to invest the time in discovery. These processes do take time to learn, but with templates or samples, you get to see immediately what the possibilities are and what you might get out of it. Then when they try something of their own and are able to get it working in less than a week's time, they will be encouraged to look into the product and the technology some more.
Pricing is one of the things that could be improved. Also, there could be improvement in the visual analytics space and in the machine learning functions. I haven't explored them, so I don't know which functions and features are there. If they are not there, then I think that's something they should consider including.
The management of the solution needs to be modernized. Managing the radius data is hard. The solution requires model scoring. There's not a good way of knowing how the models are performing from a data science perspective. The solution needs more model scoring abilities. It doesn't necessarily need more model monitoring, but more model scoring and performance from a data science perspective. Databricks is an analytics platform. It should offer more data science. It should have more features for data scientists to work with.
It would be very helpful if Databricks could integrate with platforms in addition to Azure. Having an open-source version or having the option to get a trial version of Databricks would be very helpful. It would be very useful for beginners if there were tutorials and examples on how to write code for PySpark, R, or Scala. Having examples would give people something to refer to and play with.
The solution could be improved by integrating it with data packets. Right now, the load tables provide a function, like team collaboration. Still, it's unclear whether there's a function to create different branches and/or more branches. Our team had used data packets before; however, I feel it's difficult to integrate the current ones with the previous data packets. The support could be improved a bit around the database. When we stream it to Data Lake, some data cannot be loaded. It should be a priority to fix this.
Databricks should have more libraries for predictive analysis and machine learning. It should have more compatible and more advanced visualization and machine learning libraries. As it is now, I have to try a custom algorithm in order for things to be compatible. I would like to see more deep learning analytics.
Some of the error messages that we receive are too vague, saying things like "unknown exception", and these should be improved to make it easier for developers to debug problems. As it is now, we have to go into the driver logs to identify the error messages properly. There is not much information about Databricks available online, such as cost. Whenever we want to find the actual costing, we have to send an email to Databricks, so having the information available on the internet would be helpful. I would like to see integration with Power BI or Tableau for the business users. They may use Databricks to check on things, but it will be a little bit complicated for them. The GUI interfaces for Tableau and Power BI are ones that they are used to, so the integration would help.
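To illustrate the point about vague errors, here is a minimal Python sketch of one way a developer might surface the underlying cause of a generic exception before resorting to the driver logs. It relies only on Python's standard exception chaining (`__cause__` and `__context__`). The `load_table` function, the path, and the messages are hypothetical stand-ins, not Databricks APIs or real Databricks output.

```python
def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ / __context__ links to the innermost exception."""
    seen = set()
    current = exc
    while id(current) not in seen:  # guard against cyclic chains
        seen.add(id(current))
        nxt = current.__cause__ or current.__context__
        if nxt is None:
            break
        current = nxt
    return current


def describe(exc: BaseException) -> str:
    """Combine the outer (vague) message with the root cause."""
    root = root_cause(exc)
    return (f"{type(exc).__name__}: {exc} "
            f"(root cause: {type(root).__name__}: {root})")


def load_table(path: str) -> None:
    # Hypothetical stand-in for a job step that hides the real failure
    # behind a generic message.
    try:
        raise FileNotFoundError(f"no such path: {path}")
    except FileNotFoundError as err:
        raise RuntimeError("unknown exception") from err


try:
    load_table("/mnt/example/missing")
except RuntimeError as exc:
    message = describe(exc)
    print(message)
```

This only helps when the original exception is still attached to the one that reaches the notebook; if the platform discards it, the driver logs remain the only source.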
Improvements could include the pricing. The product is a little expensive, although I think it is comparable to other similar options. The integration features could be more interesting, more involved. For example, we use the Databricks Notebook, which is not as great as Jupyter Notebook for providing a great user experience. The look and feel are not the same and we've had complaints from some of our users. They say that it's easier and more productive for them to use Jupyter Notebook. And then there is the integration feature for connecting to data sources, for example, Jupyter Notebook through publishes connect. The problem is that when you do that, you don't get all the Jupyter features, which is a shame for us. For additional features, having some PyTorch or TensorFlow type features inside would definitely be great. For now, my users are developing for themselves by importing their libraries into their Notebook and then creating models based on TensorFlow or PyTorch. That requires a lot of imports, particularly library imports, something that is now available in the new version of Machine Learning services. These things are very important because the data science community has shifted from the traditional way of preparing models to deep learning. It's now more common to have those features.
The product could be improved by offering an expansion of its visualization capabilities, which currently assist in development in the notebook environment. Perhaps a few connectors that auto-deploy to a reporting server? More parallelized Machine Learning libraries would be excellent for predictive analytics algorithms.