I provide product management and subject-matter expert (SME) services to oil companies as a consultant. My company has partnered with SparkCognition to bundle its products into a package of services that I provide to my customers. For the most part, when I'm working with SparkCognition, and Darwin in particular, I'm working with it on behalf of one of my customers.
We do different kinds of engagements. We've done PoC projects with customers using versions 1.4 and onward.
The biggest use case we've seen is automatic classification of data streaming in from oil and gas operations, whether exploration or production. We see customers using it to quickly and intelligently classify that data. Traditionally, that would be done either through very complicated branching code, which is difficult to troubleshoot, or manually, by SMEs or people in the office who know how to interpret the data and classify it for analytics.
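To make the contrast concrete, here is a minimal sketch in Python. The sensor names, thresholds, and labels are invented for illustration only, and a tiny nearest-centroid classifier stands in for Darwin's actual automated model building; the point is simply that the branching rules are hand-maintained while the model's decision boundary is learned from labelled examples an SME could assemble.

```python
# Hypothetical illustration: classifying a reading as "normal" or "anomalous".
# All field names and thresholds are invented for this sketch.

# Traditional approach: hand-written branching rules, hard to troubleshoot.
def classify_by_rules(pressure, temperature, flow):
    if pressure > 900:
        if temperature > 80:
            return "anomalous"
        if flow < 10:
            return "anomalous"
    elif pressure < 200 and flow > 50:
        return "anomalous"
    return "normal"

# ML approach: learn the decision boundary from labelled history.
# A nearest-centroid classifier is the simplest possible stand-in.
def train_centroids(examples):
    """examples: list of (features, label). Returns {label: centroid}."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify_by_model(centroids, features):
    """Return the label whose centroid is closest to the reading."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Labelled history an SME could put together from past operations.
history = [
    ([850.0, 60.0, 40.0], "normal"),
    ([870.0, 65.0, 38.0], "normal"),
    ([950.0, 90.0, 5.0], "anomalous"),
    ([940.0, 85.0, 8.0], "anomalous"),
]
centroids = train_centroids(history)
print(classify_by_model(centroids, [860.0, 62.0, 39.0]))  # prints "normal"
print(classify_by_model(centroids, [945.0, 88.0, 6.0]))   # prints "anomalous"
```

The rule version has to be re-edited by a programmer whenever field conditions change; the learned version is updated by retraining on new labelled data, which is the workflow described below.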
The customers have looked at using machine learning for that, but they run into challenges — and this is really what Darwin is all about. Typically there is an SME who can look at the data and properly classify it or identify problems, but taking what he knows and what he does instinctively and communicating it to a data scientist who could build a model for that is a very difficult process. Additionally, data scientists are in very high demand, so they're expensive.
SMEs can look at data and quickly make interpretations. They've probably been looking at the data for 10 or 15 years. So it's not a matter of just, "Oh, we can plunk this SME beside a data scientist and in a couple of months they can turn out a model that does this." First, SMEs don't have time to be pulled away from their normal workload to educate the data scientists. And second, even if they do, you end up with something very rigid.
With Darwin, customers can empower the SMEs to build the models themselves without having to go through the process of educating the data scientists, who may leave next week for a better paying job.
Most of the projects that we've done, PoCs, are typically run in the cloud, for ease of use. Because we work in the oil and gas space, public cloud is the preferred option in the U.S., thanks to simpler administration and somewhat lower cost. Overseas, the customers we've talked to have noted there are laws and restrictions that require their data to stay on-premises. We've talked to potential customers about it, but we haven't actually done an on-premises project so far.
The automated AI model-building reduces the time that projects take. Before I started working with SparkCognition, I worked on several projects where it took months, and in some cases years on complex problems, for data scientists to even pick a machine-learning model to use. They might settle on a methodology such as random forest after quite a bit of analysis. Whenever a model is completed, it is a powerful and unique solution that can't be replicated with traditional programming, but it's almost impossible to tune in the field. Additionally, if you're talking oil and gas, some of the sites where you need to run these models, especially on the edge, are very remote. If the model doesn't respond the way you want, you have to take it back to a data scientist and have it tuned.
Darwin lets you rerun the process with new data, with more data, with different tuning parameters. You don't have any of that back-and-forth with a physical person.
The solution has created the opportunity for machine learning to be practically implemented in places where it couldn't be implemented before. The current way that machine learning problems are implemented is with data scientists, usually as IT initiatives or R&D initiatives. Often, a company will say, "Okay, we're going to do machine learning." They have a big initiative, then hire some very expensive data scientists and they create a model that may be 10 or 15 percent better than what's out there. The challenge is that the model exists in MATLAB or Python but it's not integrated into the business systems like an ERP; or it's not integrated into their industrial control system. It ends up being a really cool PoC and it never turns into something that practically affects the business.
Darwin has opened up the places where you can do that.
The potential we see with Darwin, with the REST API and with the easy-to-approach interface, is that we can empower these SMEs to build things and interface them from day one with the existing control systems and other systems the business is using. So they're not stepping out of their traditional workflows to use machine learning. It's integrated.
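As a sketch of what that integration pattern looks like, the snippet below packages a batch of sensor readings as a JSON scoring request a control system might send each polling cycle. The URL, payload fields, and response shape are all hypothetical, invented for this illustration; the real endpoint contract would come from the vendor's API reference.

```python
# Hypothetical sketch of calling a deployed model's REST scoring endpoint
# from an existing control-system workflow. Endpoint, fields, and response
# shape are invented; consult the actual API documentation for the real ones.
import json
import urllib.request

SCORING_URL = "https://darwin.example.com/v1/models/event-detector/predict"  # hypothetical

def build_request(readings, api_key):
    """Package a batch of sensor readings as a JSON POST request."""
    body = json.dumps({"instances": readings}).encode("utf-8")
    return urllib.request.Request(
        SCORING_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder credential
        },
        method="POST",
    )

def score(readings, api_key):
    """Send readings for scoring and return the predicted labels."""
    req = build_request(readings, api_key)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())["predictions"]

# Example payload a control system might send on each polling cycle.
req = build_request(
    [{"pressure": 945.0, "temperature": 88.0, "flow": 6.0}],
    api_key="demo-key",
)
print(req.get_method(), req.full_url)
```

Because the model sits behind a plain HTTP endpoint, the control system or ERP only needs a scheduled POST call and a handler for the response, rather than embedding a MATLAB or Python model directly.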
As far as building models goes, it is significantly faster and dramatically cheaper than hiring people with ML experience. Once you've done some initial training in Darwin, you can build models that would take a data scientist, working with an SME, two or three months to build. With Darwin the SME can do that in a few days or less. For a lot of applications, especially in oil and gas, the savings are huge when it comes to practical applications of machine learning, compared with the tradition of having data scientists build them one by one.
For our customers we have primarily looked at use cases around automatic event detection. We hadn't even tried to do that with data scientists because it just wasn't practical with the timeline, and because the costs were too high. As for solving that problem with traditional software methods, the customer estimated it would have taken three to four months of development. We were able to build a model that provided effectively the same results within a week, and a lot of that week was just working through data and data-quality issues. The actual model building in Darwin took a few hours.