Our use case: planning server sizing as we move servers to the cloud. We also use it as a substitute for VMware DRS. It does a much better job of leveling compute workload across an ESX cluster, and we have far fewer issues with CPU ready queues and the like. It is simply a more sophisticated modeling tool for leveling VMs across an ESX infrastructure.
It is hosted on-prem, but we're looking at their SaaS offering for reporting. We do some reporting with Power BI on-prem, and the solution is deployed to servers that we have in Azure and on-prem.
The proactive monitoring of all our open enrollment applications has improved our organization. We have used it to size applications that we are moving to the cloud, so when we move them, they are appropriately sized. We use it for reporting to current application owners, showing them where they are wasting money. Some things are easy to find for an application, e.g., a team decommissioned a server but never took care of the storage. Without a tool like this, that storage would just sit there forever, with us getting billed for it.
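As an illustration of that kind of waste (this is not how Turbonomic finds it), a short Python sketch using the azure-mgmt-compute SDK can list managed disks that are no longer attached to any VM; the subscription ID is a placeholder.

    # Illustrative sketch only: list Azure managed disks not attached to any VM.
    # Not Turbonomic's method; the subscription ID below is a placeholder.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    subscription_id = "<your-subscription-id>"  # placeholder
    client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

    # A managed disk with no managed_by reference is unattached but still billed.
    for disk in client.disks.list():
        if disk.managed_by is None:
            print(f"Unattached disk: {disk.name}, {disk.disk_size_gb} GB, {disk.sku.name}")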
The solution handles applications, virtualization, cloud, on-prem compute, storage, and network in our environment: everything except containers, because they are still in an initial experimentation phase for us. The only production apps we have that use containers are a couple of vendor apps. Nothing we have developed that's in use is containerized yet. We are headed in that direction; we are just a little behind the curve.
Turbonomic understands the resource relationships at each of these layers (applications, virtualization, cloud, on-prem compute, storage, and network in our environment) and the risks to performance for each. It gives you a picture across the board of how those resources interact with each other and which ones are important. It's not looking at one aspect of performance; instead, it is looking at 20 to 30 different things to make its recommendations.
It provides a proactive approach to avoiding performance degradation. It looks at the trends and at when a server is going to run out of capacity. Our monitoring tools tell us when CPU or memory has been at 90 percent for 10 minutes; however, at that point, depending on the situation, we may be out of time. Turbonomic points out, "Hey, in three weeks, you're not going to be looking good here. You need to add this stuff in advance."
We are notifying people in advance that they will have a problem as opposed to them opening tickets for a problem.
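As a rough illustration of that kind of trend-based forecast (not Turbonomic's actual model), the following Python sketch fits a line to recent utilization samples and estimates how many days remain before a chosen threshold is crossed; the threshold and sample data are invented for the example.

    # Illustrative only: project when utilization will cross a capacity threshold.
    # A simple least-squares trend line, not Turbonomic's actual analysis.
    def days_until_threshold(samples, threshold=90.0):
        """samples: list of (day_index, utilization_percent) pairs, oldest first."""
        xs = [x for x, _ in samples]
        ys = [y for _, y in samples]
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        # Slope of the fitted line = daily growth in utilization.
        slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / \
                sum((x - mean_x) ** 2 for x in xs)
        if slope <= 0:
            return None  # flat or shrinking utilization; nothing to forecast
        return (threshold - ys[-1]) / slope  # days until the threshold is crossed

    # Hypothetical history: memory utilization growing ~0.7 points/day from 75 percent.
    history = [(d, 75 + 0.7 * d) for d in range(14)]
    print(days_until_threshold(history))  # about 8.4 days of headroom left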
We have response-time SLAs for our applications. They are all different; it just depends on the application. Turbonomic has improved our ability to meet those SLAs by catching performance problems before they start to occur. We are getting proactive notifications. If we have a sizing problem, and growth trended over time shows that we're going to run out of capacity, we act on it rather than waiting for the application team to open a ticket that says, "Hey, we're seeing latency in the application. Let's get 30 people on a bridge to research the latency." The bridge never happens and the 30 people never get on it, because we proactively added capacity before it ever got to that point.
Turbonomic has saved human resource time and cost involved in monitoring and optimizing our estate. When we have a problem, we are willing to pay a little extra for infrastructure and to pull a lot more people than we will probably need onto our bridge to research it, rather than getting the obvious team on, having them call two more, and letting the problem stretch out. We tend to ring the dinner bell and everybody comes running; then people go away as they prove it's not their issue. So you could easily end up with 30 to 40 people on every bridge for a brief period of time, and those man-hours rack up fast. Anything we can do to avoid that type of troubleshooting saves us a lot of money. Even more importantly, it keeps us productive on the other projects we're working on, rather than asking at the end of the month, "We're behind on these three projects. How could that have happened?" Well, "Remember that major problem with application ABC? Fifty people sat on a bridge for three days, 20 hours a day, trying to resolve it."
In some cases, you completely avoid the situation. A lot of our apps are really complex, and a simple resource add to a server in advance might save us from a ripple effect later. For example, a major application (call it application A) gets its data by calling an API in application B. Eighty percent of the data it asks for lives in application B, but the other 20 percent lives in application C, so application B makes another API call to get that data, adds it to its own, and sends it all back to application A. A minor performance problem in application C can then cause an outage in application A, and those problems are a nightmare to diagnose, especially if the relationships aren't documented well. It is very difficult to quantify the savings, but if we can avoid problems like that, then the savings are big.
We are using monitoring and thresholds to assure application performance. That is great, but by the time our monitoring tools are alerting, we already have a problem in a lot of cases, though not always. The way we have things set up, we get warnings when resource utilization reaches 80 percent, because we try to keep it at 70 percent, and we get alerts, which is kind of an "Oh no," when applications reach 90 percent, though we can still do something about it at that point. The problem is that there are so many alerts in such a huge environment, and with so much work going on, they get ignored. So systems can work their way into the 90s, and you end up in a critical state a lot more often. That's why the proactive monitoring of all our open enrollment applications is really beneficial to us.