We use it in a few different ways:
- For general monitoring of operating systems.
- Leveraging some of its customization options, specifically to build application monitoring.
- Some external site-to-site monitoring in various places, ensuring that our websites and externally facing pieces are available over an Internet connection.
It has given us a clearer view into our environment because it can look in and pull things out of Event Viewer or log files. We have been able to build dashboards and drill down on things, which has improved our time to respond. Also, when specific conditions are met in a given log, we can get in and take a look a lot faster, rather than having to connect, parse through the log, and figure it out ourselves. It flags the condition and moves us toward a solution faster than normal.
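To give a sense of the kind of check it handles for us, here is a rough sketch in Python of flagging a specific condition in a log file. The log path and error pattern are made up for illustration; the actual rules live in the tool itself.

```python
import re
from pathlib import Path

# Hypothetical log path and error pattern -- stand-ins for whatever
# condition the monitoring rule is actually watching for.
LOG_FILE = Path("/var/log/app/service.log")
PATTERN = re.compile(r"ERROR .*connection pool exhausted", re.IGNORECASE)

def find_matches(log_file: Path, pattern: re.Pattern) -> list[str]:
    """Return the log lines that match the alert condition."""
    hits = []
    with log_file.open(errors="replace") as fh:
        for line in fh:
            if pattern.search(line):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    matches = find_matches(LOG_FILE, PATTERN)
    if matches:
        # In the real tool this is where an alert would fire;
        # here we just print what was found.
        print(f"Condition met: {len(matches)} matching line(s)")
        for line in matches[-5:]:
            print(" ", line)
```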
We have defined a few custom data sources, especially for our application. The solution can leverage a specific data source and build monitoring around it, rather than having it be part of the general monitoring. It is segmented and customized for what we actually need, which has been pretty helpful.
Custom data sources have given us a bit more information from both a point-in-time and a historical viewpoint. In the console, it is easy to compare week-over-week or month-over-month traffic and numbers. As changes are made in the environment, we can look back with better historical knowledge and say, "We started seeing this spike three months ago and this is the change we made," or, "We started seeing CPU usage drop after the last patch or software update." It lets us compare and get better insight into the environment over a longer period, rather than just at a point in time, when investigating an issue.
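As a rough illustration of what one of our custom data sources does, here is a minimal Python sketch that reads application state and emits metrics as key=value pairs. The status file, metric names, and output format are hypothetical; the real data source is defined inside the product.

```python
import json
from pathlib import Path

# Hypothetical application status file -- our real data source reads
# application-specific state; this stand-in just parses a JSON status dump.
STATUS_FILE = Path("/opt/ourapp/status.json")

def collect_metrics(status_file: Path) -> dict[str, float]:
    """Collect a couple of application-level metrics for the data source."""
    status = json.loads(status_file.read_text())
    return {
        "queue_depth": float(status.get("queue_depth", 0)),
        "active_sessions": float(status.get("active_sessions", 0)),
    }

if __name__ == "__main__":
    # Emit key=value pairs, one per line, for the collector to ingest.
    for name, value in collect_metrics(STATUS_FILE).items():
        print(f"{name}={value}")
```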
The solution has allowed us to have specific alerting for specific messages. If we know that message X on a notification means a particular state has happened, we can set that to be either an email notification or a tracking notification. In cases where a log entry means we have a specific issue, we can have it send an email and let us know. Thus, we have a better, faster response. We also have an integration with PagerDuty, which allows us to make things very specific as to the level of intervention and the timing of that intervention. It has been nice to be able to customize that down to even a message type and timing metric.
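The PagerDuty side is handled by the built-in integration, but for illustration, triggering an incident through PagerDuty's Events API v2 looks roughly like this. The routing key, host name, and message are placeholders.

```python
import json
import urllib.request

# Placeholder routing key -- in practice this comes from the PagerDuty
# service integration, and the monitoring tool sends the event for us.
ROUTING_KEY = "YOUR_PAGERDUTY_ROUTING_KEY"

def trigger_incident(summary: str, source: str, severity: str = "error") -> None:
    """Send a trigger event to PagerDuty's Events API v2."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    req = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read().decode())

if __name__ == "__main__":
    trigger_incident(
        summary="Specific log message detected on app server",
        source="app01.example.com",
    )
```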
The solution’s ability to alert us if the cloud loses contact with the on-prem collectors has been helpful, e.g., if we are having an issue with our Internet connection, or in some of our less monitored environments, such as our lower environments in different data centers where we don't have as heavy monitoring. It's helpful to have that external check there, whereas in our production environments, which are heavily monitored, we are typically intervening before anything times out and reports a lost connection. This way, we know via a page or email if there is any sort of latency or timing issue with a collector connecting to the cloud. It's helpful that the monitoring isn't just relying on the Internet connection at our site; the cloud side can see into our environment and flags connectivity or timeout issues.
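Conceptually, this is just a heartbeat check on the cloud side. Here is a minimal sketch of that idea; the collector names, check-in times, and five-minute threshold are made up, not how the vendor actually implements it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical last check-in times per collector -- in reality the cloud
# side tracks these; the names and threshold are invented for illustration.
LAST_SEEN = {
    "collector-prod-dc1": datetime.now(timezone.utc) - timedelta(minutes=1),
    "collector-lower-dc2": datetime.now(timezone.utc) - timedelta(minutes=12),
}
TIMEOUT = timedelta(minutes=5)

def stale_collectors(last_seen: dict, timeout: timedelta) -> list[str]:
    """Return collectors that have not checked in within the timeout."""
    now = datetime.now(timezone.utc)
    return [name for name, seen in last_seen.items() if now - seen > timeout]

if __name__ == "__main__":
    for name in stale_collectors(LAST_SEEN, TIMEOUT):
        # This is the point where a page or email would go out.
        print(f"ALERT: lost contact with {name}")
```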
We use it for anomaly detection because our software is designed to function in a specific way. Anomaly detection is helpful when there are issues that may not be breaking the software but have it running in a nonstandard way; we can be alerted and notified so we can jump on the issue. Whether the issue is fixed in the moment or handed off to development to find a solution, it's helpful to have that view into how the software is running over the long term.
It is a pretty robust solution. There are a lot of customizations you can put in for what you want it to be checking, viewing, and alerting on. As we get alerts and realize that something isn't an issue we need to be alerted on, or that it happens to be normal behavior, a lot of that information can be fed back into the system to say, "Alright, this may look like an anomaly, but it isn't." Therefore, we can customize it so it gets smarter as it goes, and we're really only being notified for actual issues rather than suspected issues.
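The general idea, sketched very roughly below, is a deviation check with a feedback loop for behavior we've marked as normal. This is not the vendor's algorithm; the metric history, threshold, and suppression label are invented for illustration.

```python
from statistics import mean, stdev

# Made-up metric history and threshold, purely to illustrate flagging
# nonstandard behavior while suppressing known-normal patterns.
HISTORY = [102, 98, 101, 99, 97, 103, 100, 250]  # last value is the spike
Z_THRESHOLD = 3.0
KNOWN_NORMAL = {"weekly_batch_spike"}  # labels we've told the system to ignore

def is_anomaly(history: list[float], label: str = "") -> bool:
    """Flag the latest sample if it deviates strongly from recent behavior."""
    if label in KNOWN_NORMAL:
        return False  # feedback loop: marked as expected, so don't alert
    baseline, latest = history[:-1], history[-1]
    sigma = stdev(baseline)
    if sigma == 0:
        return latest != mean(baseline)
    return abs(latest - mean(baseline)) / sigma > Z_THRESHOLD

if __name__ == "__main__":
    print("Anomaly detected" if is_anomaly(HISTORY) else "Within normal range")
```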
It's been helpful to have information to pass along to development that's very specific as to what the issues are. E.g., we can see an anomaly during certain periods while a process is running, then pass that along so development can figure out, "Is it a database issue, an application issue, or possibly a DNS-level issue?" They also determine whether there are further things that need to be dug into or if it's something that can be fixed with a code change.
The solution’s automated and agentless discovery, deployment, and configuration seems to work pretty well for standard pieces, like Windows servers and standard hardware. It has been able to find and add those pieces in. Normally, if I'm running into an issue with finding something, it's because a plugin or piece is missing and needs to be added manually. However, 99 percent of the time, it finds things automatically without a problem.
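As a rough picture of what agentless discovery is doing, here is a minimal sketch that sweeps a subnet for hosts answering on a single TCP port. A real discovery pass probes many more protocols; the subnet and port here are made up.

```python
import socket
from ipaddress import ip_network

# Hypothetical subnet and port -- a real discovery pass probes many
# protocols (SNMP, WMI, SSH, ...); this sketch only checks one TCP port.
SUBNET = "192.168.10.0/28"
PORT = 135  # a common Windows RPC port, as an example
TIMEOUT = 0.5

def discover(subnet: str, port: int) -> list[str]:
    """Return hosts in the subnet answering on the given TCP port."""
    found = []
    for host in ip_network(subnet).hosts():
        try:
            with socket.create_connection((str(host), port), timeout=TIMEOUT):
                found.append(str(host))
        except OSError:
            pass  # host not reachable on this port
    return found

if __name__ == "__main__":
    for host in discover(SUBNET, PORT):
        print(f"Discovered: {host}")
```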