We have it integrated into our incident management system. We also have it integrated into a homegrown alerting and monitoring solution, where it does some automation and self-healing behind the scenes.
We are working on an email integration for our service desk, similar to how xMatters themselves have it set up.
It provides incident notifications, subscription notifications, etc.
We use it for triggered tasks or events. Whenever a high-priority ticket is created and assigned to a particular group, it automatically notifies whoever is on call for that group, which was really one of our first use cases for it.
We saw the value in being able to import everyone's schedule into one central repository and have one tool for all the operational teams, or any team for that matter. It gave us the technology to find out who is on call. xMatters' integration with our incident management system was another key aspect, where we could say, "You can configure this when a high-priority ticket fires."
We had people who would say, "Oh, I didn't get that phone call," or, "I didn't hear that message." The logging within xMatters is pretty extensive, which has allowed us to confirm or refute claims like, "Hey, I didn't get that message." We can say, "It says right here in the log that you not only got it, but you answered it and hung up halfway through the message." That was a bit of a game changer for us because it gave us the ability to validate whether or not these messages were going out. This wasn't much of a problem previously, but it has been just another tool in our tool belt to confirm that everything is working as expected. It puts the onus on the engineering and development teams to respond when they are paged or notified.
I use the xMatters logs on the operational side; they are not really something the other teams use as much. We use them to make sure notifications are going out and being delivered successfully to individuals or teams. Occasionally, I get a call or request from someone on the apps teams saying, "Can you show me some of the reporting on how many times my team was notified or paged from xMatters since January?" Then, I will show them how to run those reports, and also get that data for them. They may be trying to justify additional headcount for next year, or something along those lines; for example, some teams seem to always get contacted more often than others. They are looking for anything they can use to show the volume of work or the number of times they are getting called.
I have some folks on our reliability engineering team who have taken advantage of xMatters, integrated it with a couple of our monitoring systems, and written some custom code to handle notifications. It can not only receive incident data from Jira, but also reverse that workflow and create incidents based on alerts triggered from external services. So, when an alert fires, the integration creates a Jira incident, notifies the team responsible for resolving that issue, and then records that team's acceptance or decline of the notification in the ticket. It essentially correlates those events. In a couple of cases, self-healing automation has even kicked off and run a script to recycle a server, cloud instance, or something similar automatically based on that alert. After that is done, it runs a validation check. If the service is responding as expected, it automatically closes out the ticket.
We have some escalation standards in place for technology that go back over 10 years, to before we had xMatters. Having a tool that keeps it all in context has helped. It does automatic escalation, so we bake that standard into every on-call team. It contacts the primary, waits 10 minutes and contacts the secondary, then waits another five minutes and contacts the manager, and finally waits five more minutes and contacts the director. That has been the standard for over a decade. In the past, it required a human to do the escalating, so a 10-minute wait might actually have been 12 minutes. Since it has been automated, escalation has been consistent. It knows, "My wait time's up. I will go on to the next person."
Recipients also have the option to decline. If anybody in the escalation path is unavailable, they can hit the "Decline" option, which bypasses the wait period. It knows, "Okay, I'm just going to go ahead and call this next person right away." That is not something we had with the manual process. We would need to call the person, wait, get their voicemail, and then wonder whether or not they were available. In some instances, it has expedited the escalation. The solution hasn't really moved the needle much on the technology here. It just streamlines things a little and makes a slight improvement on an existing process.
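To make the timing of that escalation standard concrete, here is a minimal Python sketch of the policy as described: contact the primary, escalate to the secondary after 10 minutes, then to the manager and the director after five minutes each, with a decline skipping the remaining wait. This is not xMatters' API or configuration format; the chain encoding, the function name `escalation_timeline`, the response labels, and the simplifying assumption that a decline arrives at the moment of contact are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of the escalation standard described above:
# (role, minutes to wait for a response before escalating to the next role).
ESCALATION_CHAIN: List[Tuple[str, int]] = [
    ("primary", 10),
    ("secondary", 5),
    ("manager", 5),
    ("director", 0),  # last step; nobody left to escalate to
]


def escalation_timeline(responses: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (minute, role) pairs for each contact attempt.

    `responses` maps a role to "accept", "decline", or "none" (no answer);
    roles not present count as "none". Simplifying assumption: a decline
    is received at the moment of contact, so the next role is contacted
    immediately instead of after the wait period.
    """
    timeline: List[Tuple[int, str]] = []
    minute = 0
    for role, wait in ESCALATION_CHAIN:
        timeline.append((minute, role))
        response = responses.get(role, "none")
        if response == "accept":
            break  # someone owns the incident; stop escalating
        if response == "decline":
            continue  # skip the wait and contact the next role right away
        minute += wait  # no answer: wait the full period, then escalate
    return timeline


if __name__ == "__main__":
    # Nobody answers: the director is reached at minute 20 (10 + 5 + 5).
    print(escalation_timeline({}))
    # Primary declines, secondary accepts: both contacted at minute 0.
    print(escalation_timeline({"primary": "decline", "secondary": "accept"}))
```

With no answers, the sketch reaches the director at minute 20, matching the 10 + 5 + 5 minutes of the standard; a decline collapses the corresponding wait to zero, which is the expedited escalation described above.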
We have incorporated xMatters into our application delivery workflows for notification purposes. When deployments are scheduled, in progress, or completed, we use notifications to let people know that something will be done, is being done, or has been done. It posts messages to various teams and channels based on the status of that deployment. We don't have it integrated into the pipeline itself.