I have been in my current role for the past 18 months, and we started using PagerDuty Operations Cloud earlier this year around January or February to manage our operations.
The primary use case for PagerDuty Operations Cloud is alerting. We switched so that alerts reliably reach the on-call engineer and none are missed. We have instrumented many alerts on it, including VPN downtime, transaction monitoring, success rate, and latency. We configured PagerDuty Operations Cloud so that if any of those metrics cross their thresholds, or if any of our SLOs and SLIs are breached, we can quickly take action and resolve the issue. For day-to-day use, we run a 24-hour shift rotation in which all shifts are entered into the system, and every on-call engineer receives alerts through PagerDuty Operations Cloud. Beyond alerting, we also use scheduling, incident management, and incident reports.
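To make the threshold-to-alert idea concrete, here is a minimal sketch of how a breached metric could be turned into a trigger event. The metric names, thresholds, service name, and routing key are made up for illustration and are not our real configuration. The event fields follow the general shape of a PagerDuty Events API v2 trigger event as I understand it; the sketch only builds the event and does not send anything.

```python
from dataclasses import dataclass


@dataclass
class SloTarget:
    name: str               # hypothetical metric name, e.g. "success_rate"
    threshold: float        # hypothetical threshold, not our real SLO
    higher_is_better: bool  # True for success rate, False for latency


def is_breached(target: SloTarget, observed: float) -> bool:
    """Return True when the observed value is on the wrong side of the threshold."""
    if target.higher_is_better:
        return observed < target.threshold
    return observed > target.threshold


def build_trigger_event(routing_key: str, target: SloTarget,
                        observed: float, source: str) -> dict:
    """Build (but do not send) a trigger event for a breached target."""
    return {
        "routing_key": routing_key,  # placeholder integration key
        "event_action": "trigger",
        "payload": {
            "summary": f"{target.name} breached: observed {observed}, "
                       f"threshold {target.threshold}",
            "source": source,
            "severity": "critical",
        },
    }


# Made-up sample readings for two of the metrics mentioned above.
targets = [
    SloTarget("success_rate", 0.99, higher_is_better=True),
    SloTarget("latency_p95_ms", 500.0, higher_is_better=False),
]
observed = {"success_rate": 0.97, "latency_p95_ms": 320.0}

events = [
    build_trigger_event("ROUTING_KEY_PLACEHOLDER", t, observed[t.name], "payments-api")
    for t in targets
    if is_breached(t, observed[t.name])
]
```

With these sample readings only the success rate is below its threshold, so `events` holds a single trigger event for it.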
The best features of PagerDuty Operations Cloud include alerting, which is very important and the main reason we retain it, and scheduling as well.
Initially, we used Excel to manage our on-call engineers' schedules. PagerDuty Operations Cloud shows when you are on duty and lets members of other teams check who is on duty without needing to ask. That visibility has significantly reduced the time spent finding out who is on call.
Scheduling with PagerDuty Operations Cloud has reduced confusion because we set it up with a round-robin rotation that nobody needs to maintain by hand. With Excel, we had to create a new schedule every two months. Now we only make changes when necessary, which makes the process more efficient and organized, and on-call engineers always know when they are on duty. The system also notifies them in advance of their upcoming shifts.
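The reason a round-robin rotation needs no regular upkeep is that the person on duty can be derived from the date alone. A minimal sketch of that idea follows; the names, start date, and shift length are hypothetical, and this is not how PagerDuty implements its schedules.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str],
                rotation_start: date, shift_days: int = 1) -> str:
    """Return who is on call on `day` in a simple round-robin rotation."""
    elapsed = (day - rotation_start).days
    if elapsed < 0:
        raise ValueError("day is before the rotation started")
    # Each engineer holds the shift for `shift_days`, then the next one takes over.
    return engineers[(elapsed // shift_days) % len(engineers)]


# Hypothetical team and start date.
engineers = ["Ada", "Bayo", "Chen"]
start = date(2024, 1, 1)

first_day = on_call_for(date(2024, 1, 1), engineers, start)
fifth_day = on_call_for(date(2024, 1, 5), engineers, start)
second_week = on_call_for(date(2024, 1, 8), engineers, start, shift_days=7)
```

Adding or removing a person only means changing the list once; every later date resolves without anyone rebuilding a spreadsheet.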
One area where PagerDuty Operations Cloud could improve is the scheduling feature, which can be tricky, especially with complex schedules. I have found it stressful to build schedules effectively, even after going through PagerDuty University and the forums. Sometimes I need to swap people manually because a minor change can disrupt the whole schedule. A more forgiving scheduling system or better guidance for complex schedules would help.
Another area for improvement is alerting. When multiple incidents occur simultaneously, it would be helpful if the alerts listed each issue separately instead of muddling them together. This would make it easier to see what needs urgent attention without missing anything.
When I first joined the company, we primarily used Grafana and Slack to manage incidents. The alerts arrived in Slack and the dashboards lived in Grafana, and incidents were logged in Jira, so we needed three different applications to handle a single incident.
With PagerDuty Operations Cloud now, we don't need to go through multiple tools to manage alerts and incidents. We don't need to go through Jira to log incidents. It streamlines the process, and with incident management, it can escalate to the next person so that alerts are rarely missed. It has made our workflow easier and much more efficient.
For incident management in my team, PagerDuty Operations Cloud has really helped by actively reaching out to the on-call engineer when an issue happens so that they don't miss it. First there is a pop-up notification in your browser or on your phone. If you miss it or don't acknowledge it in time, it sends a text or calls your phone. The calls are very persistent, and if the incident is still not acknowledged, it escalates to the next level, which can be your manager or your functional manager, and it keeps escalating until someone acknowledges it. This way, an alert is rarely missed, because at some point somebody will pick up.
PagerDuty Operations Cloud helps us effectively manage incidents without needing to sit down all day and watch our screens.
Alerting is key, and scheduling is also important but not as crucial as alerting. We also use incident management and incident reporting, which allow us to manage who should be escalated to during incidents and keep track of when incidents happen and when they are resolved so that everyone knows what occurred and how it was handled.
PagerDuty Operations Cloud has positively impacted my organization by making the way we work more effective and efficient, with less alert fatigue and fewer missed alerts. For example, if four or five alerts arrive in Slack at the same time, you might miss some of them while focusing on resolving the current issue. With PagerDuty Operations Cloud, since it calls for every issue, you will see each new alert and can resolve it, which reduces missed alerts and increases efficiency. This leads to better service for our end users, increased profit, and less pressure on engineers, making it a win-win for everybody.
Our MTTR has dropped significantly; however, I cannot provide specific numbers because with Slack we were not measuring it accurately. Now, with PagerDuty Operations Cloud, we can measure how long it takes to acknowledge alerts and to resolve issues, which gives us metrics to manage this effectively.
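For readers unfamiliar with these metrics, here is a small sketch of how mean time to acknowledge and mean time to resolve can be computed once each incident has trigger, acknowledge, and resolve timestamps. The timestamps are invented sample data, not our real numbers, and the definitions used here (measuring both from the trigger time) are an assumption rather than PagerDuty's exact reporting method.

```python
from datetime import datetime
from statistics import mean

# Invented sample incidents: (triggered, acknowledged, resolved).
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 4), datetime(2024, 3, 1, 10, 30)),
    (datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 14, 2), datetime(2024, 3, 2, 14, 50)),
]


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


# Mean time to acknowledge: trigger -> acknowledge.
mtta_minutes = mean(minutes_between(trig, ack) for trig, ack, _ in incidents)

# Mean time to resolve: trigger -> resolve.
mttr_minutes = mean(minutes_between(trig, res) for trig, _, res in incidents)
```

For this sample data the acknowledge times are 4 and 2 minutes and the resolve times are 30 and 50 minutes, giving an MTTA of 3 minutes and an MTTR of 40 minutes.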
My advice for others looking into using PagerDuty Operations Cloud is that if their workflow requires them to be alert to incidents while continuing their work without being tethered to a screen, it is a very helpful tool to have.
One additional thought about PagerDuty Operations Cloud is that if they started issuing certificates for completing courses on PagerDuty University, it would encourage more people to engage with the training, similar to how New Relic operates. Having a certificate would demonstrate rigorous training and the capability to apply what was learned. I would rate this product a 6 out of 10.