Datadog ingests all of our application metrics, resource metrics, APM/tracing data, and more.
We use it for dashboards, monitoring and alerting, SLO targets, and incident response.
We run a lot of applications across multiple languages and frameworks, deployed in Kubernetes across multiple AWS regions, along with underlying managed resources such as SQS and Aurora.
Datadog makes understanding the state of all of these seamless. We are a company with millions of daily active users, and the level of detail it gives us is excellent.
Datadog has allowed us to rapidly spin up alerting and monitoring that notifies our incident responders as soon as our SLOs are in danger and helps them resolve issues quickly.
It is the single most important tool we have from an SRE perspective.
It also provides us with an easy way to get information at a glance for all of our services through APM and create unified dashboards that track our underlying resources, such as databases, queues, etc., alongside application data.
It has been invaluable to our organization.
Managing SLOs and their related burn-rate monitors has allowed us to onboard teams to on-call quickly.
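For readers unfamiliar with burn-rate monitors, the arithmetic behind them can be sketched as follows. This is a minimal illustration of the general burn-rate definition (observed error ratio divided by the error budget), not our actual monitor configuration; the function names and the 30-day window are assumptions chosen for the example.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on target' the error budget is being spent.

    error_ratio: fraction of bad events in the lookback window (e.g. 0.0144).
    slo_target:  SLO target as a fraction (e.g. 0.999 for 99.9%).
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / error_budget


def hours_until_budget_exhausted(rate: float, slo_window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the current burn rate holds.

    The 30-day default window is an assumption for this example.
    """
    if rate <= 0:
        return float("inf")
    return slo_window_days * 24 / rate


if __name__ == "__main__":
    # A 99.9% SLO with 1.44% of requests failing burns budget at roughly 14.4x.
    rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget exhausted in about {hours_until_budget_exhausted(rate):.0f} hours")
```

A burn rate of 1 means the budget lasts exactly the SLO window; a monitor that fires on a high burn rate over a short lookback catches fast-moving incidents long before the budget is gone.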
Managing resources with infrastructure-as-code has been a recent game-changer for us. Combining SLO management with IaC has allowed us to give product teams a complete solution for attaching their applications to user-focused alerting and monitoring within days rather than months, and it has clearly improved our ability to discover and respond to significant production incidents.
Managing dashboards as IaC can be hard to work out at times. I use custom tools to convert JSON dashboards to Terraform resources. Ideally, I'd like a tool for this to be built into the app. For example, a templating system that can easily be exported to IaC would be transformative for us.
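To give a sense of what such a conversion involves, here is a minimal sketch, not our actual tooling. It assumes the Datadog Terraform provider's `datadog_dashboard_json` resource, which accepts the dashboard definition as a JSON string, and it writes Terraform's JSON syntax (a `.tf.json` file) rather than HCL. The helper name `dashboard_to_terraform` and the list of read-only fields to strip are assumptions for illustration.

```python
import json
import re
from typing import Optional

# Assumption: fields that an exported dashboard may carry but that should not
# be sent back when the dashboard is managed by Terraform.
READ_ONLY_FIELDS = {"id", "url", "author_handle", "author_name", "created_at", "modified_at"}


def dashboard_to_terraform(dashboard: dict, resource_name: Optional[str] = None) -> str:
    """Wrap an exported dashboard definition in a Terraform JSON document."""
    if resource_name is None:
        # Derive a Terraform-safe name from the dashboard title.
        title = str(dashboard.get("title", "dashboard")).lower()
        resource_name = re.sub(r"[^a-z0-9_]+", "_", title).strip("_")
    if not resource_name or resource_name[0].isdigit():
        # Terraform resource names must not be empty or start with a digit.
        resource_name = f"dashboard_{resource_name}"

    body = {k: v for k, v in dashboard.items() if k not in READ_ONLY_FIELDS}
    document = {
        "resource": {
            "datadog_dashboard_json": {
                resource_name: {"dashboard": json.dumps(body, sort_keys=True)}
            }
        }
    }
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    exported = {
        "id": "abc-123",  # hypothetical ID
        "title": "Checkout Service - Overview",  # hypothetical dashboard
        "layout_type": "ordered",
        "widgets": [],
    }
    print(dashboard_to_terraform(exported))
```

The output can be saved as something like `dashboards.tf.json` next to the rest of a Terraform configuration. Converting to fully typed dashboard resources in HCL takes considerably more work than this, which is why built-in export would help.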
Some aspects of the API can also be a bit verbose, especially around newer features like SLOs, and take some time to understand. That said, they're documented well enough that this is only a minor concern for us.
I've been using the solution for over five years.
I have never seen a major outage that prevented us from using Datadog, although I can't speak for other teams or time zones.
The product is massively scalable. I haven't seen any issues as we continue to onboard new technologies and teams.
Datadog provides us with a number of direct lines to support, although I haven't personally required their assistance.
We previously used LightStep for APM and switched to Datadog to unify all of our application data.
Most elements are quite simple to set up. However, some types of data collection require organization-wide engineering buy-in.
We handled the initial setup in-house.