What is our primary use case?
We use it to monitor and alert on our ECS instances as well as other AWS services, including DynamoDB and API Gateway.
We have it connected to PagerDuty for alerting across all our cloud applications.
We also use custom RUM monitoring and synthetic tests for both our internal and public-facing websites.
For our cloud applications, we use Datadog to define our SLOs and SLIs and to generate dashboards that track those SLOs for reporting to senior leadership.
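To make the SLO and SLI terminology concrete, here is a minimal sketch of the arithmetic behind an SLO report: the SLI is the fraction of good events, and the error budget is the share of bad events the target allows. The function and field names are hypothetical, chosen for illustration; this is not Datadog's API or how Datadog computes SLOs internally.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float                     # fraction of good events, 0.0 to 1.0
    target: float                  # SLO target, e.g. 0.999 for "three nines"
    error_budget_remaining: float  # fraction of the budget left; negative if overspent

    @property
    def met(self) -> bool:
        # The SLO is met when the measured SLI is at or above the target.
        return self.sli >= self.target


def evaluate_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Compute the SLI and remaining error budget for one reporting window."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    sli = good_events / total_events
    allowed_bad = (1.0 - target) * total_events  # bad events the SLO tolerates
    actual_bad = total_events - good_events
    remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, target=target, error_budget_remaining=remaining)


if __name__ == "__main__":
    report = evaluate_slo(good_events=9_995, total_events=10_000, target=0.999)
    print(f"SLI={report.sli:.4%} met={report.met} "
          f"budget left={report.error_budget_remaining:.0%}")
```

With 5 failed requests out of 10,000 against a 99.9% target, about half of the error budget is still available, which is the kind of figure a leadership dashboard would typically surface.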
How has it helped my organization?
Datadog has significantly improved our cloud-native monitoring. CloudWatch doesn't have enough features to build robust, sustainable dashboards that present aggregated information in one place for a combination of applications, databases, and other services, including our UI applications.
RUM monitoring is also something we didn't have before Datadog. We had Splunk, which was a lot harder to set up than Datadog's custom RUM metrics and its dashboards.
What is most valuable?
I really enjoy the RUM monitoring features of Datadog. It allows us to monitor user behavior in a way we couldn't before.
It's useful to be able to obfuscate sensitive information by setting up custom RUM actions and blocking the default ones with too much data.
I also like being able to generate custom metrics and monitors by adding facets to existing logging. Datadog can parse logs well for that purpose. The primary method of error detection for our external website is synthetic tests. This is extremely valuable for us as we have a large user base.
What needs improvement?
At times, it can be hard to generate metrics out of logs. I've seen some of those metrics break over time and produce flaky data.
Creating a monitor out of the metric and using it in a dashboard to generate our SLIs and SLOs has been hard, especially in cases where the data comes from nested logging facets.
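As an illustration of why nested facets are fragile, the sketch below counts JSON log lines by a value reached through a dotted path and tallies missing or unparseable lines separately, so a changed log shape shows up as a count instead of silently dropping data. It is a generic stand-in, not Datadog's log pipeline; the helper names and the `http.response.status` path are assumptions made for the example.

```python
import json
from collections import Counter
from typing import Any, Iterable, Optional


def get_nested(record: Any, path: str) -> Optional[Any]:
    """Walk a dotted path such as 'http.response.status'; return None if any step is absent."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def count_by_facet(log_lines: Iterable[str], path: str) -> Counter:
    """Count log lines by the value found at a nested path."""
    counts: Counter = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Keep malformed lines visible rather than discarding them.
            counts["<unparseable>"] += 1
            continue
        value = get_nested(record, path)
        counts["<missing>" if value is None else str(value)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"http": {"response": {"status": 500}}}',
        '{"http": {"response": {"status": 200}}}',
        '{"http": {}}',
        "not json",
    ]
    print(count_by_facet(sample, "http.response.status"))
```

A growing `<missing>` count is an early sign that the log structure changed upstream and that any metric or monitor built on that path may no longer be trustworthy.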
For how long have I used the solution?
I've used the solution for two years.
What do I think about the stability of the solution?
The stability is pretty good.
What do I think about the scalability of the solution?
The solution is pretty scalable! That said, it's hard to set up all the infra (Terraform code) required to connect Datadog private links to all of our different AWS accounts.
How are customer service and support?
They offer good support, and the team provides solutions when needed. For example, we had to delete all our RUM metrics when we accidentally logged sensitive data, and the CTO of Datadog stepped in to help out and prioritize it at the time.
Which solution did I use previously and why did I switch?
We previously used Splunk and some internal tools. We switched because some cloud applications don't integrate well with pre-existing solutions.
How was the initial setup?
The initial setup for connecting our different AWS accounts via Datadog private link wasn't great. A lot of duplicate Terraform had to be written. The dashboard setup is way easier.
What about the implementation team?
We installed it with the help of a vendor team.
What was our ROI?
Our return on investment is great and much better than what we got with CloudWatch. We can easily integrate with PagerDuty for alerting.
What's my experience with pricing, setup cost, and licensing?
Our company set up the product for us, so the engineers didn't need to be involved with pricing.
The pricing structure isn't very clear to engineers.
Which other solutions did I evaluate?
We looked into Splunk and some internal tools.
Which deployment model are you using for this solution?
Private Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: I am a real user, and this review is based on my own experience and opinions.