What is our primary use case?
The product monitors multiple systems, from customer interactions on our web applications down to the database and all layers in between. RUM, APM, logging, and infrastructure monitoring are all surfaced into single dashboards.
We initially started with application logs and generated long-term business metrics out of critical logs. We have turned those metrics and logs into a collection of alerts integrated into our pager system. As we have evolved, we have also used APM and RUM data to trigger additional alerts.
How has it helped my organization?
The solution has surfaced how integrated our applications really are and helps us track calls from the top down, identifying slowness and errors all through the call stack.
The biggest improvement we have seen is in our time to discovery and resolution. As Datadog has improved and we have added new features, the depth and clarity we get from top to bottom have been excellent. Our engineering teams have quickly adopted many features within Datadog and are quick to build out their own dashboards and alerts. This has also led to rapid sprawl when left unchecked.
What is most valuable?
We started with application logs and have expanded over the years to include infrastructure, APM, and now RUM. All of these tools have been incredibly valuable in their own sphere. The huge value is tying all of the data points together.
Logging was the first tool we started with years ago, replacing our ELK stack. It was the easiest to get in place, and our engineers quickly embraced the tools. Several critical dashboards were created years ago and are still in use today. Over time, we have shifted from verbose logs and matured into APM and RUM. That has helped us focus on fine-tuning the performance of our applications.
What needs improvement?
We need better visibility into our consumption rate, which is tied to our commit levels. We would love to see a percentage consumed and to be alerted when we are on track to go over budget, rather than finding out from an overage charge 20 days into the month.
The biggest complaint we hear is about the cost of the tool. It is pretty easy to accidentally consume a lot of extra data. Unless you watch everything come in almost daily, you could be in for a big surprise.
We utilize the Datadog estimated usage metrics to build out alerts and dashboards. The usage and cost system page still doesn't tie into our committed spending - it would be wonderful to see the monthly burn rate on any given day.
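To illustrate the kind of burn-rate check described here, the following is a minimal Python sketch, not anything Datadog provides. It assumes consumption is roughly linear across the month and that month-to-date usage and the commit are already known in the same unit. The function name `burn_rate` and the example figures (600 units consumed against a 1,000-unit commit on June 10) are hypothetical.

```python
import calendar
from datetime import date


def burn_rate(consumed: float, commit: float, today: date) -> dict:
    """Project month-end usage from month-to-date consumption.

    Assumes usage accrues linearly and that `consumed` covers
    day 1 through the end of `today` (both are simplifications).
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    elapsed_days = today.day

    pct_consumed = 100.0 * consumed / commit
    pct_elapsed = 100.0 * elapsed_days / days_in_month
    # Linear extrapolation: average daily usage times days in the month.
    projected = consumed / elapsed_days * days_in_month

    return {
        "pct_consumed": round(pct_consumed, 1),
        "pct_of_month_elapsed": round(pct_elapsed, 1),
        "projected_month_end": round(projected, 1),
        "over_budget": projected > commit,
    }


# Hypothetical figures: 600 units used of a 1,000-unit commit by June 10.
report = burn_rate(consumed=600.0, commit=1000.0, today=date(2024, 6, 10))
print(report)
```

Comparing the percentage consumed with the percentage of the month elapsed is what flags a problem early: in this example, 60% of the commit is gone with only a third of the month behind us.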
For how long have I used the solution?
I've used the solution for six years.
What do I think about the stability of the solution?
There have not been as many outages in the past year. We also haven't been jumping into new features as quickly as they come out, so we may simply be working with more stable products.
What do I think about the scalability of the solution?
It has scaled up to meet our needs pretty well. Over the years, we have only managed to trigger internal Datadog alerts once or twice, by misconfiguring a metric and letting costs spiral out of control.
How are customer service and support?
Support has been lacking. Opening a chat with the tech support rep of the day is always a gamble. We are looking into working with third-party support because it has been so rough over the years.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We used the ELK stack for logging and monitoring and AppDynamics for APM.
How was the initial setup?
The initial setup for new teams has become easier over the years. We are increasing our adoption rate as we shift our technology to more cloud-native tools. Datadog has supported easy implementation by simply adding a package to the app.
They have really focused on a lot of out-of-the-box functionality, but the real fun happens as you dive deeper into the configuration. We have also begun adopting OpenTelemetry standards. This has kept us from going too deep into vendor-specific implementations.
What about the implementation team?
We did the initial setup via an in-house team.
What was our ROI?
As long as we stay on top of our consumption mid-month, it has been worth it. However, the few engineers we have who are dedicated to playing whack-a-mole with the growing spending could be better utilized in teaching best practices to new users. I suppose our implementation of the rapidly changing tools over the years has led to a fair amount of technical debt.
What's my experience with pricing, setup cost, and licensing?
It is quite easy to set up any specific tool, but to take advantage of the full visibility it offers, you need to instrument across the board, which can be time-consuming. Be careful about how each tool is billed, and watch your consumption like a hawk.
Which other solutions did I evaluate?
What other advice do I have?
It's a very powerful tool, with lots of new features coming, but you certainly will pay for what you get.
Which deployment model are you using for this solution?
Public Cloud
Disclosure: My company does not have a business relationship with this vendor other than being a customer.