What is our primary use case?
We primarily use Datadog for logs, APM, infrastructure monitoring, and Lambda visibility.
We have built a number of critical dashboards that we display in our office. They give engineers a good understanding of application performance and give business partners a high-level view of the traffic flowing through the app.
We started with logging as our primary monitoring tool and have since shifted to APM to get a deeper understanding of what our system is doing and how the changes we make impact the apps.
How has it helped my organization?
Datadog has given us near-live visibility across our entire cloud platform. We are finally in a state where we are alerting our users about degraded performance well before the helpdesk tickets start rolling in.
We are making major architectural decisions based on the data we are getting from Datadog. It also gives us an idea of where the complexity really lies in some older, monolithic apps.
We have used the APM endpoint monitoring to prioritize work on slower endpoints because we can see the total count, as well as the latency. That has been a big driver in our refactor work prioritization.
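As an illustration only, the idea of weighing request count against latency can be sketched in a few lines of Python. The endpoint names and numbers below are made up, and ranking by count multiplied by average latency is just one simple heuristic, not necessarily how any particular team or tool does it.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    name: str              # hypothetical endpoint name
    request_count: int     # total requests in the observation window
    avg_latency_ms: float  # average latency in the same window


def rank_by_total_time(stats):
    """Sort endpoints by total time spent (count x average latency), highest first."""
    return sorted(
        stats,
        key=lambda s: s.request_count * s.avg_latency_ms,
        reverse=True,
    )


# Made-up numbers for illustration; not taken from any real system.
sample = [
    EndpointStats("GET /reports/export", 300, 12_000.0),
    EndpointStats("POST /checkout", 9_000, 950.0),
    EndpointStats("GET /orders", 120_000, 85.0),
]

ranked = rank_by_total_time(sample)
for s in ranked:
    total_seconds = s.request_count * s.avg_latency_ms / 1000
    print(f"{s.name}: {total_seconds:.0f} s total")
```

With these sample figures, the slowest endpoint ends up last in the ranking because it is called so rarely, which is why looking at count and latency together matters more than either one alone.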
We have struggled to get more business-centric measures in our code to surface actual business values in our reports, but that is our next initiative.
What is most valuable?
We started with Log analytics in the beginning stages of our monitoring journey. Those were very insightful, but obviously only as useful as we made them with good logging practices.
The dashboards we created are core indicators of the health of our system and among the most reliable sources we turn to, especially as we have seen APM metrics impacted several times lately. We can usually rely on logs to tell us what the apps are doing.
APM and Traces have been crucial to understanding how users are actually using the app. That drives a lot of our decisions around refactoring and focusing our limited engineering resources.
What needs improvement?
Continued improvement around the cost and pricing model is needed. It is pretty complex, and it takes a fair amount of intimate knowledge to know exactly how turning on a single function is going to impact your bill, especially when you don't see the metrics for a day or two.
We have recently had a number of issues with stability and delays on logging, monitoring, metric evaluation, and alerts. More often than not in the past month, it seems that we get the banner across the top of our dashboards saying that some service is impacted. These issues don't always show up on the incident page, either.
For how long have I used the solution?
We have been using Datadog for two years.
What do I think about the stability of the solution?
Overall, it has been fairly stable for us. There are occasional issues with importing data, which have usually been resolved in a short time. We have never had an issue where data was lost; it was just delayed and eventually backfilled.
It seems (anecdotally, of course) that there have been a few more stability issues lately. On several days, we have seen in-app alert banners indicating that some metric or log ingestion was delayed, or that the web app itself was experiencing severe slowness.
Overall, these issues are resolved rather quickly - kudos to their engineering teams. I hear that they actually use Datadog to monitor Datadog.
What do I think about the scalability of the solution?
Datadog is very scalable, but watch the cost as you grow.
How are customer service and support?
Technical support is hit and miss; there are a number of nuances to how this tool should be implemented, and it is difficult to re-explain how our infrastructure and applications are set up every time we need an in-depth investigation to understand what is broken.
Which solution did I use previously and why did I switch?
Previously, we used AppDynamics. The pricing model didn't seem to fit with actual cloud spend. Now we may have swung the pendulum a little too far, and seem to be dealing with pricing on every facet of the application.
How was the initial setup?
The initial setup was pretty straightforward. Additional tweaks and configuration have been a bit more difficult as we get deeper and deeper into the guts of the integrations. Keeping up with a rapid release schedule and keeping our server clients in sync with our app packages has been troublesome. There have been some major changes in the APM that have introduced a number of bugs and broken some of our dashboards and alerts.
What about the implementation team?
Our in-house team handled the deployment, with a lot of tickets created for the Datadog team.
What was our ROI?
ROI is difficult to measure completely. Our spend in the first year, in the second, and now going into the third has differed significantly.
What's my experience with pricing, setup cost, and licensing?
My advice is to keep a close eye on your overage costs, as they can spiral fast. We turned on some additional span measures and didn't realize until it was too late that they had generated a ton.
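One simple way to catch this earlier is to project month-end usage from what has been consumed so far. The following is a minimal sketch in Python, assuming a plan with a fixed monthly allotment and roughly even daily usage; the function name, the parameters, and the figures are all hypothetical and are not taken from Datadog's billing or API.

```python
def projected_overage(used_so_far, day_of_month, days_in_month, allotment):
    """Linearly project month-end usage and return the amount above the allotment.

    All inputs are hypothetical; real billing rules are more involved than this.
    """
    if day_of_month <= 0:
        raise ValueError("day_of_month must be at least 1")
    # Assume the daily rate seen so far continues for the rest of the month.
    projected = used_so_far / day_of_month * days_in_month
    return max(0.0, projected - allotment)


# Made-up example: 40 million spans used by day 10 of a 30-day month,
# against an assumed allotment of 100 million spans.
overage = projected_overage(40_000_000, 10, 30, 100_000_000)
print(f"Projected overage: {overage:,.0f} spans")
```

A straight-line projection will underestimate a spike from a newly enabled feature, so it is best treated as an early warning rather than a forecast.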
Frankly, we love the visibility it gives us into our applications, but it is a bit cumbersome to ensure we are paying for the right stuff. Overall, the cost is worth it, as it helps us keep system-critical applications up and running, and reduces our detection and correction times significantly.
Which other solutions did I evaluate?
We evaluated Dynatrace and AppDynamics before choosing this product.
What other advice do I have?
Datadog requires pretty close supervision of the usage page to ensure your usage isn't going out of control. They have provided a bunch of new features to assist with retention percentage, but it can be a bit confusing as to what is being retained and what can be viewed again after triggering an alert. It's a difficult balance between making sure you are getting the right data for alerts and still having the correct information available for research after the fact.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: I am a real user, and this review is based on my own experience and opinions.