What is our primary use case?
We use it for custom metrics of our applications and monitoring of our systems.
How has it helped my organization?
My current company didn't have very good monitoring in the past; we had only basic CPU monitoring. With Datadog, we were able to set very specific CPU and memory alerts at the base level, and then we started to pull real business value out of it, like 99th percentile response times for our API calls.
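To give a feel for what those custom metrics look like on the application side (this is a minimal sketch, not our production code; the metric name, tag, and agent address are hypothetical), the DogStatsD client in the standard datadog Python package only needs a couple of lines, and the agent then aggregates the values into the percentile series that back dashboards like our response-time graphs:

    # Minimal sketch: assumes the `datadog` Python package and a local Datadog
    # agent listening for DogStatsD traffic on the default port 8125.
    import time
    from datadog import initialize, statsd

    initialize(statsd_host="localhost", statsd_port=8125)

    def handle_request():
        start = time.time()
        # ... real request handling would go here ...
        elapsed_ms = (time.time() - start) * 1000
        # Report the duration as a histogram; the agent rolls it up into
        # avg/max/count and (configurable) percentile series.
        statsd.histogram("api.request.duration", elapsed_ms, tags=["service:orders"])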
It has turned into an operational dashboard. If you feel something is going wrong, you can immediately open Datadog. It has become our go-to application because we know the answer will be there.
What is most valuable?
Their interface is probably one of the easiest things to use because it lets non-developers and non-engineers quickly get access to metrics and pull business value out of them. We can put together dashboards and give them to non-technical people, who can then see the state of the world.
They have a very good ecosystem for their integrations. They have a lot of different integrations, and we use a lot of them. We have integrations with AWS for ECS, RDS, and the other Amazon subsystems we use, as well as Docker and Splunk integrations. The integrations are great because they're vetted rather than third-party add-ons; they're part of the Datadog ecosystem and seamless.
What needs improvement?
The way data is represented can be limiting. They have added their own little query language that you can use to manipulate things, so you can graph two different metrics and relate them to each other. This is relatively new this year. When I first tried it out a long time ago, you could graph two metrics and they'd overlay, but you couldn't take the ratio between them. However, this looks like the direction they're going, and that's a good direction. I think they should continue adding things that way.
I like being able to put the formulas in myself. I don't want just the average; I want a rolling average over three minutes, not five. They're getting better at letting the user customize this.
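To make that concrete (the metric names here are made up, not ours), the newer query editor lets you name two queries and combine them with a formula, and lets you pick the rollup window yourself rather than taking the default aggregation:

    query a:  sum:api.errors{*}.as_count()
    query b:  sum:api.requests{*}.as_count()
    formula:  a / b
    rollup:   avg:api.request.duration{*}.rollup(avg, 180)

The last line aggregates the metric into three-minute (180-second) buckets, which is the kind of control I was missing in the early versions.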
For how long have I used the solution?
Three to five years.
What do I think about the stability of the solution?
When I started using it years ago, it had stability problems. I remember, specifically, that we ran everything in Docker containers, and there were some problems getting it into a Docker container with very specific memory limits. We couldn't nail down exactly what memory limits the application needed. Once we did that, we were good. However, it was tricky to get the limit right in the first place.
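For anyone wrestling with the same thing, the fix ultimately amounted to running the agent container with an explicit memory limit once we had found a value that worked. Something along these lines (the 256m figure is illustrative, not a recommendation, and <your-api-key> is a placeholder):

    docker run -d --name datadog-agent \
      --memory=256m \
      -e DD_API_KEY=<your-api-key> \
      -v /var/run/docker.sock:/var/run/docker.sock:ro \
      datadog/agent:latest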
What do I think about the scalability of the solution?
It has always scaled for us. Cost scales up too, but that is not necessarily a bad thing. It's reasonable for what they're providing. I haven't had any concerns about scaling.
We use between 100 and 500 servers at any given point in time.
How are customer service and technical support?
For the most part, the technical support is pretty good. Every now and again, you will get stuck with a support rep who could have better training, but in general, they are very good and responsive. They're willing to talk about new features, etc.
How was the initial setup?
The integration and configuration processes have been very smooth because everything is very well-documented. The documentation is phenomenal.
What was our ROI?
We can see trends a lot more easily than if we didn't have the solution. Management can see the changes being made, whether in performance or in the number of hosts that went down. We recently made improvements to some of our internal APIs, which reduced the number of servers that we needed. You could see the load on the system go down and the number of servers go down, so it was easy to visualize.
What's my experience with pricing, setup cost, and licensing?
Pricing and licensing are reasonable for what they give you. You get the first five hosts free, which is fun to play around with. Then it's about four dollars a month per host, which is very affordable for what you get out of it. We have a lot of hosts that we put a lot of custom metrics into, and every host gives you an allowance for the number of custom metrics. We have not had a problem with it.
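As a back-of-the-envelope check against our own footprint of 100 to 500 servers, four dollars a month per host works out to roughly $400 to $2,000 a month before any custom-metric overages, which lines up with what I said above about cost scaling reasonably.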
Which other solutions did I evaluate?
My current company is pretty good about looking at alternatives, and I also evaluated alternative solutions at my last company.
There are some other competitors. For example, I know one of them started doing metrics, and their licensing is very cheap because the metrics take up very little space and they charge per megabyte of storage. However, the interface and integrations aren't there.
The other thing is granularity. Datadog gives you one-second granularity for a year, whereas some of the competitors roll data up: after about a week you don't have one second, you have five seconds; after a month you don't have five seconds, you have a minute. Whether it averages or maxes the data, you start to lose the granularity and the ability to see incidents historically, which is super valuable. If we have an incident that we think we've seen before and want to look back historically, we can zoom right in and see where it peaked in the database.
What other advice do I have?
Give Datadog a try. It's the leader in this space.
I have only used the AWS version of the product.
They have a thing for the color purple, but it is all good.
Disclosure: I am a real user, and this review is based on my own experience and opinions.