We can build dashboards as fast as we roll out new systems, which can be fast.
We use standard and custom metrics on every new system we roll out, giving us 360-degree visibility into our systems.
The most valuable features have been: shareable dashboards, timeboards, the DogStatsD API, the Slack integration, the event logging API, CloudTrail events, tags, alerts, anomaly detection, and EBS Volume Snapshot Age (which they added upon request). We used the PagerDuty integration for a while as well.
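For context on the DogStatsD API mentioned above: it speaks a simple plain-text datagram protocol over UDP (port 8125 by default). As a minimal sketch, not the reviewer's actual setup — the metric name and tags below are made-up examples — a counter increment can be formatted and sent like this:

```python
import socket

def dogstatsd_datagram(metric, value, metric_type, tags=None):
    """Format a DogStatsD datagram: 'name:value|type|#tag1:v1,tag2'."""
    datagram = f"{metric}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

def send(datagram, host="127.0.0.1", port=8125):
    # The Datadog agent listens for DogStatsD on UDP 8125 by default;
    # UDP sends are fire-and-forget, so this never blocks the app.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(datagram.encode("utf-8"), (host, port))
    finally:
        sock.close()

# 'deploys.count' and the 'env:prod' tag are hypothetical examples.
send(dogstatsd_datagram("deploys.count", 1, "c", tags=["env:prod"]))
```

In practice you would use the official `datadog` client library rather than hand-rolling datagrams, but the wire format is this simple.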
More granular control over dashboard and timeboard sharing would be welcome.
There are infrequent hiccups, which have been decreasing over the time we have used it.
No.
Customer Service:
I've never seen better. Questions are usually answered almost immediately, even on weekends, and support is right in-stream with your event stream.
Technical Support:
High.
Overall they have always had an amazing team, and quality has been maintained as the company has grown.
Complementary to other tools we used.
Setup is generally easy. They provide a large number of integrations; some are more complex than others, which is to be expected.
In house implementation.
We didn't calculate ROI explicitly, but since we used the product to track down underutilized instances, it more than paid for itself in the first month.
Pricing overall in this segment has standardized in the last several years.
A few, including Zabbix and Icinga.
One of the fastest and most flexible tools we have used in this area.
We primarily use the product for tracing, metrics, and alarms in various deployment environments.
The product has provided our company with improved observability, which has made incident response more targeted and quicker.
APM is great and has provided low-effort out-of-the-box observability for various services.
Monitors are helpful, and definitions are simple.
Terraform support is nice, as it allows us to create homogeneous monitoring environments across various deployment environments with little additional effort. It also facilitates version control of monitor definitions, etc.
The Golang profiler is generally good, with the exception of delta profiles; it has provided helpful observability into heap allocations, which has helped us reduce GC overhead.
Delta profiles in the Golang profiler are extremely expensive in terms of memory utilization. In a Kubernetes environment where we would like to set per-pod memory allocations as low as possible, the overhead of that profiler feature is prohibitive. In one case, our pods (which were provisioned to target 250 MB and max out at 500 MB of memory) got stuck in a crash loop due to out-of-memory errors caused entirely by the delta profiles feature of the profiler.
Multistep Datadog synthetics lack the feature of basic arithmetic. For our use case, performing basic arithmetic on the output of previous steps to produce input for subsequent steps would be extremely useful.
I've used the solution for nine months.
We use Datadog to monitor our Kubernetes clusters.
We have 3 different clusters for different parts of the SDLC. We run the Datadog agent DaemonSet as well as the Datadog cluster agent. Our services have the APM installed by default.
To create monitors, we use Terraform. This is provided out-of-the-box for our service owner.
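A Terraform-managed monitor of the kind described here looks roughly like the following sketch; the service name, query, and threshold are hypothetical placeholders, not the reviewer's actual configuration:

```hcl
# Hypothetical example using the Datadog Terraform provider.
resource "datadog_monitor" "high_error_rate" {
  name    = "High error rate on checkout-service" # made-up service name
  type    = "metric alert"
  message = "Error rate is elevated. Notify: @slack-oncall"

  # Placeholder query and threshold for illustration only.
  query = "avg(last_5m):avg:trace.http.request.errors{service:checkout-service} > 0.05"

  monitor_thresholds {
    critical = 0.05
  }

  tags = ["team:payments", "managed-by:terraform"]
}
```

Keeping monitors in Terraform like this is what enables the version control and per-environment duplication the reviewer mentions.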
We run our K8s clusters on EKS; therefore, we also make use of some of the AWS monitoring capabilities that can be integrated into Datadog.
We are hugely reliant on Datadog for all aspects of our system.
With Datadog, we were able to gain observability in our system.
The installation step is pretty straightforward.
It's easy for non-DevOps users to use. For instance, our engineers do not interact with K8s often, so it is hard for them to debug. However, with Datadog, they are able to view their containers and deployments with a single click.
We also heavily use the tags to help us identify who the service owners are. This is super useful when we need to track owners for patching or pick up new features we implemented.
The APM and K8s monitoring are the most valuable aspects of the solution. The K8s monitoring allows all customers to view their infra, even if they do not use K8s daily. They can just click on a few tabs to get all of the information they need.
It is also very easy to install on our system. APM has helped debug applications on our system as well. We were able to see why a service had suddenly shut down.
We use Datadog for SLOs/SLAs as well. We check the live endpoint of services to ensure they are still up and running.
There is not much that needs to be improved.
The UI is super user-friendly. The deployment process is easy. We enjoy using the integrations with Slack and PagerDuty.
Customer support is awesome from our experience. There is a lot of documentation for us to be able to use if we need to.
I'm not sure if Datadog can monitor K8s deployments in real-time. For instance, being able to see a deployment step by step visually. This would be helpful if there were any incidents during the deployment.
In general, Datadog is a great solution.
I've used Datadog since I joined my company about a year ago.
We haven't had issues with the stability.
The scalability is really great.
We've had no issues with the product or support.
The initial setup is super simple, and the documentation was helpful.
We managed the initial setup process in-house.
We've witnessed ROI in our DevOps.
We primarily use the solution for monitoring and log analysis.
Datadog shows all the logs for the services, and it is very useful for troubleshooting.
The most valuable aspect of the solution is the APM.
The logging capabilities are quite useful.
The logging could be improved in the future.
I've used the solution for four years.
Our use case is mainly deployment into our applications for monitoring/logging observability. Our microservices currently feed into an actuator that exists in each instance of our application, which extends to local and central Grafana instances for client and internal visibility. The application we use is Grafana.
Logging captures application and system logs that are ported to each application instance for querying.
Whenever anything occurs that is considered unhealthy from a range of health checks, we have notification rules configured internally and externally for a prompt response time.
We have become a more confident, knowledgeable, and capable team now that everything is ported into a centralized format. Beforehand, knowledge was isolated to individuals, and uncertainty about what information represented and where it lived led to a lack of confidence. Having everything in one place rules out that confusion and allows us to respond better to issues.
It also allows for personal growth as our team is learning the application from the ground up, and each person is enhancing their own skills.
The valuable features include the following:
The ability to find what you are looking for when starting out could be improved. It was a bit overwhelming trying to figure out what is the best solution. It led to many prototypes or time spent just perusing documentation. If we were able to select bundles or template use cases, we would hit the ground running quicker.
I've used the solution for one year.
We are using the solution for scaling up the website for market data applications. EC2 and Datadog have enabled high-level monitoring of underlying infra and services.
The Datadog profiler comes in handy to pinpoint issues with resource utilization during peak hours, and traces/log management helps narrow down the root cause.
The network map is crucial in identifying bottlenecks and determining what needs more attention.
The host map helps identify problematic hardware and devise ways to counter issues that arise while scaling and deploying solutions on the cloud.
While my team is relatively new to Datadog, I already see immense value in switching over to Datadog as the primary APM and NPM tool.
The arsenal of features it offers is bound to come in handy when facing production issues and when finding out what went wrong is crucial.
The network map has helped to figure out the golden signals and optimize the infrastructure.
The synthetics have helped ensure the high-availability architecture functions as intended.
The network map is useful. The ability to see data flow along the entire network path, across all the applications, is highly valuable, as the data from this service helps identify network bottlenecks, non-performant applications, and bad endpoints.
This is especially crucial for a high-availability website aimed at market data applications where low latency is crucial.
The host map gives a clear picture of the entire infrastructure, and the ability to switch between logs, metrics, and traces is very handy when it comes to debugging issues on the fly.
I love the ability to install the integrations and agents quickly. This is a well-made product.
To be very fair, I haven't had enough experience with Datadog to pick out improvements.
My involvement with Datadog has largely been positive. I love the simplicity and intuitiveness it offers - even for nontechnical folks who just might be starting out with developing technical chops in their domain.
I've used the solution for three years.
We use metrics to track the metrics of our application. We use logging to log any errors or erroneous application behavior as well as successful behavior. We use events to log successful steps in our pipeline or failed steps in our deployment. We use a combination of all these features to diagnose bugs.
It makes it much more efficient to look at all the data in one place. This speeds up our development so that we can be agile.
This spectrum of solutions has allowed us to track down bugs faster, which allows us to limit revenue lost during downtime.
It also allows us to accurately record and project current and future revenue by measuring the application's metrics. This way, my team can accurately and rapidly create reports for upper management that are easy to read and understand.
Datadog is also easy to read by non-technical personnel. This way, if there are any erroneous readings, everybody has a chance to find them.
We use metrics to track the metrics of our application. We use logging to log any errors or erroneous application behavior as well as successful behavior. We use events to log successful steps in our pipeline or failed steps in our deployment.
We use a combination of all these features to diagnose bugs. It makes it much more efficient to look at all the data in one place. This speeds up our development so that we can be agile.
These are the features I use the most, since it would be incredibly difficult to track down intermittent bugs by looking directly under the hood in a CLI.
Datadog could make its use cases more visible, either through the docs or tutorial videos. There are different implementations of certain features that we use to customize Datadog's functionality, and in doing so we sometimes get results that don't match what Datadog intends the features' use cases to be.
I've used the solution for at least one year.
We have only used Datadog. We did not previously use a different product.
We use the solution for monitoring time spent on views and events triggered. For example, for one of our products, we have created a custom dashboard that lets us track all the custom events and multiple entry points in the same part of the application.
Knowing the entry point helps us choose which part of the program should be improved next. It also helps us with collecting important data about the overall usage of each module within our application.
The solution has helped our organization with custom events to track specific cases.
It's helped with monitoring time spent on views and events triggered. For example, for one of our products, we have created a custom dashboard that lets us track all the custom events as well as multiple entry points into the same part of the application.
Knowing the entry point helps us choose which part of the program should be improved. It's collecting important data about the overall usage of each module within our application.
The most valuable feature is the custom events to track specific cases.
Monitoring time spent on views and events triggered is valuable. For example, for one of our products, we have created a custom dashboard that lets us track all the custom events and multiple entry points into the same part of the application.
Knowing the entry point helps us decide on improvements. We can collect important data about the overall usage of each module within our application.
We look forward to the next features from Datadog. We need to learn more about the Session Replay feature inside of DD.
I've used the solution for two years.