We have found that some of the options for filtering log ingestion, APM trace and span ingestion, and RUM session versus replay settings can be hard to discover, and it is tough to determine how to adjust them for both optimal performance and monitoring as well as for billing within the console. It can sometimes be difficult to determine which information is documented, as we have found inconsistencies with deprecated information, such as environment variables within the documentation.
I'd like to see an expansion of the Android and iOS apps to have a simplified CI/CD pipeline history view. I like the idea of monitoring on the go, yet it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed. In some cases the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
Datadog is great overall. One thing to improve would be making it easier to see common patterns across traces. I sometimes end up in a trace but have a hard time finding other common features about the error/requests that are similar to that trace. This could be easier to get to; however, in that case, it's actually an education issue. Another thing that could be improved is that the service list page sometimes refreshes slowly, and I accidentally click the wrong environment since the sort changes late.
The query performance could be improved, particularly when handling large datasets, as slower response times can hinder efficiency. Additionally, the interface can sometimes feel overwhelming, with so much happening at once, which may discourage users from exploring new features. Simplifying the layout or providing clearer guidance could enhance user experience. Any improvements related to query optimization would be highly beneficial, as it would further streamline workflows and boost productivity.
Senior Manager, Site Reliability Engineering at Extra Space Storage
Real User
Top 20
2024-09-18T20:43:00Z
Sep 18, 2024
We need better visibility into our consumption rate, which is tied to our commit levels. We would love to see a % consumed and alert us if we are over budget before getting an overage charge 20 days into the month. The biggest complaint we hear comes from the cost of the tool. It is pretty easy to accidentally consume a lot of extra data. Unless you watch everything come in almost daily, you could be in for a big surprise. We utilize the Datadog estimated usage metrics to build out alerts and dashboards. The usage and cost system page still doesn't tie into our committed spending - it would be wonderful to see the monthly burn rate on any given day.
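The "% consumed" and burn-rate check described in this review can be sketched as simple arithmetic. This is only an illustration: the function name, its parameters, and the linear projection are assumptions, not a Datadog feature or API, and the usage figure would have to come from whatever usage data you already collect.

```python
from calendar import monthrange
from datetime import date


def commit_burn(usage_to_date: float, monthly_commit: float, today: date) -> dict:
    """Compare usage so far against a monthly commit (hypothetical helper)."""
    days_in_month = monthrange(today.year, today.month)[1]
    # Share of the commit already consumed, in percent.
    pct_consumed = 100.0 * usage_to_date / monthly_commit
    # Share of the month that has elapsed, in percent.
    pct_elapsed = 100.0 * today.day / days_in_month
    # Naive linear projection: assume the daily rate so far continues.
    projected = usage_to_date / today.day * days_in_month
    return {
        "pct_consumed": pct_consumed,
        "pct_of_month_elapsed": pct_elapsed,
        "projected_month_end": projected,
        "over_pace": projected > monthly_commit,
    }


report = commit_burn(usage_to_date=600.0, monthly_commit=1000.0, today=date(2024, 9, 15))
print(report)
```

Comparing the percentage consumed with the percentage of the month elapsed gives an early warning on any given day, rather than 20 days in.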
Software Engineer at a computer software company with 201-500 employees
User
Top 20
2024-09-18T19:24:00Z
Sep 18, 2024
One key improvement we would like to see in a future Datadog release is the inclusion of certain metrics that are currently unavailable. Specifically, the ability to monitor CPU and memory utilization of AWS-managed Airflow workers, schedulers, and web servers would be highly beneficial for our organization. These metrics are critical for understanding the performance and resource usage of our Airflow infrastructure, and having them directly in Datadog would provide a more comprehensive view of our system’s health. This would enable us to diagnose issues faster, optimize resource allocation, and improve overall system performance. Including these metrics in Datadog would greatly enhance its utility for teams working with AWS-managed Airflow.
They need an expansion of the Android and iOS apps to provide a simplified CI/CD pipeline history view. I like the idea of monitoring on the go. That said, it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed. In some cases the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
I'd like to see an expansion of the Android and iOS apps to have a simplified CI/CD pipeline history view. I like the idea of monitoring on the go; however, it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed. Sometimes, the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
Application Development Team Lead at TCS EDUCATION SYSTEM
User
Top 10
2024-09-18T18:11:00Z
Sep 18, 2024
I'd like to see an expansion of the Android and iOS apps to have a simplified CI/CD pipeline history view. I like the idea of monitoring on the go; however, it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed. In some cases the screenshots don't match the text as updates are made. I feel I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
The product can be improved by allowing the grouping of APIs to add variables. That way, any API with a unique ID could be grouped together. Furthermore, SEO monitoring has been crucial for us but also a difficult part to set up, as comparing alarms between us and competitors is a tough feat. Data is not always consistent, so we have been experimenting with removing the noise from Datadog, but it's been taking a while. Finally, Datadog should have a feature that reports stale alarms based on activity.
It's not that straightforward when creating an alert. The syntax is a little confusing. I guess that the trade-off is customizability. But it would be nice to have a click-and-drag kind of way when creating an alert. So, if someone who isn't so familiar with Datadog or tech in general wanted to create an alert, they wouldn't need to know the syntax. It would also be great if AI could be used to generate alerts and graphs. I could write a short prompt, and then the AI could auto-generate alerts and graphs for me.
I honestly can't think of anything that can be improved. We've started using more and more features from our Datadog account and are really grateful for all of the different ways we can track and monitor our site. We did have an issue where a synthetic test was set up before the holiday break, and we were quickly charged a great amount. Our team worked with Datadog, and they were able to help us out since it was inadvertent on our end and was a user error. That was greatly appreciated and something that helped start our relationship with the Datadog team.
The monitors can be improved. The chart in the monitors only goes back a couple of hours, which is clunky. Also, it could provide more info, like traces within the monitors. We have many alerts connected to different notification systems, such as Slack and Opsgenie. When the on-caller receives notifications fired by the alerts, we are taken to the monitors. Yet often, we have to open up many different tabs to see logs, traces and info that is not accessible on the monitors. I think it would make all of the on-callers' lives easier if the monitor had more data.
While Datadog is an excellent monitoring solution, it could be improved by building more features to replace alerting apps like OpsGenie and PagerDuty. Specifically, we'd like to see more advanced incident management capabilities integrated directly into the platform. This could include features like sophisticated on-call scheduling, escalation policies, and incident response workflows. Additionally, we'd appreciate more customizable machine learning-driven anomaly detection to help us identify unusual patterns more accurately. Improved support for serverless architectures, particularly for monitoring and tracing AWS Lambda functions, would be beneficial. Enhanced security monitoring and threat detection capabilities would also be valuable, potentially reducing our reliance on separate security information and event management (SIEM) tools.
Application Engineer at Discover Financial Services
User
Top 20
2024-06-25T16:25:00Z
Jun 25, 2024
One issue I do have with logs is the length of time they are on the platform. Some issues happen sporadically, so it would be good to have logs for longer than one month by default or make it a configuration. I have yet to try rehydrating logs, so this might be an option I need to try. Another issue I have is with the search syntax; it could be simpler. The syntax is a bit cumbersome and there is no intuitive way to save searches to look for similar ones in the future. Finally, while my company replaced a different tool for session replay with Datadog's version, I find it clunky and in need of further improvements. For example, when troubleshooting a web portal issue, it is super important to know what the user clicked, but the elements are not where they should be in the replay. It is also hard to find details about the sessions, and metadata such as user email, account, etc. that exist on other services with replay features.
For three to four months, we have been experiencing real-time delays. For example, if we're monitoring incoming traffic, the real-time status should be displayed up to a certain point. However, due to delays or issues with Datadog, the real-time data might only be updated at an earlier time. We are experiencing consistent delays in data updates from Datadog, with the most recent data often being delayed by about an hour. This issue has been ongoing for the past four months.
Delivery Manager, DBA Services at a manufacturing company with 10,001+ employees
Real User
Top 20
2023-01-25T15:49:08Z
Jan 25, 2023
Datadog isn't as mature as some of the established players like Dynatrace or Splunk. It's a new product, so they are constantly releasing new features, and I don't have much to complain about.
Software Engineering Manager at a healthcare company with 501-1,000 employees
Real User
2022-12-06T21:07:00Z
Dec 6, 2022
Overall, we really like the quality and relevance of all of the Datadog products that are currently being used. The documentation is very well organized and is the go-to place for us to find answers to our questions. We would really like to see more from the Service Catalog. It is something that we are interested in. However, some might think it lacks some key features at this time. We will definitely keep our eye out for this and adopt it when all the features are implemented. We're really looking forward to all the great things DD will do.
Integration should have been easier. It is very tough to go to all the services and enable Datadog integration for each AWS service. We can add the AWS services and the services on one page and show only the services that are enabled. A similar approach should be for any other integration. Lately, chat support has a longer waiting time. We would love to get faster chat support. We also need additional support for sending the flare files.
Software Developer at a pharma/biotech company with 51-200 employees
Real User
2022-12-06T20:54:00Z
Dec 6, 2022
Sometimes it’s difficult to customize certain queries to find specific things, specifically with the logging solution. I’ve used other logging platforms in the past that have extensive and mature query languages. These might not be super friendly to start out with, yet they can be very powerful. I wish there was more of an emphasis on query languages instead of the UI-based tooling that Datadog provides. Even though it is powerful on its own, the UI-based design lacks the elegance, efficiency, and complexity of a query language.
Software Engineer at a comms service provider with 5,001-10,000 employees
Real User
2022-12-06T20:44:00Z
Dec 6, 2022
Delta traces on the Golang profiler are extremely expensive concerning memory utilization. In a Kubernetes environment where we would like to set per-pod memory allocations as low as possible, the overhead of that profiler feature is prohibitive. In one case, our pods (which were provisioned to target 250 MB and max at 500 MB memory) got stuck in a crash loop due to out-of-memory, which was caused entirely by the delta profiles feature of the profiler. Multistep Datadog synthetics lack the feature of basic arithmetic. For our use case, performing basic arithmetic on the output of previous steps to produce input for subsequent steps would be extremely useful.
There is not much that needs to be improved. The UI is super user-friendly. The deployment process is easy. We enjoy using the integrations with Slack and PagerDuty. Customer support is awesome from our experience. There is a lot of documentation for us to be able to use if we need to. I'm not sure if Datadog can monitor K8s deployments in real-time. For instance, being able to see a deployment step by step visually. This would be helpful if there were any incidents during the deployment. In general, Datadog is a great solution.
Software Engineering Manager at a hospitality company with 1,001-5,000 employees
Real User
2022-12-06T20:16:00Z
Dec 6, 2022
Datadog is so feature-rich that it is often hard to onboard new folks and tough to decide where to invest time. The APM is a perfect example of this. This feature alone has so much (profiling, tracing, span summary, flame graphs). I would love to see more of the insight and automation-focused features, such as the log patterns, where I can spend time more efficiently. The cost of Datadog at scale can get very expensive very quickly. I would like to see a better usage/cost dashboard with breakdowns like the AWS cost explorer.
Senior Software Engineer at a transportation company with 51-200 employees
Real User
2022-12-06T19:56:00Z
Dec 6, 2022
I found the documentation can sometimes be confusing. I tried configuring APM for some of our Python containers, and I had to cross-reference multiple blog posts and the official documentation to figure out which Datadog-agent to use, whether I needed a ddtrace trace, what environment variables I should set, etc. Furthermore, to generate my own traces, I wasn't aware that ddtrace adds its own "monkey patching," which led to headaches with respect to configuring the service for RabbitMQ. A more unified and up-to-date documentation suite would be greatly appreciated.
Atlassian Expert at a tech consulting company with 51-200 employees
Real User
2022-12-06T19:50:00Z
Dec 6, 2022
The current way accounts are billed could be vastly improved - especially when involving multiple organizations across multiple accounts in combination with reserved commitments. Being able to have an automatic materialized report on certain dashboards that could be exported as PDF to be shared with non-Datadog users could help a lot. Other than that, we are more than happy with the features we use regularly.
Senior Site Reliability Engineer at a tech vendor with 10,001+ employees
Real User
2022-12-06T19:42:00Z
Dec 6, 2022
Managing dashboards as IaC is a bit hard to work out at times. I use custom tools to convert JSON dashboards to Terraform resources. Ideally, I'd like for some sort of building tool for this to be built into the app. For example, a templating system that can easily be exported to IaC would be transformative for us. There are also some aspects of the API that can be a bit verbose - especially in the area of new features like SLOs - and take some time to understand. That said, overall, they're well-documented enough to be a minor concern for us.
Senior Engineering Manager, Mobile Wireless Engineering at a comms service provider with 10,001+ employees
Real User
2022-12-06T19:26:00Z
Dec 6, 2022
We need more integration functionality, including certain metrics integration. We should be able to monitor devs and need it to build more monitoring tools and offer leadership metrics.
The product is quite complex, and there are so many features that I either didn't know about or wasn't sure how to use. One thing that could be improved is somehow surfacing interesting or relevant products that might be applicable given our infrastructure. Additionally, the billing can sometimes be confusing and opaque, especially around not making it obvious what the implications can be if you add different AWS integrations. This has caused some unexpected costs in the past due to engineers not understanding how Datadog pricing works.
We primarily use the log management functionality, and the only feedback I have there is better fuzzy text searching in logs (the kind that Kibana has). I've learned about a ton of other offerings, like APM, NPM, etc., over the course of workshops. Once I try those out, I'm sure I will have additional feedback.
Site Reliability Engineer at a financial services firm with 1-10 employees
Real User
2022-10-05T09:22:08Z
Oct 5, 2022
Graph filters for logs need to be set manually, which works well for JSON but not for unstructured logs. Making structured logs for high-performance applications is over our heads, so we had to dump some technical streams for our logs.
Datadog could be improved if it could detect other software in a container or server. Datadog is better than other APM or observability tools, but it focuses mostly on telling the customer what they need to know about the software, database or applications that land on the server. We also need to know the version before setting up an agent with the APM modeling tool. In some instances, the owner of a particular software changes to another person and this person did not originally transfer the knowledge or data to manage the server. The new person needs to monitor this server and they need to know what software or version of software was installed on this server before they used the APM agent for monitoring. If Datadog could provide this insight, it would improve how we use the solution. In a future release, we would like to be able to complete a network traffic or network flow analysis to detect the errors or problems on the network.
Senior Engineer at an educational organization with 5,001-10,000 employees
Real User
2022-08-15T10:42:13Z
Aug 15, 2022
Datadog needs more local Asia-Pacific support, and if they don't have a SaaS solution in Asia-Pacific, they should offer an on-prem version. I'm told that's not possible.
I haven't really noticed anything that they could improve upon. Maybe they could add in some features to go both ways, to maybe make some configuration changes, etc. That's a little bit outside of what Datadog does, though. It's really very full-featured, so I don't really have any complaints. I haven't really fully looked at the documentation as I know where I need to go and look at things. It could probably be a little bit of a better user experience. There are so many functions there that sometimes navigating your way around is a little bit hard. They have a really nice menu system. However, there's so much there. It's possible that I skipped a guided tour when I started. It’s not intuitive to everyone. There are a lot of technical features.
IT Test Manager at a transportation company with 10,001+ employees
Real User
2022-03-29T15:58:56Z
Mar 29, 2022
I'd like to see more flexibility in the customization. They have a few settings which need to be changed, but we are unable to make those changes as users or as the administrator. The tagging to get the different parts of the monitoring interconnected is a bit tricky and takes time to work out.
Chief Strategy Officer (CSO) at a computer software company with 11-50 employees
Real User
2022-02-04T12:22:43Z
Feb 4, 2022
Datadog has a lot of features crammed into one dashboard. It's quite hard to work out which feature does exactly what. There was a steep learning curve, trying to navigate through menus. The menu navigation could improve. If there was a more straightforward way of adding new functions or features to where each menu is placed, that would be an improvement.
Sr.Tech.Analyst Monitoreo at a financial services firm with 1,001-5,000 employees
Real User
2021-11-07T10:11:00Z
Nov 7, 2021
It could use some additional features when working with metrics like Grafana or like New Relic has. Datadog does not use library technologies like Dynatrace does. Datadog has machine learning too, but it does not have this option in all layers of monitoring like infrastructure service process in applications.
Senior Cyber Security Expert at a security firm with 11-50 employees
Real User
2021-09-09T19:57:09Z
Sep 9, 2021
While I like the ease of use, when compared with Tenable Nessus they could still improve their usability. They are okay, but there is room to be better. They could have more integration. They could be more intuitive as well. For example, the intuitiveness of the user interfaces, and how long it takes for users to learn how to use Datadog. It is not impossible to use, or impossible to do the administration with it, but when you put these two next to each other, meaning Nessus and Datadog, Nessus comes out as the winner.
Project Director at a tech services company with 501-1,000 employees
Real User
2021-05-18T17:10:08Z
May 18, 2021
Its pricing model can be improved. Its settings should be improved for a better understanding of billing. They should also provide some alerts when there is an increase in usage. For example, if usage increases by 20% or more from one week to the next, the customer should get an alert.
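The week-over-week alert suggested in this review amounts to a percentage-change check. A minimal sketch follows, assuming you already have two weekly usage totals; the function name and the way a zero baseline is handled are illustrative choices, not part of any Datadog feature.

```python
def usage_spike(previous_week: float, current_week: float,
                threshold_pct: float = 20.0) -> bool:
    """Return True when usage grew by at least threshold_pct week over week."""
    if previous_week <= 0:
        # No baseline to compare against: treat any usage as a spike (assumption).
        return current_week > 0
    change_pct = 100.0 * (current_week - previous_week) / previous_week
    return change_pct >= threshold_pct


print(usage_spike(100.0, 125.0))  # 25% increase
print(usage_spike(100.0, 110.0))  # 10% increase
```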
It can have a more modernized pricing mechanism. We're actually working with them to figure out how to become more modular and have a better and more modernized pricing mechanism. The issue with Datadog is that you have to buy the whole suite of different products, and you kind of get stuck in the old utilization of 40% of their suite. Most organizations today break down between application development, networking, and security. Therefore, there should be a way to break down different modules into just app dev, infosec, networking, etc. Customers have various needs across their business lines, and sometimes, they're just not willing to have tools that they're not using 100%. AppDynamics is probably a little bit better in terms of being modular.
Senior Manager, Site Reliability Engineering at Extra Space Storage
Real User
Top 20
2021-01-25T19:36:00Z
Jan 25, 2021
Continued improvement around cost and pricing model is needed. It is pretty complex and takes a fair amount of intimate knowledge to know exactly how turning on a single function is going to impact your bill, especially when you don't see the metrics for a day or two. We have recently had a number of issues with stability and delays on logging, monitoring, metric evaluation, and alerts. More often than not in the past month, it seems that we get the banner across the top of our dashboards that some service is impacted. They don't always show up on the incident page, either.
We need the ability to create a service dependency map like Splunk ITSI. We have to build this in PagerDuty and it's not the best user experience. The ability to create custom inventory objects based on logs ingested would be a value add. It would be better if Datadog makes this a simple click and enable. It would be helpful to have the ability to upgrade agents via the Datadog portal. Once agents are connected to the Datadog portal, we should be able to upgrade them quickly. Security monitoring for Azure and Operating System (Windows and Linux) are features that need to be addressed. Dashboards for Azure Active Directory metrics and events should be improved.
The incident management beta looks promising, but it is still missing the ability to automatically create incidents based on certain alerts. SLOs are also a great way to visualize how you are doing with regard to the level of service that you are providing, but they are missing crucial components like:
* The ability to visualize the remaining error budget and how it evolved during the month. An error budget burndown graph would be helpful.
* The ability to display a different level of alert on an SLO based on how fast it is consuming the error budget. This is the slow burn versus fast burn.
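The slow-burn versus fast-burn idea can be illustrated with the usual burn-rate ratio: the observed error rate divided by the error rate the SLO allows. The sketch below is an assumption-laden illustration, not Datadog's implementation; the function names and the two thresholds are hypothetical and would need tuning for a real SLO.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly on pace.
    """
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate


def classify_burn(rate: float, fast_threshold: float = 10.0,
                  slow_threshold: float = 1.0) -> str:
    """Map a burn rate to an alert level (thresholds are hypothetical)."""
    if rate >= fast_threshold:
        return "fast burn"
    if rate >= slow_threshold:
        return "slow burn"
    return "ok"


# 5 failures in 1,000 requests against a 99.9% target burns about 5x too fast.
rate = burn_rate(bad_events=5, total_events=1000, slo_target=0.999)
print(round(rate, 2), classify_burn(rate))
```

Evaluating the same ratio over a short and a long window is one way to separate a sudden outage from a gradual drain of the budget.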
Their logging solution is expensive for our use case. They do have the capability to rehydrate old or incomplete logs, and it works, but I would rather not have to think about that operation. Datadog has a lot of documentation, but a lot of that documentation assumes you know how the service works, which can lead to confusion. On a positive note, they do have lots of documentation; it just needs better curation. Their APM solution still needs some work, but they are actively developing it. I would also like to see more database-specific application monitoring.
Please add PHP profiling; you already have it for other popular programming languages such as Python and Java, which is great because we have a little bit of those, but our main app is powered by PHP and we don't have profiling for this yet. I guess it's only a matter of time for this to be added, so in the meantime, you can consider this review as a vote for PHP profiling support. The pricing model could be simplified as it feels a bit outdated, especially when you look at the billing model of compute instances vs container instances.
More pre-configured "Monitor Alerts" would be helpful. Datadog's knowledge of its customers and what they are looking for in terms of monitoring and alerting could be taken advantage of with pre-canned alerts. They have started this with "Recommended Monitors". That feature was very helpful when configuring our Kubernetes alerts. More would be even better. Datadog tech support is very good. One area that could be more helpful is actually talking to someone or sharing your screen to help troubleshoot issues that arise. For new cloud engineers just coming into the cloud monitoring field, there is a learning curve. There is a lot to learn and figure out. For example, we still ran into some issues configuring the private link and more videos of how to do things could be of use.
Technology Competency and Solution Head at LearningMate
Real User
2020-11-25T16:41:00Z
Nov 25, 2020
The error traceability is an area that can be improved. This is something that helps us to pinpoint the area where a problem is occurring. It is a function stack, and it should be showing us how each function is defined.
Senior Cloud Security Engineer at a financial services firm with 201-500 employees
Real User
2020-10-21T04:33:58Z
Oct 21, 2020
I believe there is room for improvement with this solution. It wasn't easy for me to get a quick understanding of what this tool offers us as opposed to the added tools of AWS. By that, I mean in regards to finding a better way to apply some filters or to create some alarms. I don't get more advanced features in comparison to AWS but at least I get a centralized way of doing things, which can be done on the AWS side as well. It's more complicated because you have to configure some other services to stream their logs from multiple accounts to one account. It could be more user friendly and include advanced examples in the documentation showing some use cases or customer case studies, so you can get a clear idea that this functionality provides something extra.
Datadog lacks a deeper application-level insight. Their competitors had eclipsed them in offering ET functionality that was important to us. That's why we stopped using it and switched to New Relic. Datadog's price is also high.
There are things about it that we would like to be fixed, such as it taking averages of averages. This results in data that we don't expect, but overall we are happy with it.
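Why an average of averages can give unexpected numbers is easy to show with a small worked example. The data and function names below are made up for illustration, and nothing here describes how Datadog itself aggregates; it only shows the general effect when groups have different sample counts.

```python
def average_of_averages(groups: list[list[float]]) -> float:
    """Average each group first, then average those results (unweighted)."""
    return sum(sum(g) / len(g) for g in groups) / len(groups)


def overall_average(groups: list[list[float]]) -> float:
    """Average over every individual sample, so large groups weigh more."""
    total = sum(sum(g) for g in groups)
    count = sum(len(g) for g in groups)
    return total / count


# One host reported a single slow sample; another reported four fast ones.
latencies = [[100.0], [10.0, 10.0, 10.0, 10.0]]
print(average_of_averages(latencies))  # 55.0
print(overall_average(latencies))      # 28.0
```

The two results differ because the unweighted version gives the single-sample group the same weight as the four-sample group.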
The product could do better with its notifications. I want more technical support than conferences because technical support helps with setting up the product much easier.
Some of their newer solutions are interesting, like their logging, but they are not fleshed out. They could use more metrics or synthetics, which would be really helpful. I would love to see support for front-end and mobile applications. Right now, it is mostly all back-end stuff. Being able to do some integration with our front-end products would be awesome.
The only thing that they were missing that has thrown us from the beginning (they are still missing it) is consistency in the APIs. There are a couple of guys on the automation side who complain rightfully over how hard it is because every new feature which comes out has a new way of interfacing with the API. This was our big, red flag in the beginning, but given the price and other features, it wasn't enough for us to discount. We said that we would live with this one red flag, but it is still a red flag. Stability of the product has been a concern for us outside of the primary monitoring agents. It does not have the best interface.
System Ninja at a philanthropy with 51-200 employees
Real User
2018-12-11T08:30:00Z
Dec 11, 2018
We want to reduce having to go to different screens to obtain all the information. However, they are moving in the right direction from what we have noticed.
Site Reliability Engineer at a computer software company with 201-500 employees
Real User
2018-12-04T07:57:00Z
Dec 4, 2018
The way data is represented can be limiting. They have added their own little query language that you can use to manipulate things, so you can graph and relate two different metrics together. This is relatively new this year. When I first tried it out a long time ago, you could graph a metric and another metric, and they'd overlay, but you couldn't take the ratio between the two. However, it looks like this is the direction that they're going, and that's a good direction. I think they should continue adding things that way. I like being able to put the formulas in myself. I don't want the average. I want a rolling average over three minutes, not five minutes. They're getting better at letting the user customize this.
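The two operations this review asks for, a rolling average over a chosen window and the ratio of two metrics, can be sketched outside of any query language. The code below is a generic illustration over (timestamp, value) pairs; the function names are hypothetical and it does not use or reproduce Datadog's query syntax.

```python
from collections import deque


def rolling_average(samples, window_seconds):
    """Average of the samples within the last window_seconds at each point.

    samples: iterable of (timestamp_seconds, value), sorted by timestamp.
    """
    window = deque()
    total = 0.0
    result = []
    for ts, value in samples:
        window.append((ts, value))
        total += value
        # Drop samples that have fallen out of the window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        result.append((ts, total / len(window)))
    return result


def ratio_series(numerator, denominator):
    """Point-by-point ratio of two aligned series, skipping zero denominators."""
    return [(ts, a / b) for (ts, a), (_, b) in zip(numerator, denominator) if b != 0]


points = [(0, 1.0), (60, 2.0), (120, 3.0), (180, 4.0), (240, 5.0)]
print(rolling_average(points, window_seconds=180))  # three-minute window
```

Making the window a parameter is the point: a three-minute window is just `window_seconds=180` rather than a fixed five-minute default.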
Datadog is a comprehensive cloud monitoring platform designed to track performance, availability, and log aggregation for cloud resources like AWS, ECS, and Kubernetes. It offers robust tools for creating dashboards, observing user behavior, alerting, telemetry, security monitoring, and synthetic testing.
Datadog supports full observability across cloud providers and environments, enabling troubleshooting, error detection, and performance analysis to maintain system reliability. It offers...
We need better visibility into our consumption rate, which is tied to our commit levels. We would love to see a percentage consumed and get an alert if we are over budget, before getting an overage charge 20 days into the month. The biggest complaint we hear concerns the cost of the tool. It is pretty easy to accidentally consume a lot of extra data. Unless you watch everything come in almost daily, you could be in for a big surprise. We utilize the Datadog estimated usage metrics to build out alerts and dashboards. The usage and cost system page still doesn't tie into our committed spending; it would be wonderful to see the monthly burn rate on any given day.
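The percentage consumed and burn rate described above reduce to simple arithmetic. The sketch below assumes the usage-to-date figure has already been obtained somewhere (for example from the estimated usage metrics the reviewer mentions) and that usage accrues roughly linearly over the month; the function name and fields are hypothetical.

```python
import calendar
from datetime import date


def commit_burn(usage_to_date: float, monthly_commit: float, today: date) -> dict:
    """Compare month-to-date usage against a monthly commit and project month-end usage."""
    if monthly_commit <= 0:
        raise ValueError("monthly_commit must be positive")
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # Linear projection: assume the rest of the month looks like the days so far.
    projected = usage_to_date / today.day * days_in_month
    return {
        "percent_consumed": 100.0 * usage_to_date / monthly_commit,
        "projected_month_end": projected,
        "projected_percent": 100.0 * projected / monthly_commit,
        "over_budget_projected": projected > monthly_commit,
    }
```

Run daily, a check like this would flag a projected overage on day 10 rather than after the charge appears on day 20.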
One key improvement we would like to see in a future Datadog release is the inclusion of certain metrics that are currently unavailable. Specifically, the ability to monitor CPU and memory utilization of AWS-managed Airflow workers, schedulers, and web servers would be highly beneficial for our organization. These metrics are critical for understanding the performance and resource usage of our Airflow infrastructure, and having them directly in Datadog would provide a more comprehensive view of our system’s health. This would enable us to diagnose issues faster, optimize resource allocation, and improve overall system performance. Including these metrics in Datadog would greatly enhance its utility for teams working with AWS-managed Airflow.
They need an expansion of the Android and iOS apps to provide a simplified CI/CD pipeline history view. I like the idea of monitoring on the go. That said, it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed. In some cases the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
I'd like to see an expansion of the Android and iOS apps to have a simplified CI/CD pipeline history view. I like the idea of monitoring on the go; however, it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed. Sometimes, the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
I'd like to see an expansion of the Android and iOS apps to have a simplified CI/CD pipeline history view. I like the idea of monitoring on the go; however, it seems the options are still a bit limited out of the box. While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas - specifically .NET Profiling and Tracing of IIS-hosted apps - that need a lot of focus to pick up on the key details needed. In some cases the screenshots don't match the text as updates are made. I feel I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
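For readers who hit the same log-to-trace correlation problem, the underlying idea is that every log line carries the same service, environment, version, and trace ID as the trace it belongs to. The sketch below illustrates that idea with the Python standard library only. It is not the tracer's own injection mechanism: `current_trace_id`, `CorrelationFilter`, and `build_logger` are hypothetical stand-ins, and reading `DD_SERVICE`, `DD_ENV`, and `DD_VERSION` from the environment is an assumption about how the values are supplied.

```python
import contextvars
import logging
import os

# Hypothetical stand-in for the active trace ID that a tracer would provide.
current_trace_id = contextvars.ContextVar("current_trace_id", default="0")


class CorrelationFilter(logging.Filter):
    """Attach service metadata and the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.dd_service = os.environ.get("DD_SERVICE", "")
        record.dd_env = os.environ.get("DD_ENV", "")
        record.dd_version = os.environ.get("DD_VERSION", "")
        record.dd_trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    """Return a logger whose lines include the correlation fields."""
    logger = logging.getLogger("correlation-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s [dd.service=%(dd_service)s dd.env=%(dd_env)s "
        "dd.version=%(dd_version)s dd.trace_id=%(dd_trace_id)s] %(message)s"
    ))
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    return logger
```

If the environment variables are missing or misspelled, the fields come out empty and the log line cannot be matched to a trace, which is consistent with the kind of trouble the reviewers describe.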
The product can be improved by allowing the grouping of APIs to add variables. That way, any API with a unique ID could be grouped together. Furthermore, SEO monitoring has been crucial for us but also difficult to set up, as comparing alarms between us and competitors is a tough feat. Data is not always consistent, so we have been experimenting with removing the noise in Datadog, but it's been taking a while. Finally, Datadog should have a feature that reports stale alarms based on activity.
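The API grouping request above can be sketched as path normalization: replace any path segment that looks like a unique ID with a placeholder so that all calls to the same endpoint share one key. This is an illustration in plain Python, not a Datadog feature; the `{id}` placeholder and the rule that IDs are either all digits or UUIDs are assumptions.

```python
import re
from collections import Counter
from typing import Iterable

# A segment counts as an ID if it is all digits or a canonical UUID (assumption).
_ID_SEGMENT = re.compile(
    r"^(?:\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with the placeholder {id}."""
    return "/".join("{id}" if _ID_SEGMENT.match(seg) else seg for seg in path.split("/"))


def group_requests(paths: Iterable[str]) -> Counter:
    """Count requests per normalized endpoint."""
    return Counter(normalize_path(p) for p in paths)
```

For example, `/users/123/orders/9` and `/users/456/orders/12` both fall under `/users/{id}/orders/{id}`.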
It's not that straightforward when creating an alert. The syntax is a little confusing. I guess that the trade-off is customizability. But it would be nice to have a click-and-drag kind of way when creating an alert. So, if someone who isn't so familiar with Datadog or tech in general wanted to create an alert, they wouldn't need to know the syntax. It would also be great if AI could be used to generate alerts and graphs. I could write a short prompt, and then the AI could auto-generate alerts and graphs for me.
I honestly can't think of anything that can be improved. We've started using more and more features from our Datadog account and are really grateful for all of the different ways we can track and monitor our site. We did have an issue where a synthetic test was set up before the holiday break, and we were quickly charged a great amount. Our team worked with Datadog, and they were able to help us out since it was inadvertent on our end and was a user error. That was greatly appreciated and something that helped start our relationship with the Datadog team.
The monitors can be improved. The chart in the monitors only goes back a couple of hours, which is clunky. Also, the monitors could provide more info, like traces. We have many alerts connected to different notification systems, such as Slack and Opsgenie. When the on-call engineer receives notifications fired by the alerts, they are taken to the monitors. Yet often, we have to open up many different tabs to see logs, traces, and info that is not accessible on the monitors. I think it would make all of the on-callers' lives easier if the monitor had more data.
While Datadog is an excellent monitoring solution, it could be improved by building more features to replace alerting apps like OpsGenie and PagerDuty. Specifically, we'd like to see more advanced incident management capabilities integrated directly into the platform. This could include features like sophisticated on-call scheduling, escalation policies, and incident response workflows. Additionally, we'd appreciate more customizable machine learning-driven anomaly detection to help us identify unusual patterns more accurately. Improved support for serverless architectures, particularly for monitoring and tracing AWS Lambda functions, would be beneficial. Enhanced security monitoring and threat detection capabilities would also be valuable, potentially reducing our reliance on separate security information and event management (SIEM) tools.
One issue I do have with logs is the length of time they are on the platform. Some issues happen sporadically, so it would be good to have logs for longer than one month by default or make it a configuration. I have yet to try rehydrating logs, so this might be an option I need to try. Another issue I have is with the search syntax, which could be simpler. The syntax is a bit cumbersome, and there is not an intuitive way to save searches to look for similar ones in the future. Finally, while my company replaced a different tool for session replay with Datadog's version, I find it clunky and in need of further improvements. For example, when troubleshooting a web portal issue, it is super important to know what the user clicked, but the elements are not where they should be in the replay. It is also hard to find details about the sessions and metadata, such as user email, account, etc., that exist on other services with replay features.
For three to four months, we have been experiencing real-time delays. For example, if we're monitoring incoming traffic, the real-time status should be displayed up to a certain point. However, due to delays or issues with Datadog, the real-time data might only be updated at an earlier time. We are experiencing consistent delays in data updates from Datadog, with the most recent data often being delayed by about an hour. This issue has been ongoing for the past four months.
Datadog is expensive.
The solution needs to integrate AI tools.
The product needs to have a more enterprise approach to configuration.
Datadog isn't as mature as some of the established players like Dynatrace or Splunk. It's a new product, so they are constantly releasing new features, and I don't have much to complain about.
Overall, we really like the quality and relevance of all of the Datadog products that are currently being used. The documentation is very well organized and is the go-to place for us to find answers to our questions. We would really like to see more from the Service Catalog. It is something that we are interested in. However, some might think it lacks some key features at this time. We will definitely keep our eye out for this and adopt it when all the features are implemented. We're really looking forward to all the great things DD will do.
Integration should have been easier. It is very tough to go to all the services and enable Datadog integration for each AWS service. It should be possible to add the AWS services on one page and show only the services that are enabled. A similar approach should apply to any other integration. Lately, chat support has a longer waiting time. We would love to get faster chat support. We also need additional support for sending the flare files.
We need more integration with security tools like Drata.
Sometimes it’s difficult to customize certain queries to find specific things, specifically with the logging solution. I’ve used other logging platforms in the past that have extensive and mature query languages. These might not be super friendly to start out with, yet they can be very powerful. I wish there was more of an emphasis on query languages instead of the UI-based tooling that Datadog provides. Even though it is powerful on its own, the UI-based design lacks the elegance, efficiency, and complexity of a mature query language.
Custom-level metrics could be improved. Billing should be more transparent.
Delta traces on the Golang profiler are extremely expensive concerning memory utilization. In a Kubernetes environment where we would like to set per-pod memory allocations as low as possible, the overhead of that profiler feature is prohibitive. In one case, our pods (which were provisioned to target 250 MB and max at 500 MB memory) got stuck in a crash loop due to out-of-memory, which was caused entirely by the delta profiles feature of the profiler. Multistep Datadog synthetics lack the feature of basic arithmetic. For our use case, performing basic arithmetic on the output of previous steps to produce input for subsequent steps would be extremely useful.
The product needs a better Datadog agent installation.
There is not much that needs to be improved. The UI is super user-friendly. The deployment process is easy. We enjoy using the integrations with Slack and PagerDuty. Customer support is awesome from our experience. There is a lot of documentation for us to be able to use if we need to. I'm not sure if Datadog can monitor K8s deployments in real-time. For instance, being able to see a deployment step by step visually. This would be helpful if there were any incidents during the deployment. In general, Datadog is a great solution.
The logging could be improved in the future.
Datadog is so feature-rich that it is often hard to onboard new folks and tough to decide where to invest time. The APM is a perfect example of this. This feature alone has so much (profiling, tracing, span summary, flame graphs). I would love to see more of the insight and automation-focused features, such as the log patterns, where I can spend time more efficiently. The cost of Datadog at scale can get very expensive very quickly. I would like to see a better usage/cost dashboard with breakdowns like the AWS cost explorer.
I found the documentation can sometimes be confusing. I tried configuring APM for some of our Python containers, and I had to cross-reference multiple blog posts and the official documentation to figure out which Datadog Agent to use, whether I needed a ddtrace trace, what environment variables I should set, etc. Furthermore, when generating my own traces, I wasn't aware that ddtrace adds its own "monkey patching," which led to headaches with respect to configuring the service for RabbitMQ. A more unified and up-to-date documentation suite would be greatly appreciated.
The current way accounts are billed could be vastly improved - especially when involving multiple organizations across multiple accounts in combination with reserved commitments. Being able to have an automatic materialized report on certain dashboards that could be exported as PDF to be shared with non-Datadog users could help a lot. Other than that, we are more than happy with the features we use regularly.
Managing dashboards as IaC is a bit hard to work out at times. I use custom tools to convert JSON dashboards to Terraform resources. Ideally, I'd like for some sort of building tool for this to be built into the app. For example, a templating system that can easily be exported to IaC would be transformative for us. There are also some aspects of the API that can be a bit verbose - especially in the area of new features like SLOs - and take some time to understand. That said, overall, they're well-documented enough to be a minor concern for us.
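A converter of the kind described above can be quite small. The sketch below is not the reviewer's tool; it assumes the Datadog Terraform provider's `datadog_dashboard_json` resource, which takes the dashboard as a JSON string in its `dashboard` argument, and it simply wraps an exported dashboard in that resource. The function name and resource label are hypothetical.

```python
import json


def dashboard_json_to_terraform(resource_name: str, dashboard: dict) -> str:
    """Render an exported dashboard (as a dict) into a Terraform resource block."""
    body = json.dumps(dashboard, indent=2, sort_keys=True)
    # Escape sequences that Terraform would otherwise treat as template syntax in a heredoc.
    body = body.replace("${", "$${").replace("%{", "%%{")
    lines = [
        f'resource "datadog_dashboard_json" "{resource_name}" {{',
        "  dashboard = <<-EOT",
    ]
    lines += [f"    {line}" for line in body.splitlines()]
    lines += ["  EOT", "}"]
    return "\n".join(lines) + "\n"
```

Keeping the dashboard as JSON inside the resource avoids translating every widget into HCL, at the cost of less readable diffs than a fully typed resource would give.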
We need more integration functionality, including certain metrics integration. We should be able to monitor devs and need it to build more monitoring tools and offer leadership metrics.
The product is quite complex, and there are so many features that I either didn't know about or wasn't sure how to use. One thing that could be improved is somehow surfacing interesting or relevant products that might be applicable given our infrastructure. Additionally, the billing can sometimes be confusing and opaque, especially around not making it obvious what the implications can be if you add different AWS integrations. This has caused some unexpected costs in the past due to engineers not understanding how Datadog pricing works.
We primarily use the log management functionality, and the only feedback I have there is better fuzzy text searching in logs (the kind that Kibana has). I've learned about a ton of other offerings, like APM, NPM, etc., over the course of workshops. Once I try those out, I'm sure I will have additional feedback.
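As a rough illustration of the fuzzy matching the reviewer is asking for, the sketch below finds log lines containing a word that is close to, but not necessarily identical to, the search term. It uses Python's standard `difflib` and is not how Datadog or Kibana implement search; `fuzzy_find` and the 0.8 similarity cutoff are assumptions.

```python
import difflib
from typing import List


def fuzzy_find(term: str, lines: List[str], cutoff: float = 0.8) -> List[str]:
    """Return the lines that contain at least one word similar to `term`."""
    term = term.lower()
    hits = []
    for line in lines:
        words = line.lower().split()
        # get_close_matches returns the words whose similarity ratio is >= cutoff.
        if difflib.get_close_matches(term, words, n=1, cutoff=cutoff):
            hits.append(line)
    return hits
```

A search for `timeout` with this approach would also surface a line where the word was logged as `timout`.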
Sometimes, it takes a long time to load the dashboard if we have many charts.
Graph filters for logs need to be set manually which works well for JSON but not for unstructured logs. Making structured logs for high-performance applications is over our heads so we had to dump some technical streams for our logs.
Datadog could be improved if it could detect other software in a container or server. Datadog is better than other APM or observability tools, but it focuses mostly on telling the customer what they need to know about the software, database, or applications that land on the server. We also need to know the version before setting up an agent with the APM modeling tool. In some instances, ownership of a particular piece of software changes to another person, and the previous owner did not transfer the knowledge or data needed to manage the server. The new person needs to monitor this server, and they need to know what software or version of software was installed on it before they use the APM agent for monitoring. If Datadog could provide this insight, it would improve how we use the solution. In a future release, we would like to be able to complete a network traffic or network flow analysis to detect the errors or problems on the network.
Datadog needs more local Asia-Pacific support, and if they don't have a SaaS solution in Asia-Pacific, they should offer an on-prem version. I'm told that's not possible.
I haven't really noticed anything that they could improve upon. Maybe they could add in some features to go both ways, to maybe make some configuration changes, etc. That's a little bit outside of what Datadog does, though. It's really very full-featured, so I don't really have any complaints. I haven't really fully looked at the documentation as I know where I need to go and look at things. It could probably be a little bit of a better user experience. There are so many functions there that sometimes navigating your way around is a little bit hard. They have a really nice menu system. However, there's so much there. It's possible that I skipped a guided tour when I started. It’s not intuitive to everyone. There are a lot of technical features.
Datadog could improve the flexibility with AI and ML concepts. This will allow customers to be more leveraged towards publishing.
I'd like to see more flexibility in the customization. They have a few settings which need to be changed, but we are unable to make those changes as users or as the administrator. The tagging to get the different parts of the monitoring interconnected is a bit tricky and takes time to work out.
Datadog has a lot of features crammed into one dashboard. It's quite hard to work out which feature does exactly what. There was a steep learning curve, trying to navigate through menus. The menu navigation could improve. If there was a more straightforward way of adding new functions or features to where each menu is placed, that would be an improvement.
The setup was a bit complex. As Datadog is a bit on the expensive side, I would recommend it for simple, uncomplicated, solutions.
It could use some additional features when working with metrics like Grafana or like New Relic has. Datadog does not use library technologies like Dynatrace does. Datadog has machine learning too, but it does not have this option in all layers of monitoring like infrastructure service process in applications.
They could look into improving the integration. I'd like to see better pricing and more integration in the next release.
While I like the ease of use, when compared with Tenable Nessus they could still improve their usability. They are okay, but there is room to be better. They could have more integration. They could be more intuitive as well, for example, in the intuitiveness of the user interfaces and how long it takes for users to learn how to use Datadog. It is not impossible to use, or impossible to do the administration with it, but when you put these two next to each other, meaning Nessus and Datadog, Nessus comes out as the winner.
Its pricing model can be improved. Its settings should be improved for a better understanding of billing. They should also provide some alerts when there is an increase in usage. For example, if usage increases by 20% or more from one week to the next, the customer should get an alert.
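The week-over-week alert suggested above amounts to a percentage comparison of two usage totals. A minimal sketch follows, assuming weekly usage totals are already available; the function names are hypothetical and the 20% default comes from the reviewer's example.

```python
def week_over_week_change(previous_week: float, current_week: float) -> float:
    """Percentage change in usage from the previous week to the current one."""
    if previous_week <= 0:
        raise ValueError("previous_week must be positive to compute a percentage change")
    return (current_week - previous_week) * 100.0 / previous_week


def should_alert(previous_week: float, current_week: float, threshold_percent: float = 20.0) -> bool:
    """True when usage grew by at least `threshold_percent` week over week."""
    return week_over_week_change(previous_week, current_week) >= threshold_percent
```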
It can have a more modernized pricing mechanism. We're actually working with them to figure out how to become more modular and have a better and more modernized pricing mechanism. The issue with Datadog is that you have to buy the whole suite of different products, and you kind of get stuck in the old utilization of 40% of their suite. Most organizations today break down between application development, networking, and security. Therefore, there should be a way to break down different modules into just app dev, infosec, networking, etc. Customers have various needs across their business lines, and sometimes, they're just not willing to have tools that they're not using 100%. AppDynamics is probably a little bit better in terms of being modular.
Continued improvement around cost and pricing model is needed. It is pretty complex and takes a fair amount of intimate knowledge to know exactly how turning on a single function is going to impact your bill, especially when you don't see the metrics for a day or two. We have recently had a number of issues with stability and delays on logging, monitoring, metric evaluation, and alerts. More often than not in the past month, it seems that we get the banner across the top of our dashboards that some service is impacted. These incidents don't always show up on the incident page, either.
We need the ability to create a service dependency map like Splunk ITSI. We have to build this in PagerDuty and it's not the best user experience. The ability to create custom inventory objects based on logs ingested would be a value add. It would be better if Datadog makes this a simple click and enable. It would be helpful to have the ability to upgrade agents via the Datadog portal. Once agents are connected to the Datadog portal, we should be able to upgrade them quickly. Security monitoring for Azure and Operating System (Windows and Linux) are features that need to be addressed. Dashboards for Azure Active Directory metrics and events should be improved.
The incident management beta looks promising, but it is still missing the ability to automatically create incidents based on certain alerts. SLOs are also a great way to visualize how you are doing with regard to the level of service that you are providing, but they are missing crucial components like:
* The ability to visualize the remaining error budget and how it evolved during the month. An error budget burndown graph would be helpful.
* The ability to display a different level of alert on an SLO based on how fast it is consuming the error budget. This is the slow burn versus fast burn.
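Both SLO requests above can be expressed with a little arithmetic. The sketch below is an illustration in plain Python, not a Datadog feature: `budget_burndown` produces the data points for a burndown graph, and `burn_rate` compares the observed error rate with the rate the SLO allows (a value of 1 means the budget runs out exactly at the end of the period). The function names and the default thresholds in `classify_burn` are assumptions chosen for illustration.

```python
from typing import List


def budget_burndown(daily_failures: List[int], allowed_failures: int) -> List[float]:
    """Fraction of the error budget remaining after each day (negative once overspent)."""
    remaining = []
    consumed = 0
    for failures in daily_failures:
        consumed += failures
        remaining.append((allowed_failures - consumed) / allowed_failures)
    return remaining


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return (failed / total) / (1.0 - slo_target)


def classify_burn(rate: float, fast_threshold: float = 14.4, slow_threshold: float = 1.0) -> str:
    """Map a burn rate to an alert level; thresholds are illustrative assumptions."""
    if rate >= fast_threshold:
        return "fast burn"
    if rate >= slow_threshold:
        return "slow burn"
    return "ok"
```

A fast burn would typically page someone immediately, while a slow burn could open a ticket, which is the distinction the reviewer is asking to see on the SLO itself.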
In the past two years, there have been a couple of outages.
Their logging solution is expensive for our use case. They do have the capability to rehydrate old or incomplete logs, and it works, but I would rather not have to think about that operation. Datadog has a lot of documentation, but a lot of that documentation assumes you know how the service works, which can lead to confusion. On a positive note, they do have lots of documentation; it just needs better curation. Their APM solution still needs some work, but they are actively developing it. I would also like to see more database-specific application monitoring.
Please add PHP profiling; you already have it for other popular programming languages such as Python and Java, which is great because we have a little bit of those, but our main app is powered by PHP and we don't have profiling for this yet. I guess it's only a matter of time for this to be added, so in the meantime, you can consider this review as a vote for PHP profiling support. The pricing model could be simplified as it feels a bit outdated, especially when you look at the billing model of compute instances vs. container instances.
More pre-configured "Monitor Alerts" would be helpful. Datadog's knowledge of its customers and what they are looking for in terms of monitoring and alerting could be taken advantage of with pre-canned alerts. They have started this with "Recommended Monitors". That feature was very helpful when configuring our Kubernetes alerts. More would be even better. Datadog tech support is very good. One area that could be more helpful is actually talking to someone or sharing your screen to help troubleshoot issues that arise. For new cloud engineers just coming into the cloud monitoring field, there is a learning curve. There is a lot to learn and figure out. For example, we still ran into some issues configuring the private link and more videos of how to do things could be of use.
The error traceability is an area that can be improved. This is something that helps us to pinpoint the area where a problem is occurring. It is a function stack, and it should be showing us how each function is defined.
I believe there is room for improvement with this solution. It wasn't easy for me to get a quick understanding of what this tool offers us as opposed to the added tools of AWS. By that, I mean in regards to finding a better way to apply some filters or to create some alarms. I don't get more advanced features in comparison to AWS but at least I get a centralized way of doing things, which can be done on the AWS side as well. It's more complicated because you have to configure some other services to stream their logs from multi accounts to one account. It could be more user friendly and include advanced examples in the documentation showing some use cases or customer case studies, so you can get a clear idea that this functionality provides something extra.
Datadog lacks a deeper application-level insight. Their competitors had eclipsed them in offering ET functionality that was important to us. That's why we stopped using it and switched to New Relic. Datadog's price is also high.
Additional metrics should be included. Better integration with other solutions is needed.
There are things about it that we would like to be fixed, such as it taking averages of averages. This results in data that we don't expect, but overall we are happy with it.
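The averages-of-averages problem mentioned above is easy to reproduce: when groups have different sizes, averaging their averages weights a group with one sample the same as a group with many. The sketch below uses made-up numbers and hypothetical function names to show the difference.

```python
from typing import List


def average_of_averages(groups: List[List[float]]) -> float:
    """Average each group first, then average those results (each group counts equally)."""
    group_means = [sum(g) / len(g) for g in groups]
    return sum(group_means) / len(group_means)


def overall_average(groups: List[List[float]]) -> float:
    """Average over every individual sample (each sample counts equally)."""
    total = sum(sum(g) for g in groups)
    count = sum(len(g) for g in groups)
    return total / count
```

With four samples of 10 in one group and a single sample of 100 in another, the average of averages is 55 while the overall average is 28, which is the kind of unexpected result the reviewer describes.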
The product could do better with its notifications. I want more technical support rather than conferences, because technical support makes setting up the product much easier.
Some of their newer solutions are interesting, like their logging, but they are not fleshed out. They could use more metrics or synthetics, which would be really helpful. I would love to see support for front-end and mobile applications. Right now, it is mostly all back-end stuff. Being able to do some integration with our front-end products would be awesome.
The only thing that they were missing that has thrown us from the beginning (they are still missing it) is consistency in the APIs. There are a couple of guys on the automation side who complain rightfully over how hard it is because every new feature which comes out has a new way of interfacing with the API. This was our big red flag in the beginning, but given the price and other features, it wasn't enough for us to discount. We said that we would live with this one red flag, but it is still a red flag. Stability of the product has been a concern for us outside of the primary monitoring agents. It does not have the best interface.
I would like testing for data in the future. That would be really nice. Also, I would like some additional enhancement in the visuals.
We want to reduce having to go to different screens to obtain all the information. However, they are moving in the right direction from what we have noticed.
The on-premise version is very difficult to upgrade.