We use Datadog for monitoring the performance of our infrastructure across multiple types of hosts in multiple environments. We also use APM to monitor our applications in production. We have some Kubernetes clusters and multi-cloud hosts with Datadog agents installed. We have recently added RUM to monitor our application from the user side, including replay sessions, and are hoping to use those to replace our existing error monitoring and session replay tooling for debugging issues in the application.
We use the solution to monitor and investigate issues with production services at work. We periodically review the service catalog view for the various applications, and I use it to identify anomalies in service metrics, changes in user behavior evident via API calls, and spikes in errors. We use monitors to trigger alerts for on-call engineers to act upon. The monitors have set thresholds for request latency, error rates, and throughput. We also use automated rules to block bad actors based on request volume or patterns.
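As a rough illustration of the threshold monitors described in the review above, here is a minimal sketch using Datadog's official `datadog` Python client; the metric query, thresholds, and notification handle are hypothetical, not the reviewer's actual configuration.

```python
# Minimal sketch: a latency monitor with warning/critical thresholds.
# Assumes the official "datadog" Python package; all names are hypothetical.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:trace.http.request.duration{service:checkout} > 0.5",
    name="High request latency on checkout",
    message="Average latency above 500 ms for 5 minutes. @oncall-handle",
    tags=["team:platform", "service:checkout"],
    options={
        "thresholds": {"critical": 0.5, "warning": 0.3},  # seconds
        "notify_no_data": False,
    },
)
```

A similar monitor on error rate or throughput only changes the `query` and `thresholds`; the handle mentioned in `message` (here the hypothetical `@oncall-handle`) is what routes the alert to the on-call engineer.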
Our primary use case for Datadog is to monitor, analyze, and optimize the performance and health of our applications and infrastructure. We leverage its logging, metrics, and tracing capabilities to pinpoint issues, track system performance, and improve overall reliability. Datadog’s ability to provide real-time insights and alerting on key metrics helps us quickly address issues, ensuring smooth operations. It’s integral for visibility across our microservices architecture and cloud environments.
Senior Manager, Site Reliability Engineering at Extra Space Storage
Real User
Top 20
2024-09-18T20:43:00Z
Sep 18, 2024
The product monitors multiple systems, from customer interactions on our web applications down to the database and all layers in between. RUM, APM, logging, and infrastructure monitoring are all surfaced into single dashboards. We initially started with application logs and generated long-term business metrics out of critical logs. We have turned those metrics and logs into a collection of alerts integrated into our pager system. As we have evolved, we have also used APM and RUM data to trigger additional alerts.
Software Engineer at a computer software company with 201-500 employees
User
Top 20
2024-09-18T19:24:00Z
Sep 18, 2024
Our primary use case for Datadog involves utilizing its dashboards, monitors, and alerts to monitor several key components of our infrastructure. We track the performance of AWS-managed Airflow pipelines, focusing on metrics like data freshness, data volume, pipeline success rates, and overall performance. In addition, we monitor Looker dashboard performance to ensure data is processed efficiently. Database performance is also closely tracked, allowing us to address any potential issues proactively. This setup provides comprehensive observability and ensures that our systems operate smoothly.
Application Development Team Lead at TCS EDUCATION SYSTEM
User
Top 20
2024-09-18T18:11:00Z
Sep 18, 2024
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting. We run a mix of AWS EC2, Azure serverless, and colocated VMWare servers to support higher education web applications. Managing a hybrid multi-cloud solution across hundreds of applications is always a challenge. Datadog agents on each web host and native integrations with GitHub, AWS, and Azure get all of our instrumentation and error data in one place for easy analysis and monitoring.
We use the solution to monitor production service uptime/downtime, latency, and log storage. Our entire monitoring infrastructure runs off Datadog, so all our alarms are configured with it. We also use it for tracing API performance and identifying the biggest regression points. Finally, we use it to compare performance on SEO metrics versus competitors. This is a primary use case, as SEO dictates our position in Google traffic, which drives a large portion of our customer views, so it is a vital part of the business that we rely on Datadog for.
We have several teams and several different projects, all working in tandem, so there is a lot of logging and monitoring that needs to be done. We use Datadog mostly for alerting when things go down. We also have several dashboards to keep track of critical operations and to make sure things are running without issues. The Slack messaging is essential to our workflow, letting us know when an alert is triggered. I also appreciate all the graphs you can make, as they give our team a good overview of how our services are doing.
We currently have an error monitor to track errors in our prod environment. Once we hit a certain threshold, we get an alert on Slack. This helps us address issues the moment they happen, before our users notice. We also utilize synthetic tests on many pages on our site. They're easy to set up and are great for pinpointing when a shipped bug takes down a less-visited page that we wouldn't otherwise notice immediately. It's a great extra check to make sure the code we ship is free of bugs.
Our company has a microservice architecture, with different teams in charge of different services. It is also a startup, which means that we have to build fast and move very fast as well. Before we were properly using Datadog, we often had issues with things breaking, but without much information on where in our system the breakage happened. This was quite a big time sink, as teams were unfamiliar with other teams' code, so they needed the help of those teams to debug. This slowed our building down a lot. Implementing Datadog traces fixed this.
Our primary use case for this solution is comprehensive cloud monitoring across our entire infrastructure and application stack. We operate in a multi-cloud environment, utilizing services from AWS, Azure, and Google Cloud Platform. Our applications are predominantly containerized and run on Kubernetes clusters. We have a microservices architecture with dozens of services communicating via REST APIs and message queues. The solution helps us monitor the performance, availability, and resource utilization of our cloud resources, databases, application servers, and front-end applications. It's essential for maintaining high availability, optimizing costs, and ensuring a smooth user experience for our global customer base. We particularly rely on it for real-time monitoring, alerting, and troubleshooting of production issues.
Application Engineer at Discover Financial Services
User
Top 20
2024-06-25T16:25:00Z
Jun 25, 2024
We have a tech stack including all backend services written in TS/Node (mostly) and as a full stack engineer, it is crucial to keep track of new and existing errors. Our logs have been consolidated in Datadog and are accessible for search and review, so the service has become a daily tool for my work. More recently, session replay has been adopted at my company, but I do not like it so much because the UI elements are not in their place, so it is very hard to see what the users on the web app are actually clicking on.
Delivery Manager, DBA Services at a manufacturing company with 10,001+ employees
Real User
Top 20
2023-01-25T15:49:08Z
Jan 25, 2023
We use Datadog for monitoring to get the traces and logs of all our applications. Datadog provides dashboard and alert capabilities to identify if something is wrong with various teams. More than 200 users, mostly software engineers, work with Datadog.
Our primary use case is using the dashboards and getting proper insights based on them. The monitoring, SLOs, and SLAs have been better and easier since we started using the Terraform infrastructure. APM has been easier, as we only had to enable it through the CronJob directly. Profiling has been made easier as well; we are able to get many insights into the code, and profiling provides really good insights right now. Logs are the most valuable feature and the best solution so far. Datadog can help solve any slow queries or database-related errors.
Software Engineering Manager at a healthcare company with 501-1,000 employees
Real User
Top 20
2022-12-06T21:07:00Z
Dec 6, 2022
We mainly use the product to monitor our infrastructure and apps. It is the go-to tool when we want to check that things are running properly. We use Datadog synthetic monitors to ensure our app works across different locations in the United States. We also have set up Datadog monitors to send alerts if things stop working as expected. We use Continuous Integration Pipeline visibility to make sure our developers are not being blocked by infrastructure and other things that might be out of their control.
Software Developer at a pharma/biotech company with 51-200 employees
Real User
Top 20
2022-12-06T20:54:00Z
Dec 6, 2022
We're currently using logging, monitoring, metrics, APM, etc. We've started to use SLOs; however, it takes a bit of time to work through those. RUM has been very useful. I have used it in the past to debug problems in production, which has been great. We also want to start using synthetics and tracing more. Our application currently runs in many different environments based on our customers' requirements. This allows us to see everything in one place and filter by environment as required, which is extremely useful.
Product Manager, Delivery Engineering at a media company with 1,001-5,000 employees
Real User
Top 20
2022-12-06T20:48:00Z
Dec 6, 2022
The main use case is observability and reliability as part of a platform/delivery engineering solution. We use the product to assist tenants and clients within the company to get more ramped up on SRE/DevOps.
We use Datadog to monitor our Kubernetes clusters. We have 3 different clusters for different parts of the SDLC. We run the Datadog agent DaemonSet as well as the Datadog cluster agent. Our services have the APM installed by default. To create monitors, we use Terraform. This is provided out-of-the-box for our service owner. We run EKS on top of K8s, therefore, we also make use of some of the AWS monitoring capabilities that can be integrated into Datadog. We are hugely reliant on Datadog for all aspects of our system.
Software Engineering Manager at a hospitality company with 1,001-5,000 employees
Real User
Top 20
2022-12-06T20:16:00Z
Dec 6, 2022
We primarily use the solution for application monitoring (APM, logs, metrics, alerts). It's useful for active monitoring (static monitors, threshold monitors). We get a lot of value out of anomaly detection as well. SLOs and monitoring of SLOs have been another value add. In terms of metrics, the out-of-the-box infrastructure metrics that come with the Datadog agent installation are great. We have made use of both the custom metrics implementation as well as the log-based metrics which are extremely convenient. We also leverage Datadog for use of RUM and want to explore session replay.
Senior Software Engineer at a transportation company with 51-200 employees
Real User
Top 20
2022-12-06T19:56:00Z
Dec 6, 2022
We primarily use Datadog for alerts. If we're running out of database connections or CPU credits we want to find out in Slack. Datadog provides nice features for that. Secondarily, we use Datadog for analyzing historical trends and forecasting potential issues. I'm trying to learn how to add in Continuous Profiler in our primary backend servers and set up Synthetic Tests for monitoring our front end. Everything is mostly on AWS, and the Datadog integrations help a ton.
Atlassian Expert at a tech consulting company with 51-200 employees
Real User
Top 20
2022-12-06T19:50:00Z
Dec 6, 2022
We are providing managed services to our customers across multiple industries. Datadog is key to delivering these services by bringing the observability, monitoring, and alerting capabilities we need to operate at scale. We operate custom cloud-native workloads as well as ISV products such as Atlassian Jira or Confluence. Integrating Synthetics, infrastructure, and application performance monitoring, as well as piping all logs through Datadog, allows us to do more with less, with good alerting right on time.
Senior Site Reliability Engineer at a tech vendor with 10,001+ employees
Real User
Top 20
2022-12-06T19:42:00Z
Dec 6, 2022
Datadog provides us with a solution for ingesting all of our application metrics, resource metrics, APM/tracing data, etc. We use it for dashboards, monitoring/alerting, SLO targets, incident response, etc. We have a lot of applications across multiple languages/frameworks and have deployed in Kubernetes across multiple regions in AWS, along with underlying managed resources such as SQS, Aurora, etc. Datadog makes understanding the state of these seamless. We are a company with millions of daily active users, and this level of detail is excellent.
Software Engineer at a tech vendor with 1,001-5,000 employees
Real User
2022-10-26T05:30:00Z
Oct 26, 2022
We use the solution for application hosting and a little bit of everything when it comes to supporting a worldwide logistics tracking service. It's used as a central service for collecting telemetrics and logs. We find it does the same work as all of our old tools combined, including Prometheus, Kibana, Google Logs, and more; putting all of this information in a single platform makes it easy to corroborate information and associate a request with the data, which might be lost when it is saved as logs.
Software engineer at a construction company with 1,001-5,000 employees
Real User
2022-10-26T00:18:00Z
Oct 26, 2022
We use the solution for monitoring time spent on views and events triggered. For example, for one of our products, we have created a custom dashboard that lets us track all the custom events and multiple entry points in the same part of the application. Knowing the entry point helps us choose which part of the program should be improved next. It also helps us with collecting important data about the overall usage of each module within our application.
Cloud Specialist at a financial services firm with 501-1,000 employees
Real User
2022-10-26T00:14:00Z
Oct 26, 2022
We collect all data logs from all operating systems, such as Windows, Linux, VMware, and bare-metal data centers. We also automate the installation of the agent on servers. Now we are starting a POC to analyze the APM module. In the future, the next step is to do a POC of the security modules. The final idea is to have a single portal for observability. This will make troubleshooting easier for level 1 and level 2 support.
Manager at a financial services firm with 501-1,000 employees
Real User
2022-10-26T00:10:00Z
Oct 26, 2022
We use the solution for logs from all our applications. For monitoring logs in Datadog, our team created automation to implement logging at scale across all our systems. Now, we are deploying it in our core systems.
Software Engineer at a comms service provider with 11-50 employees
Real User
2022-10-26T00:06:00Z
Oct 26, 2022
We use different tools for log collection and monitoring. Using Datadog will combine different use cases into one product that will be easier to manage. The tools we use are open-source, so there is no commercial support; having customer support would be ideal since we're a small team. Profiling would be another great feature to have. Currently, it's manual. Having Datadog would give us a standard, and we wouldn't have to do as much manual work.
Devops Engineer II at a comms service provider with 11-50 employees
Real User
2022-10-26T00:02:00Z
Oct 26, 2022
We use the solution for monitoring our logs across distributed clusters. Right now, we have an Elasticsearch solution that is tied to each platform (our product is a PaaS solution). We are looking at moving to a single pane of glass solution, which Datadog would be good for (plus, we could wrap up other tools like Prometheus, Grafana, Pagerduty, Pingdom, and more). We want to be able to have Datadog running on one single cluster and ingesting and processing logs from all our distributed clusters.
Lead Support Engineer at a tech vendor with 11-50 employees
Real User
2022-10-25T23:57:00Z
Oct 25, 2022
Our use case is mainly deploying it into our applications for monitoring/logging observability. We currently have our microservices feed into an actuator that exists in each instance of our application and extends to a local and central Grafana for client and internal visibility. Logging captures application and system logs, which are ported to each application instance for querying. Whenever anything considered unhealthy occurs, based on a range of health checks, we have notification rules configured internally and externally for a prompt response time.
Senior IT Manager at a financial services firm with 1,001-5,000 employees
Real User
2022-10-25T23:50:00Z
Oct 25, 2022
The main use cases are to provide visibility to costs for each product in the company as well as to consolidate all the observability in one tool. We are moving the team from being an operational team that needs to keep the tool up and running (applying patches and resolving problems) to a team that is focused on providing meaningful visibility of the systems, applications, and services of the company. We want to add value where the developers and the systems administrators are not able to focus.
Cloud Engineer at a retailer with 51-200 employees
Real User
2022-10-25T23:34:00Z
Oct 25, 2022
I am using the solution for monitoring metrics, logs, traces, etc. It's mainly for making dashboards as well as monitoring our services. We also use Datadog to help centralize our incident management to show the logs, where issues spiked, and some metrics. We use Datadog to do troubleshooting in Kubernetes, specifically in our Azure Kubernetes service. Beyond that, we are looking to use open telemetry in tandem with Datadog to further our log-tracing efforts. In the future, this may be expanded.
Senior Software Engineer at an insurance company with 10,001+ employees
Real User
2022-10-25T23:30:00Z
Oct 25, 2022
I have been using Datadog products and capabilities increasingly over the last 4 years, from POC to widespread adoption. The capabilities we use are unique to each use case and can be combined in various ways to provide the full observability coverage needed to maintain stable operations and shift from being reactive to proactive. Our organization uses site/service reliability for the range of backend and frontend services, custom monitoring, and dashboards that can be dynamic and reused across multiple teams.
We use real user monitoring and have set up thresholds for alerts to PagerDuty, Sentry, Slack, and so on. We also have dashboards set up for tracking latency and error rates. As an individual contributor, I also try to set up dashboards for the individual feature projects I work on. I'd like to learn more ways to use this, though, especially when it comes to more proactive approaches to issues. A starter pack of common use cases would be nice.
Infrastructure Engineer at an insurance company with 10,001+ employees
Real User
2022-10-25T23:23:00Z
Oct 25, 2022
Our use case is to provide cloud organization application monitoring. I use it for insight into what host in what region has activity or what market is using Datadog to its fullest potential and utilizing that for cost. This may also help determine who is using monitoring and setting alerts or just setting up monitoring and not doing anything about it. The use case can also be to check when the host or applications are down, or if the usage of CPU, memory, etc, is too high.
We use Datadog to view and aggregate logs and monitor all of our services. We have a lot of running infrastructure and it is very convenient to have logs and metrics all aggregated somewhere we can view and chart them. I use Datadog to create dashboards and runbooks, and sharable graphs, which really help out my whole team. We mostly use logs and APM, yet have been starting to use other products. I would like to use more synthetic monitors.
We use the application for our application monitoring, data security monitoring, and log management. What we like about it is that it helps us track issues proactively instead of reactively. There are other improvements we would like to see:
1. Being able to restrict users from seeing or viewing specific dashboards once they log in.
2. Lower prices for Cloud SIEM. It seems very useful; however, the prices are high, and some organizations are finding it difficult to make a decision about getting the tool.
Infrastructure Engineer at a tech services company with 11-50 employees
Real User
2022-10-25T22:42:00Z
Oct 25, 2022
We currently use it for log aggregation and SIEM. We send logs from our AWS account (particularly our CloudTrail and S3 logs) and use them to give us security signals. This has helped with our SOC 2 certification process and has given us a window into our processes and the security holes in our system. We are also considering using the APM features to help with our development effort. We want to be able to profile all of our code and see what is going on with it.
We deploy various services for our main platform on AWS across multiple regions. We have a development environment, a staging environment, a QA environment, and a production environment. We deploy our many services across hundreds of instances. We have many server farms, all responsible for various services on our market intelligence platform. The deployment of each server farm, or even of individual instances, varies depending on what was stood up. We have instances built in three different ways, with two different pipelines, and some even with user data scripts.
Associate at a financial services firm with 10,001+ employees
Real User
2022-10-25T22:34:00Z
Oct 25, 2022
We use the product for recording loggers on our various services across different teams. For example, we use logs to keep track of info logs for events and error logs to catch exceptions. When users ask us to investigate a situation, we use logs to keep track of events and where the user's code traveled to. We also use synthetic testing and monitoring features to keep track of our many alerts in the production and QA environments.
We use Datadog for general observability into our infrastructure, as well as running analytics queries for our SLI/SLO platform. This helps all of our teams be informed of how well their products are actually performing in production, and aim their efforts at the thing that will provide the highest ROI. We also use it for general monitoring and alerting during load tests and service releases to detect any issues related to the deployments. This helps us maintain our high contractual uptime promises to our clients.
Software Engineer at a financial services firm with 10,001+ employees
Real User
2022-10-25T22:23:00Z
Oct 25, 2022
We use it to monitor and alert on our ECS instances as well as other AWS services, including DynamoDB, API Gateway, etc. We have it connected to PagerDuty for alerting on all our cloud applications. We also use custom RUM monitoring and synthetic tests for both our internal and public-facing websites. For our cloud applications, we can use Datadog to define our SLOs and SLIs and generate dashboards that are used to monitor SLOs and report them to our senior leadership.
Cloud Engineer at a tech services company with 10,001+ employees
Real User
2022-10-25T22:19:00Z
Oct 25, 2022
We are using the solution for scaling up the website for market data applications. EC2 and Datadog have enabled high-level monitoring of the underlying infra and services. The Datadog profiler comes in handy to pinpoint issues with resource utilization during peak hours, and traces/log management helps narrow down the root cause. The network map is crucial in identifying bottlenecks and determining what needs more attention. The host map helps identify problematic hardware and devise ways to counter issues that arise during scaling and deploying solutions on the cloud.
We are using Datadog for server metrics, log aggregation and searching, system monitoring, alerting the team about errors, and dashboards for our developers. It's used by the Site Reliability Engineering team and Management of all levels. It's assisting us in proving SOC II compliance. We're looking to improve our usage of Datadog's RUM and APM components to get better and more performance insights on our production environments. We're also looking to leverage more synthetic monitors and runbooks for anyone responding to incidents.
Sr Platform Engineer at a pharma/biotech company with 11-50 employees
Real User
2022-10-25T22:11:00Z
Oct 25, 2022
We use it mostly for logging log messages from our Kubernetes and EC2 instances, for example, system messages and errors. Also, we want log messages from our firewalls and other network infrastructure in case of network issues. We intend to use it for application logging, et cetera, to get insight into internal problems in the applications in Kubernetes pods. We want to use it for monitoring in case of system problems and hardware failures so that it can notify us.
We primarily use the solution for monitoring applications and informing customers via PagerDuty and Statuspage. The monitoring and alerts can be personalized internally, and we are able to find problems and issues. The response time monitoring has been great, and it has been useful for validating upgrades; we can check in to see which step fails.
Lead Architect at a computer software company with 11-50 employees
Real User
2022-10-25T21:56:00Z
Oct 25, 2022
We primarily use the solution for log management and application performance monitoring. We have been getting into using more solutions on Datadog, such as runbooks, monitoring, and dashboards. Another area that we've been investing some time in is the database monitoring. We've been able to get some relatively new employees onboarded into the tool, and they've been able to create some meaningful dashboards and reports without too much hand-holding at all. We plan on exploring the synthetics solution as well.
Product SRE at a computer software company with 51-200 employees
Real User
2022-10-25T21:52:00Z
Oct 25, 2022
We use Datadog for application logs, error tracking, performance tracking, alerting, and overall production state surveillance. It helps us improve observability and ease of maintenance through better information for our support teams and their issue qualification. We also use dashboards to keep all the information at the ready and easy to access, and SLOs, notably for our uptimes but also for our feature usage. It also feeds our alerting for our on-call SREs into PagerDuty, launching alerts when specific parameters are exceeded.
We use Datadog for three main use cases:
* Infrastructure and application monitoring, ensuring that our services are available and performant at all times. This allows us to proactively address incidents and outages without customers contacting us. It includes monitoring of cloud resources (databases, load balancers, CPU usage, etc.), high-level application monitoring (response times, failure rates, etc.), and low-level application monitoring (business-oriented metrics and functional exceptions to the customer experience).
* Analyzing application behavior, especially around performance. We often use Datadog's application performance monitoring on non-production environments to evaluate the impact of newly introduced features and gain confidence in changes.
* End-to-end regression testing for APIs and browser-based experiences. Datadog's synthetic testing periodically checks that the system behaves in exactly the correct way. This is often used as a canary to detect issues even before users reach them organically.
We primarily use the solution for RUM, security monitoring, and streams. We need to monitor users and what they access, identify security loopholes and attack patterns, and identify and quickly respond to issues. We can identify pushbacks, get insight into how application components stack up with each other, and understand which components, libraries, and code to alert teams about. Using Datadog, we can raise incidents, track incidents to completion, and gather data for reporting and post-mortems. The solution allows us to track fixes and their test coverage. With it, we gain confidence in the fix/improvement phase and are able to provide a response.
Production engineer at a consultancy with 51-200 employees
Real User
2022-10-25T21:28:00Z
Oct 25, 2022
We have deep integration with Datadog for observability and monitoring. We use everything from APM, logs, and RUM to monitor and dashboards for tracking system health. We are trying to move from many different solutions for error tracking/observability to a single platform (Datadog). We are currently in the process of setting up logging in Datadog in order to maintain our logs better. We are looking to create more insights into the real user flows by using real user monitoring (RUM) too.
We primarily use Datadog for:
* Native memory
* Logging
* APM
* Context switching
* RUM
* Synthetics
* Databases
* Java
* JVM settings
* File I/O
* Socket I/O
* Linux
* Kubernetes
* Kafka
* Pods
* Sizing
We are testing Datadog as a way to reduce our operational time to fix things (mean time to repair). This is step one. We hope to use Datadog as a way to be proactive instead of reactive (mean time to failure). So far, Datadog has shown very good options to work on all of our operational and development issues. We are also trying to use Datadog to shift left and fix things before they break (MTTF increase).
Production Engineering at a construction company with 51-200 employees
Real User
2022-10-25T21:04:00Z
Oct 25, 2022
We use Datadog incident management for our incident tooling. Whenever we run into an incident, we try to use it. It allows us to create a separate Slack channel for it.
Senior Cloud Engineer at a comms service provider with 10,001+ employees
Real User
2022-10-25T21:00:00Z
Oct 25, 2022
We use the solution primarily for platform monitoring of the services that are deployed in AWS. It gives us a better way to monitor the services, including pods, cost, high availability, etc. This way, observability is ensured and customer services are uninterrupted. We also host the data pipelines between the cloud and on-prem, for which Datadog is used to ensure better service. We report issues based on the metrics reported in it.
Data Engineer II at a comms service provider with 10,001+ employees
Real User
2022-10-25T20:54:00Z
Oct 25, 2022
We ingest data from various sources to monitor the system's log metrics and enable an alert mechanism to notify teammates if something goes wrong. More specifically, having Datadog agents as integrations with different services provides easy access and management.
We use an enterprise version of a CMS platform which is enabling businesses to transmit content to their customers. The tool is fully customizable to the end user, including out-of-the-box integrations as well as APIs for custom plugin support. Our systems fully manage content using AWS as the back-end cloud provider. Assets are kept in secure buckets and utilize the Kubernetes infrastructure to deliver our product to end users and internal authors. Using the CMS allows for business people to manage content without needing development efforts.
DevOps Engineer at a printing company with 51-200 employees
Real User
2022-10-25T20:36:00Z
Oct 25, 2022
Log aggregation was a key component for us, since we have a fairly old-school app running on VMs on bare metal. We previously didn't have much insight into our logs unless we manually tunneled into each server. The solution is reducing the manual labor of troubleshooting problems in our environments server by server. We also needed to monitor our Java app and MySQL database to understand their problems so that we could take action and resolve them. Our use cases have since expanded to encompass all aspects of monitoring.
Software engineer at a marketing services firm with 501-1,000 employees
Real User
2022-10-25T20:17:00Z
Oct 25, 2022
We use metrics to track the metrics of our application. We use logging to log any errors or erroneous application behavior as well as successful behavior. We use events to log successful steps in our pipeline or failed steps in our deployment. We use a combination of all these features to diagnose bugs. It makes it much more efficient to look at all the data in one place. This speeds up our development speed so that we can be agile.
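The pipeline events mentioned in the review above can be posted with a few lines of code. Here is a minimal sketch using Datadog's official `datadog` Python client; the titles and tags are hypothetical.

```python
# Minimal sketch: post a pipeline-step event to Datadog's event stream.
# Assumes the official "datadog" Python package; names and tags are hypothetical.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Event.create(
    title="Deploy step succeeded: migrate-db",
    text="Pipeline stage 'migrate-db' completed successfully.",
    tags=["pipeline:main", "stage:migrate-db", "env:prod"],
    alert_type="success",  # or "error" for a failed step
)
```

Emitting one event per successful or failed step is what makes it possible to line deploys up against metrics and logs in a single timeline, as the reviewer describes.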
Cloud Engineer at a financial services firm with 51-200 employees
Real User
2022-10-25T20:09:00Z
Oct 25, 2022
After a security incident, we needed to find and migrate to a different cloud provider, and after evaluating different competitors and the skill set of the team, we decided to move to AWS. AWS also enables the team to have finer control over how our apps are deployed and how security and access are managed. By leveraging AWS's functionality, we have increased our application's security and sped up the deployment process. We've even been able to handle higher workloads due to AWS's auto-scaling functionality.
Sr. Director of Software Engineering at Globalization Partners
Real User
2022-10-25T19:58:00Z
Oct 25, 2022
The RUM is implemented for customer support session replays to quickly route, triage, and troubleshoot support issues which can be sent to our engineering teams directly. Customer Support will log in directly after receiving a customer request and work on the issue. Engineers will utilize the replay along with RUM to pinpoint the issue combined with APM and Infra trace to be able to look for signals to find the direct cause of the customer impact. Incident management will be utilized to open a Jira ticket for engineering, and it integrates with ITSM systems and on-call as needed.
Architect at a comms service provider with 10,001+ employees
Real User
2022-10-25T19:45:00Z
Oct 25, 2022
We use the solution primarily for distributed tracing, service insight and observability, metrics, and monitoring. We create custom metrics from outbound service calls to trace the availability of back-office systems. We use the flame graph to get insights into our GraphQL implementation. It helps highlight how resolvers work. However, it's lacking in tracing which GraphQL queries are run, and we use custom spans for that.
The product is used for APM solutions for the metrics and traces for the REST API requests and service maps to understand the upstream and downstream services. We are creating dashboards and widgets to monitor the status. We are creating alerts and monitors as well. We integrated the alerts and ticketing system in our organization with SNOW and Netcool. We are using Kubernetes, AWS, and infrastructure metrics. We are using Kafka and Aurora Postgres logs as well, and we are using HTTP status codes to identify the error types.
Lead Software Engineer at a retailer with 51-200 employees
Real User
2022-10-25T17:33:00Z
Oct 25, 2022
We are trying to get a handle on observability. Currently, the overall health of the stack is very anecdotal. Users are reporting issues, and Kubernetes pods are going down. We need to be more scientific and be able to catch problems early and fix them faster. Given the fact that we are a new company, our user base is relatively small, yet growing very fast. We need to predict usage growth better and identify problem implementations that could cause a bottleneck. Our relatively small size has allowed us to be somewhat complacent with performance monitoring. However, we need to have that visibility.
API Developer at a tech services company with 501-1,000 employees
Real User
2022-10-25T09:29:00Z
Oct 25, 2022
We use the solution for monitoring, logging, and alerts. Thanks to Datadog, we report errors using the logger integrated into our services, which is crucial since we only do unit tests. The infrastructure team handles the monitoring part, so I can't give more insights about that. I am an API developer, so I use Datadog mainly for logging. The alerts are connected to Microsoft Teams in a specific channel, and we pay a lot of attention to it, and we usually create tickets based on these alerts.
We use Datadog for observability and system/application health, mainly for product support, triaging, debugging, and incident responses. We use a lot of the logging and the Datadog agent to collect logs, metrics, and traces from our GKE workloads. We use APM and continuous profiling for latency and performance measurement. We use RUM to observe frontend user events, such as tracing on request and what actions they take before errors occur. We also use error tracking and source maps to debug production failures. We are still relatively new to the product, and we are planning to use more of the notebook functionality and power packs to record run books and break knowledge silos. We also need to utilize dashboards and continuous profiling more for performance measurement and integrate Datadog alerts for incident response.
I'm a Datadog partner in Brazil, and I monitor all my applications with Datadog too. I would like to enable all features in my DPN portal and get access to custom demos. We resell Datadog and a full stack of pre-sales, sales, and post-sales services. We have customers for all sectors, including governmental, financial services, services in general, telecom, et cetera. Today, we are the biggest Datadog partner in Brazil, and we are searching for an expansion in our MSP environment.
Sales Engineer at a tech services company with 201-500 employees
Real User
2022-10-24T03:26:00Z
Oct 24, 2022
The solution is primarily used for better understanding the health of applications, modern environments, and many other solutions, which are the main focus of Datadog and many other monitoring tools. With Datadog specifically, I can look at the health of the technology stack and services, and also integrate multiple metric sources, security, business data, and much more. This makes it a real software solution for centralizing data and unifying monitoring silos in one place. Datadog is like a hub - not just a monitoring software.
Test Engineer at a tech services company with 1,001-5,000 employees
Real User
2022-10-24T03:16:00Z
Oct 24, 2022
We're moving towards the cloud yet still have several active data center contracts. As we move to the cloud, we are interested in knowing more about our services, and DataDog APM/logs should give us this perspective. We currently use the infrastructure monitoring part of DataDog. Still, I've really seen the advantage of moving more data into the cloud for comparison and being able to have one place where we can view all related pieces of information regarding a possible incident or potential issue.
Staff Engineer at a tech services company with 1,001-5,000 employees
Real User
2022-10-24T03:12:00Z
Oct 24, 2022
We are using a mixture of on-prem and cloud solutions to bridge the gap with healthcare entities in the service of providing patients with the medication they need to live healthy lives. Since we're a heavily regulated company, a lot of our solutions grew from on-premises monoliths. However, as we scaled out, it became harder and harder to move forward with that architecture. Today, we're investing heavily in transforming our systems from monoliths into distributed systems. With this change in mind, the ability for us to connect the dots using Datadog has been invaluable.
Security Engineering Manager at a financial services firm with 201-500 employees
Real User
2022-10-24T03:04:00Z
Oct 24, 2022
I use the solution to manage security-related logs and metrics, as well as create detection rules for security events. I am a security engineer, so one area of interest is the CSPM product, giving us the ability to look at findings across the cloud environment. The great part about the Datadog security products is that they incorporate the context of the resources/hosts where the security event is found. This allows us to see exactly what is running on a host that we see as a security alert.
Senior Cloud Engineer, Vice President of Monitoring at a financial services firm with 10,001+ employees
Real User
2022-10-24T02:53:00Z
Oct 24, 2022
We are using the solution for migrating out of the data center. Old apps need to be re-architected. We are planning on moving to multi-cloud for disaster recovery and to avoid vendor lock-in. The migration is a mix between an MSP (Infosys) and in-house developers. The hard part is ensuring these apps run the same in the cloud as they do on-premises. Then we also need to ensure that we improve performance when possible. With deadlines approaching quickly, it's important not to cut corners, which is why we needed observability.
Technical Lead at a wholesaler/distributor with 1,001-5,000 employees
Real User
2022-10-24T02:49:00Z
Oct 24, 2022
We use Datadog for observability and monitoring primarily. Various cross-functional teams have built dashboards, including Developers, QA, DevOps, and SRE. There are also some dashboards created for senior leadership to keep tabs on day-to-day activities like cost, scale, issues, etc. We've also set up monitors and alarms that kick off when any metric goes beyond its threshold. With the Slack and PagerDuty integrations, the correct team members get alerted and react to solve the issue based on various runbooks.
Staff Cloud Engineer at an energy/utilities company with 51-200 employees
Real User
2022-10-24T02:36:00Z
Oct 24, 2022
We are using the solution for migrating out of the data center. Old apps need to be re-architected. We plan to move to multi-cloud for disaster recovery and to avoid vendor lock-in. The migration is a mix between an MSP (Infosys) and in-house devs. The hard part is ensuring these apps run the same in the cloud as they do on-prem. Then we also need to ensure that we improve performance when possible. With deadlines approaching quickly, it is important not to cut corners, which is why we needed observability.
SRE at a financial services firm with 10,001+ employees
Real User
2022-10-24T02:28:00Z
Oct 24, 2022
We primarily use the solution for observability: metrics, logs, tracing, and end-to-end user flow monitoring. We are looking to implement this as a company-wide standard for cloud solutions. At this time, we're in a POC, and we're interested in using either a Datadog agent or the OTel agent with a Datadog exporter. We have dashboards with panels that correlate metrics and allow you to link through to traces, and flame graphs that show latency across services and the various spans. While we are not security-minded, we still require it and are interested in more. It's used for monitoring critical systems.
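For the OTel option the reviewer mentions, a minimal sketch of what the application side could look like with the OpenTelemetry Python SDK, assuming either a local Datadog Agent with OTLP ingestion enabled or an OpenTelemetry Collector running the Datadog exporter on the default OTLP gRPC port; the service name and span are hypothetical.

```python
# Minimal sketch: emit OTLP traces toward a local Datadog Agent (OTLP ingest
# enabled) or an OpenTelemetry Collector configured with a Datadog exporter.
# Requires opentelemetry-sdk and opentelemetry-exporter-otlp; names are hypothetical.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "poc-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("end-to-end-user-flow"):
    pass  # instrumented work goes here
```

The appeal of this route is that the instrumentation stays vendor-neutral; only the agent or collector configuration decides that the spans end up in Datadog.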
Senior Manager at a manufacturing company with 10,001+ employees
Real User
2022-10-24T02:20:00Z
Oct 24, 2022
This solution is for physical device monitoring across breweries, including PLCs, HMI Cameras, RFID panels, scales, etc. We want to gain visibility into these devices to influence predictive maintenance and unscheduled downtime. We want to monitor physical devices across the zone from a control tower perspective for end users and support teams alike. Understanding more about the performance of the devices and mechanical components will allow us to schedule downtime to fix imminent catastrophic failures and prevent unplanned downtime and lost revenue.
Software Developer at a pharma/biotech company with 51-200 employees
Real User
Top 20
2022-10-20T16:33:00Z
Oct 20, 2022
We're currently using logging, monitoring, metrics, APM, etc. We've started to use SLOs. However, it takes a bit of time to work through those. RUM (Real User Monitoring) has been very useful. I have used this in the past to debug problems in production, which has been great. We also want to start using synthetics and tracing more. Our application currently runs in many different environments based on our customers' requirements. This allows us to see everything in one place and still filter by the environment as required, which is extremely useful.
We primarily use the solution for the service catalog. We use this type of offering for our microservices applications, and it gives a good view of the flow. It is a must when we have different developers working on different services. Having the trace and log features is useful for the on-call person to locate the right microservice. We would like to see some more useful applications for health monitoring, where we can customize the cases based on data from the database. It needs the facility to monitor data inside tables and the status of the UI.
I primarily use the solution to learn, watch, and monitor business and engineering metrics in the production and QA environments of my team. We create monitors on key business metrics and observe regressions and anomalies. Less often, I leverage the events capability in Datadog to get notified about significant activities happening in my team's deployments. We learn about Datadog monitor alerts through Slack and often attempt to create SLOs using Terraform. We use APM for observability. Most recently, I learned about Watchdog alerts, which I will be looking into heavily.
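The review mentions defining SLOs with Terraform; the same SLO-as-code idea can be sketched with Datadog's official `datadog` Python client instead. A minimal sketch, assuming a metric-based SLO; the queries, targets, and tags are hypothetical, and the exact fields may vary by client version.

```python
# Minimal sketch: a metric-based SLO defined as good events / total events.
# Assumes the official "datadog" Python package; all names are hypothetical.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.ServiceLevelObjective.create(
    type="metric",
    name="Checkout availability",
    description="Share of non-5xx checkout requests.",
    tags=["team:payments"],
    thresholds=[{"timeframe": "30d", "target": 99.9, "warning": 99.95}],
    query={
        "numerator": "sum:requests.ok{service:checkout}.as_count()",
        "denominator": "sum:requests.total{service:checkout}.as_count()",
    },
)
```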
We primarily use the solution for charting application metrics. We use it for all our application metrics, host metrics, and monitors with a PagerDuty integration. We integrate our application logs. It is great to be able to tie our metrics and our traces together. We use the APM module with traces. It is great to be able to link APM, logs, and metrics in one go, as it shortens our troubleshooting and RCA dramatically. We are loving the tool; it is great to have all those insights in one place. We hope they keep making my life and our engineers' lives easier.
Director Of Software Development at Major League Baseball
Real User
2022-10-19T13:22:00Z
Oct 19, 2022
We primarily use the solution for monitoring and telemetry. We use lots of log collections, log-based metrics, and dashboard visualization. The logging, metrics, and APM are vital.
We share dashboards, set up alerts, and monitor everything that happens in our system. We use it in staging, features, production, and our load test environment. It is exceptionally helpful for making our engineering more data-driven. I came from a company that believes we should focus on being telemetry driven. Instilling this in a smaller, less mature engineering organization has been challenging. However, it is much easier while using Datadog.
AWS Cloud Architect Consultant at a manufacturing company with 10,001+ employees
Real User
2022-09-19T20:07:35Z
Sep 19, 2022
Our company deploys the solution for our customers as an observability tool to define SLOs and SLIs along with logs and metrics. The solution includes incident, post-mortem, and root cause analysis that provides a source of truth for incidents and issues with applications. We have SREs and teams in operations, management, and applications who all have access to the solution and ensure proper integrations.
Senior Manager - Cloud & DevOps at Publicis Sapient
Real User
2022-02-20T17:26:13Z
Feb 20, 2022
My customers were using Datadog for monitoring purposes. They were using it only because the solution runs on AWS and is microservices-based. They were using an application called Dynatrace for their logs.
AWS Cloud Architect Consultant at a transportation company with 10,001+ employees
Real User
2022-02-09T19:11:59Z
Feb 9, 2022
We are evaluating Datadog for the observability and monitoring requirements that we have in our company. In our use case, our intention is to provide a framework for multiple app teams to use the tool for our observability and engineering practices.
Senior Manager, Cyber Digital Transformation at a security firm with 1,001-5,000 employees
Real User
2022-02-08T21:47:54Z
Feb 8, 2022
We have used this solution primarily for application performance monitoring. To do this, we needed to make sure we had the right data in the system so that people could be able to monitor their applications end-to-end.
Head of Digital & Cognitive Services at a tech company with 11-50 employees
Real User
2021-03-23T20:06:32Z
Mar 23, 2021
We use it for monitoring and instrumentation of security. We secure our databases and servers. It is typically for the security of apps, services, and systems. We are using its latest version.
We mostly use it to handle log aggregation, monitor our web application, and alert us on data pipeline failures. Our system is fully on AWS, and so we pipe in all of our Cloudwatch logs into Datadog to have a central place to index and search logs. Our web app is built on an Elastic Beanstalk backend, and we use the Datadog agent to keep track of all of the requests that hit our backend and all of their components. We also use the prebuilt AWS pipeline dashboards to monitor our batch jobs and lambdas.
We primarily use the solution for log monitoring across our entire cloud infra (EB, EC2, Batch, and Lambda). This is in addition to RStudio Workbench, which has its own logs that would not be picked up via CloudWatch (docs.rstudio.com). We own several dozen of these servers, and we used to manage instance logs by tailing them when incidents occurred. Datadog allows for much better visibility across our entire fleet and has saved us countless hours.
Site Reliability Engineer at a financial services firm with 1-10 employees
Real User
2022-10-05T09:22:08Z
Oct 5, 2022
Our company is transitioning to using the solution for monitoring and analytic services we provide to customers. Once fully rolled out, there will be 80-100 users companywide.
Senior Engineer at an educational organization with 5,001-10,000 employees
Real User
2022-08-15T10:42:13Z
Aug 15, 2022
Datadog is a SaaS solution we tried for URL and synthetic monitoring. You record a transaction going into a website and replay that transaction from various locations. Datadog is mainly used by the admin, but three or four other guys had access to the reports and notifications, so it's five altogether. We probably tried no more than 8 percent of what Datadog can do. There are so many other bits and modules. I've only gone into about half of what APM can do in the Datadog stack.
One of the things we use it for is the same thing that we use FullStory for, which is to replay customer interactions with our platform. However, it also does the monitoring. It's like monitoring cloud tools. We're really mostly monitoring our own software to make sure that everything is functioning properly. We can check a bunch of things, and we can even play back customer sessions. It’s basically monitoring our application.
IT Test Manager at a transportation company with 10,001+ employees
Real User
2022-03-29T15:58:56Z
Mar 29, 2022
Our primary use case is log management and we also use the solution for monitoring the application and underlying infrastructure. I'm an IT test manager.
Performance Testing Manager at a tech services company with 10,001+ employees
Real User
2021-12-27T19:28:28Z
Dec 27, 2021
I'm not sure which version we're using, although I believe it to be the latest. We essentially use the solution in advance of performance testing, performance monitoring, and troubleshooting.
Security Analyst at a tech services company with 11-50 employees
Real User
2021-06-01T14:22:04Z
Jun 1, 2021
We are currently testing it. If the testing goes well, we'll purchase the full version, and it will probably be our main monitoring tool. We plan to use it for monitoring our activities and the attacks on our systems or network.
Our clients use it for monitoring applications. Its deployment depends on our customer's use case. It is 100% cloud. We have got a multi-tenant environment, so we segment it out.
Principal Enterprise Systems Engineer at a healthcare company with 10,001+ employees
Real User
2021-02-19T21:40:43Z
Feb 19, 2021
We deploy agents on-premises to collect data on on-premises VM instances. We don't use Datadog in our cloud network, although we do have some cloud apps that we have it on, and we also have containers. We have it at their headquarters; the main software for them is on their own cloud. We're building out the process now and learning to use it better. We plan to use Datadog for root cause analysis relating to any kind of issue we have with software: applications going down, latency issues, connection issues, etc. Eventually, we're going to use Datadog for application performance monitoring and management, to be proactive around thresholds, alerts, bottlenecks, etc. Our developers and QA teams use this solution. They use it to analyze network traffic, load, CPU usage, tracing, NPM, and API calls for their applications. There are roughly 100 users right now. Maybe there are 200 total, but on a given day, maybe 13 people use this solution.
Senior Manager, Site Reliability Engineering at Extra Space Storage
Real User
Top 20
2021-01-25T19:36:00Z
Jan 25, 2021
We primarily use Datadog for logs, APM, infrastructure monitoring, and lambda visibility. We have built a number of critical dashboards that we display within our office for engineers to have a good understanding of the application performance, as well as business partners to understand at a high level the traffic flowing through the app. We started with logging, as our primary monitor, and have shifted to APM to get a deeper understanding of what our system is doing, and how the changes we are making impact the apps.
We primarily use Datadog for performance and log monitoring of cloud environments, which include VMs and Azure services like Azure compute, storage, network, firewall, and App Services via Event Hubs. We alert based on monitors via Teams and PagerDuty. We collect logs for Azure services like Azure Database, Azure Application Gateway, Azure AKS, and other Azure services. We build custom metrics using a Python script to collect metrics for components not natively supported by Datadog, and we run synthetic testing to ensure uptime, plus browser tests via the CI/CD pipeline.
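The custom-metrics script mentioned above could look roughly like this minimal DogStatsD sketch; it assumes the official `datadog` Python package and a local Datadog Agent listening on the default StatsD port, and the metric names, values, and tags are hypothetical.

```python
# Minimal sketch: push custom metrics for a component Datadog doesn't cover natively.
# Assumes the official "datadog" package and a local Agent on UDP port 8125;
# metric names, values, and tags are hypothetical.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

statsd.gauge("azure.legacy.queue_depth", 42, tags=["env:prod", "component:legacy"])
statsd.increment("azure.legacy.jobs_processed", tags=["env:prod", "component:legacy"])
```

Because the Agent relays the metrics, the script itself needs no API keys, and the same tags can be reused in the monitors and dashboards the reviewer describes.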
We use Datadog as a monitoring platform to achieve visibility into our container environments. Almost all of our workloads are containerized and with DataDog, we are able to get metrics, logs, alerts, and events about all the containers that we are running. Our developers also extensively use APM to find and diagnose performance issues that might appear. We use Terraform to automatically create all of the necessary monitors and dashboards that our developers need to make sure that our level of service is sufficient.
We primarily use Datadog for the monitoring of EC2 and ECS containers running mostly Rails applications that host a SaaS product. We also monitor ElasticSearch and RDS, and we are working on adding their Application Performance Monitoring solution to monitor our applications directly. We use DataDog to create dashboards, graphs, and alerts based on interesting metrics. DataDog is our first place to look to find the performance of our system. We also use their logging platform and it works well. Especially useful is that the logs and metrics are tightly integrated so you can jump between them easily.
Our primary use of Datadog includes: * Keeping a close eye on our AWS resources. Monitoring our multiple RDS and ElastiCache instances plays a big role in our indicators. * Kubernetes. We aren't using all of the available Kubernetes integrations, but the few of them that work out of the box add great value to our metrics. * Monitoring and alerting. We wired our most relevant monitors and alerts to services like PagerDuty, and for the rest of them, we keep our engineers up to date with constant Slack updates.
We were in need of a cloud monitoring tool that was operationally focused on the AWS platform. We wanted to be able to responsibly and effectively monitor, troubleshoot, and operate AWS, including servers, networking, and key AWS services. We wanted tooling that highlights and detects problems and anomalies, provides best-practice recommendations, and expedites root-cause analysis and performance troubleshooting. Datadog provided us the ability to monitor our cloud infrastructure (network, servers, storage), platform/middleware (databases, web/application servers, business process automation), and business applications across our cloud providers.
We are a solution provider and Datadog is one of the products that I was working on with one of my clients. They are currently evaluating it for use in cloud monitoring. Specifically, Datadog is used for monitoring cloud applications in terms of performance. The logs come into this solution from AWS and it provides dashboards for various environments.
If our app is up and running, we use it to monitor how many credits the app is using on each node. We also monitor services by how long each call takes, using the EC2 instances backing the application.
The primary use case is application monitoring. We also use it to set custom metrics and watch our AWS metrics, as well as data. At my current job, I have only used it for a couple of months. However, I used it for a few years at a previous company.
We are using the infrastructure and app monitoring side, such as process monitoring. We are using it in a very traditional way. We are not using the APM capabilities. When it comes to something like containers, we will generally use it on the host but not inside the container itself. We are using it with our customers and in-house day-to-day.
We use the solution to monitor production service uptime/downtime, latency, and log storage. Our entire monitoring infrastructure runs off Datadog, so all our alarms are configured with it. We also use it for tracing API performance and identifying the biggest regression points. Finally, we use it to compare our performance on SEO metrics against competitors. This is a primary use case, as SEO dictates our position in Google search traffic, which drives a large portion of our customer views, so it is a vital part of the business that we rely on Datadog for.
We have several teams and several different projects, all working in tandem, so there are a lot of logs and monitoring that need to be done. We use Datadog mostly for alerting when things go down. We also have several dashboards to keep track of critical operations and to make sure things are running without issues. The Slack messaging is essential in our workflow in letting us know when an alert is triggered. I also appreciate all the graphs you can make, as it gives our team a good overview of how our services are doing.
We currently have an error monitor watching errors in our prod environment. Once we hit a certain threshold, we get an alert in Slack. This helps us address issues the moment they happen, before our users notice. We also utilize synthetic tests on many pages of our site. They're easy to set up and are great for pinpointing when a shipped bug takes down a less-visited page that we wouldn't immediately be aware of otherwise. It's a great extra check to make sure the code we ship is free of bugs.
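For readers wanting to see what a threshold-based error monitor with a Slack notification might look like, below is a hedged sketch using the datadogpy API client. The query, threshold, and @slack- handle are hypothetical placeholders, not this reviewer's actual configuration.

```python
# Sketch: create a metric-alert monitor that notifies a Slack channel when the
# 5-minute error count crosses a threshold. Requires the `datadog` package.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="metric alert",
    query="sum(last_5m):sum:trace.web.request.errors{env:prod}.as_count() > 100",
    name="[prod] Error count above threshold",
    # An @slack- handle routes the notification to a configured Slack channel.
    message="Error count exceeded 100 in 5 minutes. @slack-prod-alerts",
    options={"thresholds": {"critical": 100}, "notify_no_data": False},
)
```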
Our company has a microservice architecture, with different teams in charge of different services. It is also a startup, which means that we have to build and move very fast. Before we were properly using Datadog, we often had things breaking without much information on where in our system the breakage happened. This was quite a big time sink, as teams were unfamiliar with other teams' code and needed their help to debug, which slowed our building down a lot. Implementing Datadog traces fixed this.
Our primary use case for this solution is comprehensive cloud monitoring across our entire infrastructure and application stack. We operate in a multi-cloud environment, utilizing services from AWS, Azure, and Google Cloud Platform. Our applications are predominantly containerized and run on Kubernetes clusters. We have a microservices architecture with dozens of services communicating via REST APIs and message queues. The solution helps us monitor the performance, availability, and resource utilization of our cloud resources, databases, application servers, and front-end applications. It's essential for maintaining high availability, optimizing costs, and ensuring a smooth user experience for our global customer base. We particularly rely on it for real-time monitoring, alerting, and troubleshooting of production issues.
We have a tech stack with all backend services written (mostly) in TS/Node, and as a full-stack engineer, it is crucial to keep track of new and existing errors. Our logs have been consolidated in Datadog and are accessible for search and review, so the service has become a daily tool for my work. More recently, session replay has been adopted at my company, but I do not like it much because the UI elements are not rendered in their proper places, so it is very hard to see what users of the web app are actually clicking on.
Datadog is mainly used to set up alerts and thresholds to monitor real-time metrics and checks.
We use Datadog for monitoring to get the traces and logs of all our applications. Datadog provides dashboard and alert capabilities so the various teams can identify if something is wrong. More than 200 users, mostly software engineers, work with Datadog.
Our primary use case is using the dashboards and getting proper insights from them. Monitoring, SLOs, and SLAs have been better and easier since we started managing our infrastructure with Terraform. APM has been easier since we enabled it directly through a CronJob. Profiling has been made easier and gives us many good insights into the code. Logs are the most valuable part and the best solution so far; Datadog can help solve slow queries or database-related errors.
We mainly use the product to monitor our infrastructure and apps. It is the go-to tool when we want to check that things are running properly. We use Datadog synthetic monitors to ensure our app works across different locations in the United States. We also have set up Datadog monitors to send alerts if things stop working as expected. We use Continuous Integration Pipeline visibility to make sure our developers are not being blocked by infrastructure and other things that might be out of their control.
Observability is a key use case, as is security.
We're currently using logging, monitoring, metrics, APM, etc. We've started to use SLOs; however, it takes a bit of time to work through those. RUM has been very useful. I have used it in the past to debug problems in production, which has been great. We also want to start using synthetics and tracing more. Our application currently runs in many different environments based on our customers' requirements. This allows us to see everything in one place and filter by environment as required, which is extremely useful.
The main use case is observability and reliability as part of a platform/delivery engineering solution. We use the product to assist tenants and clients within the company to get more ramped up on SRE/DevOps.
We primarily use the product for tracing, metrics, and alarms in various deployment environments.
We primarily use the solution for logging and APM, and for real user metrics.
We use Datadog to monitor our Kubernetes clusters. We have three different clusters for different parts of the SDLC. We run the Datadog agent DaemonSet as well as the Datadog cluster agent. Our services have APM installed by default. To create monitors, we use Terraform; this is provided out of the box for our service owners. Our clusters run on EKS, so we also make use of some of the AWS monitoring capabilities that can be integrated into Datadog. We are hugely reliant on Datadog for all aspects of our system.
We primarily use the solution for monitoring and log analysis.
We primarily use the solution for application monitoring (APM, logs, metrics, alerts). It's useful for active monitoring (static monitors, threshold monitors). We get a lot of value out of anomaly detection as well. SLOs and monitoring of SLOs have been another value add. In terms of metrics, the out-of-the-box infrastructure metrics that come with the Datadog agent installation are great. We have made use of both the custom metrics implementation and the log-based metrics, which are extremely convenient. We also leverage Datadog for RUM and want to explore session replay.
We primarily use Datadog for alerts. If we're running out of database connections or CPU credits, we want to find out in Slack, and Datadog provides nice features for that. Secondarily, we use Datadog for analyzing historical trends and forecasting potential issues. I'm trying to learn how to add the Continuous Profiler to our primary backend servers and set up synthetic tests for monitoring our front end. Everything is mostly on AWS, and the Datadog integrations help a ton.
We are providing managed services to our customers across multiple industries. Datadog is key to delivering these services by bringing the observability, monitoring, and alerting capabilities we need to operate at scale. We operate custom cloud-native workloads as well as ISV products such as Atlassian Jira or Confluence. Integrating Synthetics, infrastructure, and application performance monitoring, as well as piping all logs through Datadog, allows us to do more with less, with good alerting right on time.
Datadog provides us with a solution for ingesting all of our application metrics, resource metrics, APM/tracing data, etc. We use it for dashboards, monitoring/alerting, SLO targets, incident response, etc. We have a lot of applications across multiple languages/frameworks, deployed in Kubernetes across multiple regions in AWS, along with underlying managed resources such as SQS, Aurora, etc. Datadog makes understanding the state of these seamless. We are a company with millions of daily active users, and this level of detail is excellent.
The product is primarily used for the DevOps team.
We use the solution for application hosting and a little bit of everything when it comes to supporting a worldwide logistics tracking service. It's used as a central service for collecting telemetry and logs. We find it does the same work as all of our old tools combined, including Prometheus, Kibana, Google Logs, and more; putting all of this information in a single platform makes it easy to corroborate information and associate a request with its data, which might otherwise be lost when it is saved as logs.
We use the solution for monitoring time spent on views and events triggered. For example, for one of our products, we have created a custom dashboard that lets us track all the custom events and multiple entry points in the same part of the application. Knowing the entry point helps us choose which part of the program should be improved next. It also helps us with collecting important data about the overall usage of each module within our application.
We collect all data logs from all operating systems, such as Windows, Linux, VMware, and bare-metal data centers. We also automate the installation of the agent on servers. Now we are starting a POC to analyze the APM module. The next step in the future is to do a POC of the security modules. The final idea is to have a single portal for observability. This will make troubleshooting easy for level 1 and level 2 support.
We use the solution for logs from all our applications. For log monitoring in Datadog, our team has created automation to implement large-scale logging across all our systems. Now, we are deploying it in our core systems.
We use different tools for log collection and monitoring. Using Datadog will combine different use cases into one product that will be easier to manage. The tools we use are open-source, so there is no commercial support. Having customer support would be ideal since we're a small team. Profiling would be another great feature to have. Currently, it's manual. Having Datadog would give us a standard, and we don't have to do much manual work.
We use the solution for monitoring our logs across distributed clusters. Right now, we have an Elasticsearch solution that is tied to each platform (our product is a PaaS solution). We are looking at moving to a single-pane-of-glass solution, which Datadog would be good for (plus, we could wrap up other tools like Prometheus, Grafana, PagerDuty, Pingdom, and more). We want to be able to have Datadog running on one single cluster and ingesting and processing logs from all our distributed clusters.
Our use case is mainly deploying into our applications for monitoring/logging observability. Our microservices currently feed into an actuator that exists in each instance of our application, which extends to a local and central Grafana for client and internal visibility. Logging captures application and system logs that are ported to each application instance for querying. Whenever anything occurs that is considered unhealthy by a range of health checks, we have notification rules configured internally and externally for a prompt response.
We use this solution to monitor our Kubernetes clusters, nodes, deployments, daemon sets, replica sets, and pods.
The main use cases are to provide visibility to costs for each product in the company as well as to consolidate all the observability in one tool. We are moving the team from being an operational team that needs to keep the tool up and running (applying patches and resolving problems) to a team that is focused on providing meaningful visibility of the systems, applications, and services of the company. We want to add value where the developers and the systems administrators are not able to focus.
We use the solution for testing all of our application's endpoints, making sure that they work consistently.
I am using the solution for monitoring metrics, logs, traces, etc. It's mainly for making dashboards as well as monitoring our services. We also use Datadog to help centralize our incident management to show the logs, where issues spiked, and some metrics. We use Datadog to do troubleshooting in Kubernetes, specifically in our Azure Kubernetes service. Beyond that, we are looking to use open telemetry in tandem with Datadog to further our log-tracing efforts. In the future, this may be expanded.
I have been using Datadog products and capabilities increasingly over the last 4 years, from POC to widespread adoption. The capabilities we use are unique to each use case and can be combined in various ways to provide the full observability coverage needed to maintain stable operations and shift from being reactive to proactive. Our organization uses site/service reliability monitoring for a range of backend and frontend services, custom monitoring, and dashboards that can be dynamic and reused by multiple teams.
We use real user monitoring and have set up thresholds for alerts to PagerDuty, Sentry, Slack, and so on. We also have dashboards set up for tracking latency and error rates. As an individual contributor, I also try to set up dashboards for the individual feature projects I work on. I'd like to learn more ways to use this, though, especially when it comes to more proactive approaches to issues. A starter pack of common use-case types would be nice.
Our use case is to provide cloud organization application monitoring. I use it for insight into what host in what region has activity or what market is using Datadog to its fullest potential and utilizing that for cost. This may also help determine who is using monitoring and setting alerts or just setting up monitoring and not doing anything about it. The use case can also be to check when the host or applications are down, or if the usage of CPU, memory, etc, is too high.
We use Datadog to view and aggregate logs and monitor all of our services. We have a lot of running infrastructure, and it is very convenient to have logs and metrics all aggregated somewhere we can view and chart them. I use Datadog to create dashboards, runbooks, and sharable graphs, which really help out my whole team. We mostly use logs and APM but have started to use other products. I would like to use more synthetic monitors.
We use the application for our application monitoring, data security monitoring, and log management. What we like about it is that it helps us to track issues proactively instead of reactively. There are other improvements we would like to see: 1) being able to restrict users from seeing or viewing specific dashboards once they log in, and 2) cutting down the prices for Cloud SIEM. It seems very useful; however, the prices are high, and some organizations find it difficult to make a decision on getting the tool.
We currently use it for log aggregation and SIEM. We send logs from our AWS account (particularly our CloudTrail and S3 logs) and use them to give us security signals. This has helped with our SOC 2 certification process and has given us a window into our processes and the security holes in our system. We are also considering using the APM features to help with our development effort. We want to be able to profile all of our code and see what is going on with it.
We deploy various services for our main platform on AWS across multiple regions. We have a development environment, a staging environment, a QA environment, and a production environment. We deploy our many services across hundreds of instances. We have many server farms, all responsible for various services on our market intelligence platform. The deployment of each server farm, or even individual instances, varies depending on what was stood up. We have instances built in three different ways, with two different pipelines, and some even on user data scripts.
We use the product for recording logs on our various services across different teams. For example, we use info logs to keep track of events and error logs to catch exceptions. When users ask us to investigate a situation, we use logs to keep track of events and where the user's code traveled to. We also use the synthetic testing and monitoring features to keep track of our many alerts in the production and QA environments.
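For illustration, the info/error logging pattern described above can be sketched with nothing but the Python standard library: log lines are written to a file, and a Datadog agent can then be configured to tail that file. The path, logger name, and fields are assumptions for the example.

```python
# Sketch: info logs for the event trail, error logs (with stack traces) for
# exceptions, written to a file a Datadog agent could be configured to tail.
import logging

logging.basicConfig(
    filename="/var/log/myapp/app.log",  # hypothetical path the agent tails
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("orders")

def place_order(order_id: str) -> None:
    log.info("order received id=%s", order_id)  # info log: event trail
    try:
        raise ValueError("payment declined")    # stand-in for real work
    except ValueError:
        # exception() logs at ERROR level and includes the traceback.
        log.exception("order failed id=%s", order_id)
```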
We use Datadog for general observability into our infrastructure, as well as running analytics queries for our SLI/SLO platform. This helps all of our teams be informed of how well their products are actually performing in production, and aim their efforts at the thing that will provide the highest ROI. We also use it for general monitoring and alerting during load tests and service releases to detect any issues related to the deployments. This helps us maintain our high contractual uptime promises to our clients.
We use it to monitor and alert on our ECS instances as well as other AWS services, including DynamoDB, API Gateway, etc. We have it connected to PagerDuty for alerting across all our cloud applications. We also use custom RUM monitoring and synthetic tests for both our internal and public-facing websites. For our cloud applications, we can use Datadog to define our SLOs and SLIs and generate dashboards that are used to monitor SLOs and report them to our senior leadership.
We are using the solution for scaling up the website for market data applications. EC2 and Datadog have enabled high-level monitoring of the underlying infrastructure and services. The Datadog profiler comes in handy to pinpoint issues with resource utilization during peak hours, and trace/log management helps narrow down the root cause. The network map is crucial in identifying bottlenecks and determining what needs more attention. The host map helps identify problematic hardware and devise ways to counter issues that arise when scaling and deploying solutions on the cloud.
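The profiler usage mentioned above can be sketched as follows, assuming the ddtrace Python library; the service name, environment, and version tags are hypothetical.

```python
# Sketch: enable Datadog's continuous profiler in-process with ddtrace.
# Profiles are sampled in the background and shipped via the local agent.
from ddtrace.profiling import Profiler

prof = Profiler(
    env="prod",                 # illustrative tags; match your agent setup
    service="market-data-web",
    version="1.4.2",
)
prof.start()

# ...the application then runs as usual; no per-function instrumentation needed.
```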
We are using Datadog for server metrics, log aggregation and searching, system monitoring, alerting the team about errors, and dashboards for our developers. It's used by the Site Reliability Engineering team and Management of all levels. It's assisting us in proving SOC II compliance. We're looking to improve our usage of Datadog's RUM and APM components to get better and more performance insights on our production environments. We're also looking to leverage more synthetic monitors and runbooks for anyone responding to incidents.
We use it mostly for logging log messages from our Kubernetes and EC2 instances, for example, system messages and errors. Also, we want log messages from our firewalls and other network infrastructure in case of network issues. We intend to use it for application logging, et cetera, to get insight into internal problems in the applications in Kubernetes pods. We want to use it for monitoring in case of system problems and hardware failures so that it can notify us.
We primarily use the solution for monitoring applications and informing customers via PagerDuty and Statuspage. The monitoring and alerts can be personalized internally, and we are able to find problems and issues. The response time monitor has been great, and it has been validating upgrades. We can check in to see which step fails.
We primarily use the solution for log management and application performance monitoring. We have been getting into using more solutions on Datadog, such as runbooks, monitoring, and dashboards. Another area that we've been investing some time in is the database monitoring. We've been able to get some relatively new employees onboarded into the tool, and they've been able to create some meaningful dashboards and reports without too much hand-holding at all. We plan on exploring the synthetics solution as well.
We use Datadog for application logs, error tracking, performance tracking, alerting, and overall production state surveillance. It helps us improve observability and ease of maintenance through better information for our support teams and their issue qualification. We also use dashboards to keep all the information at the ready and easy to access, and SLOs, notably for our uptime but also for feature usage. It also feeds our alerting for our on-call SREs into PagerDuty by launching alerts when specific parameters are exceeded.
We use Datadog for three main use cases:
* Infrastructure and application monitoring: ensuring that our services are available and performant at all times. This allows us to proactively address incidents and outages without customers contacting us. It includes monitoring of cloud resources (databases, load balancers, CPU usage, etc.), high-level application monitoring (response times, failure rates, etc.), and low-level application monitoring (business-oriented metrics and functional exceptions to the customer experience).
* Analyzing application behavior, especially around performance. We often use Datadog's application performance monitoring on non-production environments to evaluate the impact of newly introduced features and gain confidence in changes.
* End-to-end regression testing for APIs and browser-based experiences. Using Datadog's synthetic testing, we periodically check that the system behaves in exactly the correct way. This is often used as a canary to detect issues even before users reach them organically.
We primarily use the solution for RUM, security monitoring, and streams. We need to monitor users and what they access. We also need to identify security loopholes and attack patterns, and identify and quickly respond to issues. We can identify pushbacks and get insight into how application components stack up with each other, and we can understand which components, libraries, and code to alert teams about. Using Datadog, we can raise incidents, track them to completion, and gather data for reporting and post-mortems. The solution allows us to track fixes and their test coverage. With it, we gain confidence in the fix/improvement phase and are able to provide a response.
We have deep integration with Datadog for observability and monitoring. We use everything from APM, logs, and RUM to monitor and dashboards for tracking system health. We are trying to move from many different solutions for error tracking/observability to a single platform (Datadog). We are currently in the process of setting up logging in Datadog in order to maintain our logs better. We are looking to create more insights into the real user flows by using real user monitoring (RUM) too.
We primarily use Datadog for:
* Native memory
* Logging
* APM
* Context switching
* RUM
* Synthetics
* Databases
* Java
* JVM settings
* File I/O
* Socket I/O
* Linux
* Kubernetes
* Kafka
* Pods
* Sizing

We are testing Datadog as a way to reduce our operational time to fix things (mean time to repair). This is step one. We hope to use Datadog as a way to be proactive instead of reactive (mean time to failure). So far, Datadog has shown very good options to work on all of our operational and development issues. We are also trying to use Datadog to shift left and fix things before they break (MTTF increase).
We use Datadog incident management for our incident tooling. Whenever we run into an incident, we try to use it. It allows us to create a separate Slack channel for it.
We use the solution primarily for platform monitoring of the services that are deployed in AWS. It gives us a better way to monitor the services, including pods, cost, high availability, etc. This way, observability is ensured and customer services are uninterrupted. We also host data pipelines between the cloud and on-prem, for which Datadog is used to ensure better service. We report issues based on the metrics reported through it.
We ingest data from various sources to monitor the system's log metrics and enable an alert mechanism that notifies teammates if something goes wrong. More specifically, having Datadog agents as integrations with different services provides easy access and management.
We use an enterprise version of a CMS platform that enables businesses to transmit content to their customers. The tool is fully customizable for the end user, including out-of-the-box integrations as well as APIs for custom plugin support. Our systems fully manage content using AWS as the back-end cloud provider. Assets are kept in secure buckets, and we utilize Kubernetes infrastructure to deliver our product to end users and internal authors. Using the CMS allows business people to manage content without needing development effort.
Log aggregation was a key component for us, since we have a fairly old-school app running on VMs on bare metal. We previously didn't have much insight into our logs unless we manually tunneled into each server. The solution is reducing the manual labor of troubleshooting problems in our environments server by server. We also needed to monitor our Java app and MySQL database to understand their problems so that we could take action and resolve them. Our use cases have since expanded to encompass all aspects of monitoring.
We use metrics to track the state of our application. We use logging to log any errors or erroneous application behavior as well as successful behavior. We use events to record successful steps in our pipeline or failed steps in our deployment. We use a combination of all these features to diagnose bugs. It makes it much more efficient to look at all the data in one place. This speeds up our development so that we can be agile.
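As a hedged sketch of the pipeline-events pattern this reviewer describes, the snippet below posts success/failure events to the Datadog event stream with datadogpy; the titles, tags, and alert types are illustrative assumptions.

```python
# Sketch: emit a Datadog event for each pipeline step, colored by outcome.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

def emit_step_event(step: str, ok: bool) -> None:
    api.Event.create(
        title=f"deploy step {'succeeded' if ok else 'failed'}: {step}",
        text="Emitted by the deployment pipeline.",
        tags=["pipeline:deploy", f"step:{step}"],
        alert_type="success" if ok else "error",  # sets the event's severity
    )

emit_step_event("db-migration", ok=True)
```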
After a security incident, we needed to find and migrate to a different cloud provider, and after evaluating different competitors and the skill set of the team, we decided to move to AWS. AWS also enables the team to have finer control over how our apps are deployed and how security and access are managed. By leveraging AWS's functionality, we have increased our application's security and sped up the deployment process. We've even been able to handle higher workloads due to AWS's auto-scaling functionality.
The RUM is implemented for customer support session replays to quickly route, triage, and troubleshoot support issues, which can be sent to our engineering teams directly. Customer support will log in directly after receiving a customer request and work on the issue. Engineers utilize the replay along with RUM, combined with APM and infrastructure traces, to look for signals and pinpoint the direct cause of the customer impact. Incident management is utilized to open a Jira ticket for engineering, and it integrates with ITSM systems and on-call as needed.
We use the solution primarily for distributed tracing, service insight and observability, metrics, and monitoring. We create custom metrics from outbound service calls to trace the availability of back-office systems. We use the flame graph to get insights into our GraphQL implementation. It helps highlight how resolvers work. However, it's lacking in tracing which GraphQL queries are run, and we use custom spans for that.
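The custom-span workaround mentioned for GraphQL visibility might look like the following sketch with the ddtrace tracer; the span name, service, and tag key are assumptions for illustration.

```python
# Sketch: wrap each GraphQL execution in a custom span so the query name
# shows up as the span's resource in the flame graph.
from ddtrace import tracer

def execute_graphql(query_name: str, run_query):
    with tracer.trace("graphql.execute", service="api", resource=query_name) as span:
        span.set_tag("graphql.query_name", query_name)  # searchable tag
        return run_query()
```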
The product is used for APM solutions for the metrics and traces for the REST API requests and service maps to understand the upstream and downstream services. We are creating dashboards and widgets to monitor the status. We are creating alerts and monitors as well. We integrated the alerts and ticketing system in our organization with SNOW and Netcool. We are using Kubernetes, AWS, and infrastructure metrics. We are using Kafka and Aurora Postgres logs as well, and we are using HTTP status codes to identify the error types.
We are trying to get a handle on observability. Currently, the overall health of the stack is very anecdotal. Users are reporting issues, and Kubernetes pods are going down. We need to be more scientific and be able to catch problems early and fix them faster. Given the fact that we are a new company, our user base is relatively small, yet growing very fast. We need to predict usage growth better and identify problem implementations that could cause a bottleneck. Our relatively small size has allowed us to be somewhat complacent with performance monitoring. However, we need to have that visibility.
We use the solution for monitoring, logging, and alerts. Thanks to Datadog, we report errors using the logger integrated into our services, which is crucial since we only do unit tests. The infrastructure team handles the monitoring part, so I can't give more insights about that. I am an API developer, so I use Datadog mainly for logging. The alerts are connected to Microsoft Teams in a specific channel, and we pay a lot of attention to it, and we usually create tickets based on these alerts.
We use Datadog for observability and system/application health, mainly for product support, triaging, debugging, and incident responses. We use a lot of the logging and the Datadog agent to collect logs, metrics, and traces from our GKE workloads. We use APM and continuous profiling for latency and performance measurement. We use RUM to observe frontend user events, such as tracing on request and what actions they take before errors occur. We also use error tracking and source maps to debug production failures. We are still relatively new to the product, and we are planning to use more of the notebook functionality and power packs to record run books and break knowledge silos. We also need to utilize dashboards and continuous profiling more for performance measurement and integrate Datadog alerts for incident response.
I'm a Datadog partner in Brazil, and I monitor all my applications with Datadog too. I would like to enable all features in my DPN portal and get access to custom demos. We resell Datadog and a full stack of pre-sales, sales, and post-sales services. We have customers for all sectors, including governmental, financial services, services in general, telecom, et cetera. Today, we are the biggest Datadog partner in Brazil, and we are searching for an expansion in our MSP environment.
The solution is primarily used for better understanding the health of applications, modern environments, and many other solutions, which are the main focus of Datadog and many other monitoring tools. With Datadog specifically, I can look at the health of the technology stack and services, and also integrate multiple metric sources, security, business data, and much more. This makes it a real software solution for centralizing data and unifying monitoring silos in one place. Datadog is like a hub - not just a monitoring software.
We're moving towards the cloud yet still have several active data center contracts. As we move to the cloud, we are interested in knowing more about our services, and Datadog APM/logs should give us this perspective. We currently use the infrastructure monitoring part of Datadog. Still, I've really seen the advantage of moving more data into the cloud for comparison and being able to have one place where we can view all related pieces of information regarding a possible incident or potential issue.
We are using a mixture of on-prem and cloud solutions to bridge the gap with healthcare entities in the service of providing patients with the medication they need to live healthy lives. Since we're a heavily regulated company, a lot of our solutions grew from on-premises monoliths. However, as we scaled out, it became harder and harder to move forward with that architecture. Today, we're investing heavily in transforming our systems from monoliths into distributed systems. With this change in mind, the ability for us to connect the dots using Datadog has been invaluable.
We primarily use the solution for security monitoring and anomaly detection.
I use the solution to manage security-related logs and metrics, as well as create detection rules for security events. I am a security engineer, so one area of interest is the CSPM product, giving us the ability to look at findings across the cloud environment. The great part about the Datadog security products is that they incorporate the context of the resources/hosts where the security event is found. This allows us to see exactly what is running on a host that we see as a security alert.
We are using the solution for migrating out of the data center. Old apps need to be re-architected. We are planning on moving to multi-cloud for disaster recovery and to avoid vendor lock-in. The migration is a mix between an MSP (Infosys) and in-house developers. The hard part is ensuring these apps run the same in the cloud as they do on-premises. We also need to ensure that we improve performance when possible. With deadlines approaching quickly, it's important not to cut corners, which is why we needed observability.
We use Datadog for observability and monitoring primarily. Various cross-functional teams have built dashboards, including Developers, QA, DevOps, and SRE. There are also some dashboards created for senior leadership to keep tabs on day-to-day activities like cost, scale, issues, etc. We've also set up monitors and alarms that kick off when any metric goes beyond its threshold. With Slack and PagerDuty integration, the correct team members get alerted and react to solve the issue based on various runbooks.
We primarily use the solution for observability, metrics, logs, tracing, and end-to-end user flow monitoring. We are looking to implement this as a company-wide standard for cloud solutions. At this time, we're in a POC, and we're interested in using either the Datadog agent or the OTel agent with a Datadog exporter. We have dashboards with panels that correlate metrics and allow you to link through to traces, and flame graphs that show latency across services and their various spans. While we are not security-minded, we still require it and are interested in more. It's used for monitoring critical systems.
This solution is for physical device monitoring across breweries, including PLCs, HMI Cameras, RFID panels, scales, etc. We want to gain visibility into these devices to influence predictive maintenance and unscheduled downtime. We want to monitor physical devices across the zone from a control tower perspective for end users and support teams alike. Understanding more about the performance of the devices and mechanical components will allow us to schedule downtime to fix imminent catastrophic failures and prevent unplanned downtime and lost revenue.
We primarily use the solution for the service catalog. We use this type of offering for our microservices applications, and it gives a good view of the flow. It is a must when we have different developers working on different services. Having the trace and log features is useful for helping the on-call person locate the right microservice. We would like to see some more useful applications for health monitoring where we can customize the cases based on data from the database. It needs the facility to monitor data inside tables and the status of the UI.
I primarily use the solution to learn, watch, and monitor business and engineering metrics in my team's production and QA environments. We create monitors on key business metrics and observe regressions and anomalies. Less often, I leverage the events ability in Datadog to get notified about significant activities happening in my team's deployments. We learn about Datadog monitor alerts through Slack and often create SLOs using Terraform. We use APM for observability. Most recently, I learned about Watchdog alerts, which I will be looking into heavily.
We primarily use the solution for observability.
We primarily use the solution for charting application metrics. We use it for all our application metrics, host metrics, and monitors with a PagerDuty integration. We integrate our application logs. It is great to be able to tie our metrics and our traces together. We use the APM module with traces. Being able to link APM, logs, and metrics in one go shortens our troubleshooting and RCA dramatically. We are loving the tool; it is great to have all those insights in one place. We hope they keep making my life and our engineers' lives easier.
We primarily use the solution for monitoring and telemetry. We use lots of log collections, log-based metrics, and dashboard visualization. The logging, metrics, and APM are vital.
We share dashboards, set up alerts, and monitor everything that happens in our system. We use it in staging, features, production, and our load test environment. It is exceptionally helpful for making our engineering more data-driven. I came from a company that believes we should focus on being telemetry driven. Instilling this in a smaller, less mature engineering organization has been challenging. However, it is much easier while using Datadog.
We are using the solution from a monitoring and management perspective. We use it for alerts.
Our company deploys the solution for our customers as an observability tool to define SLOs and SLIs along with logs and metrics. The solution includes incident, post-mortem, and root cause analysis that provides a level of truth for incidents and issues with applications. We have SREs and teams in operations, management, and applications who all have access to the solution and ensure proper integrations.
My customers were using Datadog for monitoring purposes. They were using it only because the solution runs on AWS and it's a microservices-based solution. They were using an application called Dynatrace for their logs.
We are evaluating Datadog for the observability and monitoring requirements that we have in our company. In our use case, our intention is to provide a framework for multiple app teams to use the tool for our observability and engineering practices.
We have used this solution primarily for application performance monitoring. To do this, we needed to make sure we had the right data in the system so that people could be able to monitor their applications end-to-end.
We use it for our infrastructure network and servers.
We use it for monitoring and instrumentation of security. We secure our databases and servers. It is typically for the security of apps, services, and systems. We are using its latest version.
We're in the process of doing a Proof of Concept with the solution right now.
We mostly use it to handle log aggregation, monitor our web application, and alert us on data pipeline failures. Our system is fully on AWS, so we pipe all of our CloudWatch logs into Datadog to have a central place to index and search logs. Our web app is built on an Elastic Beanstalk backend, and we use the Datadog agent to keep track of all of the requests that hit our backend and all of their components. We also use the prebuilt AWS pipeline dashboards to monitor our batch jobs and Lambdas.
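As an illustration of how a Python backend like the one described can surface every request to the Datadog agent, here is a hedged sketch using ddtrace auto-instrumentation; the Flask app and route are stand-ins, not this reviewer's actual stack.

```python
# Sketch: auto-instrument supported libraries so each request produces an APM
# trace shipped through the local Datadog agent.
from ddtrace import patch_all
patch_all()  # monkey-patches supported libraries (Flask, requests, ...)

from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    return "ok"

# Alternatively, run an unmodified app under the wrapper:
#   ddtrace-run python app.py
```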
We primarily use the solution for log monitoring across our entire cloud infrastructure (EB, EC2, Batch, and Lambda). This is in addition to RStudio Workbench, which has its own logs that would not be picked up via CloudWatch (docs.rstudio.com). We own several dozen of these servers, and we used to manage instance logs by tailing logs when incidents occurred. Datadog allows for much better visibility across our entire fleet and has saved us countless hours.
We primarily use the dashboards and metrics with many tags.
Our company is transitioning to using the solution for monitoring and analytic services we provide to customers. Once fully rolled out, there will be 80-100 users companywide.
We use this solution for our customer's IP and to support their cloud infrastructure.
Datadog is a SaaS solution we tried for URL and synthetic monitoring. You record a transaction going into a website and replay that transaction from various locations. Datadog is mainly used by the admin, but three or four other guys had access to the reports and notifications, so it's five altogether. We probably tried no more than 8 percent of what Datadog can do. There are so many other bits and modules. I've only gone into about half of what APM can do in the Datadog stack.
One of the things we use it for is the same thing that we use FullStory for, which is to replay customer interactions with our platform. However, it also does the monitoring. It's like monitoring cloud tools. We're really mostly monitoring our own software to make sure that everything is functioning properly. We can check a bunch of things, and we can even play back customer sessions. It’s basically monitoring our application.
The solution is basically used for servers and applications.
Our primary use case is log management and we also use the solution for monitoring the application and underlying infrastructure. I'm an IT test manager.
We use Datadog to monitor our product on the cloud.
I'm not sure which version we're using, although I believe it to be the latest. We essentially use the solution in advance of performance testing, performance monitoring, and troubleshooting.
We have a web infrastructure that uses Amazon Web Services containers with everything included, and we use Datadog to monitor them all.
We implement these solutions for our clients. We have implemented Datadog as an SIEM solution.
I am using Datadog for error reporting.
I used Datadog typically for monitoring website statistics and some of the cloud networking equipment.
We are currently testing it. If the testing goes well, we'll purchase the full version, and it will probably be our main monitoring tool. We plan to use it for monitoring our activities and the attacks on our systems or network.
I implement this solution for clients.
We primarily use this product for availability and performance monitoring, and log aggregation.
We use Datadog for application monitoring, to help identify errors. It is also used to monitor application performance.
I'm a senior cloud security engineer and we are customers of Datadog.
We used Datadog to capture the telemetry of our AWS fleet of around 1,200 servers.
* Monitoring
* Analytics
* Tracing
* APM
We use it for notifications, alerting, and capturing most of the information from Amazon, such as EC2 instances.
We mainly use it to send metrics about CPU and memory usage, in addition to the number of file descriptors on a socket.
We use it to monitor our infrastructure, particularly our different EC2 instances, and our containers. We also use it to capture our logs.
We use it to store editorial content. We started out on the on-premise version, then moved to the AWS version.
We use it for custom metrics of our applications and monitoring of our systems.