What is our primary use case?
Our primary use case is log aggregation, performance tracing, and alerting for custom and vendor-supplied web applications.
We run a mix of AWS EC2, Azure serverless, and colocated VMware servers to support higher education web applications.
Managing a hybrid multi-cloud solution across hundreds of applications is always a challenge.
Datadog agents on each web host and native integrations with GitHub, AWS, and Azure get all of our instrumentation and error data in one place for easy analysis and monitoring.
How has it helped my organization?
Using Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards.
Whether the app is vendor-supplied or we built it ourselves, we can get the depth of tracing, profiling, and log integration we need and tune it to fit. Both legacy .NET Framework apps with Windows Event Viewer logs and newer .NET Core apps with streaming logs work.
The breadth of coverage for any app type or situation is really incredible. It feels like there's nothing we can't monitor.
What is most valuable?
When it comes to Datadog, several features have proven particularly valuable. For example, the centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing has been a game-changer, allowing us to catch potential problems before they impact real users.
Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most. And the ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
Together, these features form a powerful toolkit that helps us maintain high performance and reliability across our applications and infrastructure, ultimately leading to better user satisfaction and more efficient operations.
What needs improvement?
The Android and iOS apps need to be expanded to provide a simplified CI/CD pipeline history view.
I like the idea of monitoring on the go. That said, it seems the options are still a bit limited out of the box.
While the documentation is very good considering all the frameworks and technology Datadog covers, there are areas, specifically .NET profiling and tracing of IIS-hosted apps, where it takes a lot of careful reading to pick up the key details.
In some cases, the screenshots no longer match the text after updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly because of environment variables.
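For anyone hitting the same wall, here is a rough sketch of the idea behind log-to-trace correlation. It is written in Python rather than .NET purely for illustration and uses only the standard library. It assumes the usual unified service tagging variables (DD_ENV, DD_SERVICE, DD_VERSION) and the dd.* log attributes Datadog uses for correlation. The helper names (correlation_fields, JsonCorrelationFormatter, format_example), the sample IDs, and the sample service name are all made up for this example. In a real setup the tracer normally injects these fields for you when log injection is enabled, so treat this as a picture of what ends up in the log line, not as the way to set it up.

```python
import json
import logging
import os


def correlation_fields(trace_id, span_id, environ=None):
    """Build the dd.* attributes that tie a log line to a trace.

    Hypothetical helper: a real tracer would supply the IDs itself.
    """
    env = os.environ if environ is None else environ
    return {
        # These three come from environment variables; if they differ
        # between the tracer and the logger, logs and traces will not line up.
        "dd.env": env.get("DD_ENV", ""),
        "dd.service": env.get("DD_SERVICE", ""),
        "dd.version": env.get("DD_VERSION", ""),
        "dd.trace_id": str(trace_id),
        "dd.span_id": str(span_id),
    }


class JsonCorrelationFormatter(logging.Formatter):
    """Format records as JSON with correlation fields attached."""

    def __init__(self, get_ids, environ=None):
        super().__init__()
        self._get_ids = get_ids      # callable returning (trace_id, span_id)
        self._environ = environ      # optional mapping, defaults to os.environ

    def format(self, record):
        trace_id, span_id = self._get_ids()
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(correlation_fields(trace_id, span_id, self._environ))
        return json.dumps(payload)


def format_example():
    """Format one record with made-up IDs and tag values."""
    formatter = JsonCorrelationFormatter(
        get_ids=lambda: (1234, 5678),
        environ={
            "DD_ENV": "staging",
            "DD_SERVICE": "enrollment-web",
            "DD_VERSION": "1.4.2",
        },
    )
    record = logging.LogRecord(
        "app", logging.INFO, "example.py", 0, "page rendered", None, None
    )
    return formatter.format(record)


if __name__ == "__main__":
    print(format_example())
```

The takeaway for us was that the env, service, and version values have to be set consistently everywhere the app runs, otherwise the logs and traces show up but do not link to each other.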
For how long have I used the solution?
I've used the solution for about three years.
What do I think about the stability of the solution?
We have been impressed with the uptime and with the clean, light resource usage of the agents.
What do I think about the scalability of the solution?
The solution has been very scalable and very customizable.
How are customer service and support?
Support is always helpful in tuning our committed costs and alerting us when we start spending out of the on-demand budget.
Which solution did I use previously and why did I switch?
We used a mix of a custom error email system, SolarWinds, UptimeRobot, and GitHub Actions. We switched to find one platform that could give deep app visibility whether an app runs on Linux, Windows, or containers, and whether it is hosted in the cloud or on-prem.
How was the initial setup?
The implementation is generally simple. That said, .NET profiling of IIS and aligning logs to traces and profiles were a challenge.
What about the implementation team?
The solution was implemented in-house.
What was our ROI?
Our ROI has been the significant time the development team saves when assessing bugs and performance issues.
What's my experience with pricing, setup cost, and licensing?
Set up live trials to assess cost scaling. Small decisions around how monitors are used can impact cost scaling.
Which other solutions did I evaluate?
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
What other advice do I have?
We are excited to explore the new LLM-related offerings further and continue to expand our use of Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure
Disclosure: I am a real user, and this review is based on my own experience and opinions.