What is our primary use case?
We use Datadog across the enterprise for observability of infrastructure, APM, RUM, SLO management, alert management and monitoring, and other features. We're also planning on using the upcoming cloud cost management features and product analytics.
For infrastructure, we integrate with our Kube systems to show all hosts and their data.
For APM, we use it with all of our API and worker services, as well as cronjobs and other Kube deployments.
We use serverless to monitor our Cloud Functions.
We use RUM for all of our user interfaces, including web and mobile.
How has it helped my organization?
It's given us the observability we need to see what's happening in our systems, end to end. We get full stack visibility from APM and RUM, through to logging and infrastructure/host visibility. It's also becoming the basis of our incident management process in conjunction with PagerDuty.
APM is probably the most prominent place where it has helped us. APM gives us detailed data on service performance, including latency and request count. This drives all of the work that we do on SLOs and SLAs.
RUM is also prominent and is becoming the basis of our product team's vision of how our software is actually used.
What is most valuable?
APM is a fundamental part of our service management, both for viewing problems and improving latency and uptime. The latency views drive our SLOs and help us identify problems.
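As a rough illustration of how latency data can feed an SLO, the sketch below computes a latency SLI (the share of requests at or under a threshold) and the share of the error budget still unspent. The function names, the sample latencies, the 250 ms threshold, and the 99% target are all illustrative assumptions. None of this is a Datadog API call or a real configuration.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests that completed at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def error_budget_remaining(sli: float, target: float) -> float:
    """Share of the error budget still unspent (negative once the SLO is missed)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_bad = 1.0 - target  # fraction of requests allowed to be slow
    actual_bad = 1.0 - sli      # fraction of requests that were slow
    return 1.0 - actual_bad / allowed_bad


# Hypothetical request latencies in milliseconds, not real measurements.
sample = [120.0, 180.0, 240.0, 900.0]
sli = latency_sli(sample, threshold_ms=250.0)
print(f"SLI: {sli:.2%}")
print(f"Error budget remaining against a 99% target: {error_budget_remaining(sli, 0.99):.2f}")
```

A negative remaining budget means the service is already out of SLO for the window being measured.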
We also use APM and metrics to view the status of our Pub/Sub topics and queues, especially when dealing with undelivered messages.
RUM has been critical in identifying what our users are actually doing, and we'll be using the new product analytics tools to research and drive new feature development.
All of this feeds into the PagerDuty integration, which we use to drive our incident management process.
What needs improvement?
Sometimes the solution changes features so quickly that the UI keeps moving around. The cost is pretty high. Outside of that, we've been relatively happy.
The APM service catalog is evolving fast. That said, it is redundant with our other tools and doesn't allow us to manage software maturity. However, we do link it with our other tools using the APIs, so that's helpful.
Product analytics is relatively new and based on RUM, so it will be interesting to see how it evolves.
Some of the graphs can take a while to load, depending on the size of the data window.
Some stock dashboards don't allow customization. You need to clone them first, but this can lead to an abundance of dashboards. Also, there are some things that stock dashboards do that can't yet be duplicated with custom dashboards, especially around widget organization.
The "top users" widget on the product analytics page only groups by user email, which is unfortunate, since user ID is the field we use to identify our users.
For how long have I used the solution?
I've used the solution for three and a half years.
What do I think about the stability of the solution?
The solution is pretty stable.
What do I think about the scalability of the solution?
The solution is very scalable.
How are customer service and support?
Support was excellent during the sales process, with a huge dropoff after we purchased the product. Only recently (within the past year) has it begun to reach acceptable levels again.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We did not have a global solution. Some teams were using New Relic.
How was the initial setup?
The instructions aren't always clear, especially when dealing with multiple products across multiple languages. The tracer works very differently from one language to another.
What about the implementation team?
We handled the setup in-house.
What's my experience with pricing, setup cost, and licensing?
We have built our own set of installation instructions for our teams, to ensure consistent tagging and APM setup.
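One way a consistent-tagging rule could be checked automatically is sketched below. It assumes that "consistent tagging" means Datadog's unified service tagging, which uses the `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` environment variables. That reading is an assumption, and the helper name and the example values are hypothetical.

```python
from typing import Mapping

# Environment variables used by Datadog's unified service tagging.
# Treating these three as the required set is an assumption for this sketch.
REQUIRED_TAG_VARS = ("DD_ENV", "DD_SERVICE", "DD_VERSION")


def missing_unified_tags(env: Mapping[str, str]) -> list[str]:
    """Return the required variables that are absent or blank in `env`."""
    return [name for name in REQUIRED_TAG_VARS if not env.get(name, "").strip()]


# Hypothetical environment for a single deployment; the values are made up.
example_env = {"DD_ENV": "prod", "DD_SERVICE": "checkout-api"}
print(missing_unified_tags(example_env))
```

For the example above, the check reports `DD_VERSION` as missing. A check like this could run in CI against each deployment's environment before rollout.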
Which other solutions did I evaluate?
We did look at Dynatrace.
What other advice do I have?
The service was great during the initial testing phase, but once we bought the product, the quality of service dropped significantly. In the past year or so, it has improved and is now approaching the level we'd expect based on the cost.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Google
Disclosure: My company does not have a business relationship with this vendor other than being a customer.