What is our primary use case?
We primarily use Datadog to monitor EC2 instances and ECS containers running mostly Rails applications that host a SaaS product. We also monitor Elasticsearch and RDS, and we are working on adding their Application Performance Monitoring (APM) solution to monitor our applications directly.
We use Datadog to create dashboards, graphs, and alerts based on interesting metrics. Datadog is the first place we look to understand how our system is performing.
We also use their logging platform, and it works well. It is especially useful that logs and metrics are tightly integrated, so you can jump between them easily.
How has it helped my organization?
Developers are able to see how code is running in production, which was mostly opaque before we implemented Datadog. We are able to emit custom metrics that are specific to our business, and the built-in metrics have also proven useful. Having a wealth of information has helped us investigate outages, and having historical data helps us tune our system.
DevOps engineers are able to put sensors around our system to proactively detect problems, whereas before, our engineers heard about problems from customers. Logs are easier to find for developers.
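As a rough illustration of the custom business metrics mentioned above, here is a minimal Python sketch that sends a counter to a local Datadog Agent using the plain-text DogStatsD datagram format over UDP. It assumes an Agent is listening on the default DogStatsD port (8125) on localhost. The metric name, tags, and helper function names are hypothetical and chosen only for the example; this is not our production code, and in practice an official client library would normally be used instead.

```python
import socket

# Assumption: a local Datadog Agent accepts DogStatsD traffic on its
# default UDP port. The metric and tag names used below are hypothetical.
DOGSTATSD_ADDR = ("127.0.0.1", 8125)


def format_metric(name, value, metric_type="c", tags=None):
    """Build a DogStatsD datagram of the form name:value|type|#tag1,tag2."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(name, value, metric_type="c", tags=None, addr=DOGSTATSD_ADDR):
    """Send one metric over UDP (fire-and-forget, no delivery guarantee)."""
    payload = format_metric(name, value, metric_type, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, addr)


if __name__ == "__main__":
    # Count one business event, tagged so it can be sliced in a dashboard.
    send_metric("billing.invoices.created", 1, "c", ["plan:pro", "env:production"])
```

Because the metric travels over UDP, emitting it does not wait on the Agent, which is why this style of instrumentation is cheap to sprinkle through application code.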
What is most valuable?
Metric graphing and dashboards are the most valuable features because they give us good observability into our system and work well to alert us when interesting things happen. We use this functionality daily.
We value the monitoring capability because it pushes alerts to us, so we do not have to watch graphs continually. The integrations with Slack and PagerDuty let us be interrupted when it is appropriate and keep a running tab on the system without being bothered unnecessarily.
The online process monitoring has been extremely helpful, as it gives engineers the ability to see the live status of all the processes running on our systems without having to log in.
What needs improvement?
Their logging solution is expensive for our use case. They do have the capability to rehydrate old or incomplete logs, and it works, but I would rather not have to think about that operation.
Datadog has a lot of documentation, but much of it assumes you already know how the service works, which can lead to confusion. On the positive side, they do have lots of documentation; it just needs better curation.
Their APM solution still needs some work, but they are actively developing it. I would also like to see more database-specific application monitoring.
For how long have I used the solution?
I have been using Datadog for five years across two companies.
What do I think about the stability of the solution?
Any issues are addressed and communicated very quickly. I have not had any issues with uptime.
What do I think about the scalability of the solution?
If you do not need 100% of your data (logs, APM traces, etc.), this scales well. It does not scale as well if you want 100% of your logs indexed. You should understand every usage-based charge before using any part of their service, as it is very easy to run up a large bill.
The performance of the system scales very well, and host monitoring and APM are relatively cheap.
How are customer service and technical support?
Account support is excellent.
Customer support is good if you can get them to go beyond pointing you to the right documentation.
Which solution did I use previously and why did I switch?
Previously, I used homebuilt solutions with Nagios and Cacti but found that there was far too much work to understand them and keep them up and fed compared to the value that I got. They also did not integrate well with existing data sources without a lot of effort.
I also previously used Stackdriver and found it too opinionated. I like that Datadog gives you tools to work with certain types of data and make your own graphs, monitors, etc., whereas with Stackdriver, I felt there were only a limited number of ways to accomplish a goal.
How was the initial setup?
The basic setup is easy. A more advanced setup can be tricky because the documentation assumes you know how the system works already. Support is somewhat helpful, but mostly points out the documentation you should already have found.
What about the implementation team?
What's my experience with pricing, setup cost, and licensing?
My advice is to understand what number of hosts and data you want to commit to. Beware that usage-based billing is both a blessing and a curse. It is easy to run up a large bill, so become familiar with the cost of each piece of your bill and use the metrics they supply to estimate and monitor your bill.
I have had good luck with their support team helping us to figure out the correct commit levels. Their account support is excellent in this regard. I have heard their sales team can be aggressive, but I have not experienced it personally.
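To make the advice about estimating the bill concrete, here is a small Python sketch of the arithmetic. It assumes a simplified billing model in which committed units are paid for regardless of use and anything above the commit is charged at a higher on-demand rate. Every number, rate, and function name in it is a hypothetical placeholder, not Datadog's actual pricing; substitute the figures from your own contract and usage metrics.

```python
def project_month_end(usage_to_date, day_of_month, days_in_month):
    """Linearly project month-end usage from month-to-date usage."""
    return usage_to_date * days_in_month / day_of_month


def estimate_cost(projected_usage, committed_units, committed_rate, on_demand_rate):
    """Estimate a monthly charge for one usage-based line item.

    Assumption: committed units are billed whether or not they are used,
    and usage above the commit is billed at the on-demand rate.
    """
    overage = max(0.0, projected_usage - committed_units)
    return committed_units * committed_rate + overage * on_demand_rate


if __name__ == "__main__":
    # Hypothetical example: 400 GB of logs ingested by day 10 of a 30-day month.
    projected_gb = project_month_end(400, day_of_month=10, days_in_month=30)
    cost = estimate_cost(projected_gb, committed_units=1000,
                         committed_rate=0.10, on_demand_rate=0.15)
    print(f"Projected usage: {projected_gb:.0f} GB, estimated charge: ${cost:.2f}")
```

Running a projection like this for each line item partway through the month shows which commit levels are too low or too high before the invoice arrives.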
Which other solutions did I evaluate?
I originally chose Datadog because of my previous experience with it. We recently considered moving to New Relic because we liked their APM solution better. However, New Relic's pricing and our familiarity with Datadog won out. New Relic is a good product, but it didn't fit our overall needs as well as Datadog.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Disclosure: I am a real user, and this review is based on my own experience and opinions.