We use Splunk APM to understand the inner workings of our cloud-based and on-premises applications. We use the solution mainly for troubleshooting and for understanding where the bottlenecks and limits are. It's not used for monitoring purposes or for sending an alert when the number of calls goes above or below some threshold.
The solution is used more for understanding where your bottlenecks are, so it's used more for observability than for pure monitoring.
The solution's service map feature gives us a holistic overview and lets us see quickly where the issues are. It also allows us to look at every session, regardless of the sampling policy, and see whether a transaction contains any errors. We also use it when we instrument real user monitoring on the front end and then follow those sessions back into the back-end systems.
Splunk APM should include better correlation between resources and infrastructure monitoring. The solution should make it easier to define service level indicators and service level objectives. It should also let you define workloads, so that an environment can be divided into, for example, a back-end area and an integration area, making it easier to see the service impact of a problem.
I've been using Splunk APM in my current organization for the last 2 years, and I've used it for 4-5 years in total.
Splunk APM is a remarkably stable solution. We have only encountered one ingestion outage, which was very clearly explained and well handled by the Splunk team.
I rate the solution a 9 out of 10 for stability.
Around 50 to 80 users use the solution in our organization. The solution's scalability fits what we are paying for; at that level, we have discovered both the soft limit and the hard limit of our environment. I would say we are abusing the system in terms of how scalable it is, and considering what we are paying for, we are able to use the landscape very well.
We have plans to increase the usage of Splunk APM.
Splunk support itself leaves room for improvement. We have excellent support from the sales team, the sales engineers, the sales contact person, and our customer success manager. They are our contacts when we need to escalate any support tickets. Since Splunk support is bound not to touch the customer's environment, they cannot fix issues for us. It's pretty straightforward to place a support ticket.
We have previously used AppDynamics, Dynatrace, and New Relic. We see more and more that Splunk APM is the platform for collaboration. New Relic is more isolated, and each account or team has its own part of New Relic. Within a single account it's very easy to correlate and find the data, but collaborating across teams, their data, and their different accounts is very troublesome.
With Splunk APM, there is no sensitivity around the data. We can share the data and find a way to agree on how to collaborate. Even if two environments are named differently, we can still work together without affecting each other's operations.
If you're using the more common languages, the initial deployment of Splunk APM is pretty straightforward.
The solution's deployment time depends on the environment. If the team uses cloud-native techniques such as Terraform and Ansible, it's pretty straightforward. A normal engagement is within a couple of weeks. Once you have assessed the tools needed and looked at the architecture, the deployment time is very minimal. Most of the time spent internally is caused by our own overhead.
We have a very good dialogue with our vendor for Splunk APM. We have full transparency regarding the different license and cost models. We have found a way to handle both the normal average load and the high peaks that some of our tests can cause. Splunk APM is a very cost-efficient solution. We have also changed from a host-based license model to a more granular way of measuring, such as the number of metric time series or the traces analyzed per minute.
We have quite a firm policy that every cost incurred within Splunk must be attributable to an IT project or a team, so we can see who the biggest cost drivers are. Under our current model, we are buying capacity, and we eventually want to move to a pay-as-you-go model. We cannot use that currently because we have renewed our license for only one year.
We are using Splunk Observability Cloud as a SaaS solution, but we have implemented Splunk APM on-premises, hybrid, and in the cloud. We are using it for Azure, AWS, and Google. Initially, the solution's implementation took a couple of months. Now, we are engaging more and more internal consumers on a weekly basis.
We instrument our code and services and send the data into the Splunk Observability Cloud. This helps us understand who is talking to whom, where the latencies are, and which transactions between services produce the most errors.
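As a rough illustration of what that instrumentation looks like, here is a minimal sketch using the OpenTelemetry Python SDK, assuming a local OpenTelemetry Collector on the default OTLP gRPC port forwards traces on to Splunk Observability Cloud; the service name, environment tag, and attributes are hypothetical placeholders, not our actual configuration.

```python
# Minimal OpenTelemetry tracing sketch (Python SDK). Assumes a local Collector
# on the default OTLP gRPC port relays spans to Splunk Observability Cloud.
# All names and attribute values below are illustrative only.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Tag every span with the service identity so APM can build the service map.
resource = Resource.create({
    "service.name": "checkout-service",       # hypothetical service name
    "deployment.environment": "production",   # hypothetical environment tag
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each unit of work becomes a span; latency and errors recorded here are what
# show up per transaction in APM.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "A-12345")  # hypothetical attribute
```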
Most of the time, we do verification tests in production to see if we can scale up the number of transactions to a system and handle the volume the business wants us to handle at a certain service level. It's both for verification and for understanding where the slowness occurs and how it propagates through the different services.
We have full fidelity and the totality of the information in the tool, and we don't need to worry about big variations in values. We can assess and see all the data. Without the solution's trace search and analytics feature, you would be completely blind. It's critical because it's about visibility into and understanding of your service.
Splunk APM offers end-to-end visibility across our environment because we use it alongside both synthetic monitoring and real user monitoring. What we are missing today is the correlation to logs. We can connect to Splunk Cloud, but we are missing role-based access control on the logs so that each user can see their related logs.
Visualizing and troubleshooting our cloud-native environment with Splunk APM is easy. A lot of out-of-the-box knowledge is available, preset for looking at certain standard data sets. That applies not only to APM but also to the pre-built dashboards that come with it.
We are able to use distributed tracing with Splunk APM, and it covers the totality of our landscape, as the sketch below illustrates. A lot of different teams can coexist, work with the same type of data, and easily correlate it with other systems' data. So, it's a platform for us to collaborate and explore together.
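To show how spans from different teams' services end up in the same distributed trace, here is a minimal sketch of trace-context propagation with the OpenTelemetry Python API; the service names and URLs are hypothetical and not taken from our actual landscape.

```python
# Sketch of context propagation between two hypothetical services so that
# their spans join one distributed trace in APM.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("order-service")  # hypothetical service name

def call_downstream():
    # Caller side: inject the current trace context into the outgoing headers.
    headers = {}
    with tracer.start_as_current_span("call-payment-service"):
        inject(headers)
        requests.get("http://payment-service/charge", headers=headers)  # placeholder URL

def handle_request(incoming_headers):
    # Callee side: extract the caller's context so this span continues the same trace.
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("charge", context=ctx):
        pass  # do the work; this span links back to the caller's trace
```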
We use Splunk APM Trace Analyzer to better understand where the errors originate and the root cause of the errors. We use it to understand whether we are looking at the symptom or the real root cause. We identify which services have the problem and understand what is caused by code errors.
The Splunk Observability Cloud as a platform has improved over time. It allows us to use profiling together with Splunk Distribution of OpenTelemetry Collector, which provides a lot of insights into our applications and metadata. The tool is now a part of our natural workbench of different tools, and it's being used within the organization as part of the process. It is the tool that we use to troubleshoot and understand.
Our organization's telemetry data is interesting not only from an IT operations perspective but also for understanding how the tools are being used and how they provide value for the business. It is a multifaceted view of the data we have, which is generated and collected by the solution.
Splunk APM has helped reduce our mean time to resolve. Something that used to take 2-3 weeks to troubleshoot is now done within hours. Splunk APM has also freed up resources when we troubleshoot. Previously, if we spent a lot of time troubleshooting something and couldn't find the problem, we couldn't close the ticket by saying there was no resolution. With Splunk APM, we now know for sure where the problem is rather than just ignoring it.
Splunk APM has saved our organization around 25% to 30% of our time. It's partly about moving away from firefighting toward being preventive and planning more for the future. That's why we are using it for performance. The solution allows us to help and support the organization during peak hours and to address bottlenecks preventatively rather than identifying them afterward.
Around 5-10 people were involved in the solution's initial deployment. The solution is not part of the developers' IDE environment and is not tightly integrated with our existing DevOps tools. We are structured into both subdomains and teams, which normally also compartmentalize the environment, and we use the solution in those different environments.
Splunk APM requires some life cycle management, which is natural. In general, once you have set it up, you don't need to put much effort into it. I would recommend Splunk APM to other users, mainly because of how you can collaborate on the data rather than isolating it. That is a huge advantage with Splunk. We are currently using Splunk, Sentry, and New Relic, and part of our tool strategy is to move to Splunk.
As a consumer, you need to consider whether you are going to rely on OpenTelemetry as part of your standard observability framework. If that is the case, you should go for Splunk because Splunk is built on OpenTelemetry principles.
Tools that use proprietary agents and proprietary techniques may give you more insight into some implementations, but you will have tighter vendor lock-in and won't have portability of the back end. If you rely on OpenTelemetry, then Splunk is the tool for you.
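As a small illustration of that portability, the sketch below assumes the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable: pointing the same OpenTelemetry instrumentation at a different OTLP-compatible back end becomes a configuration change rather than a code change. The endpoint value is a placeholder.

```python
# Portability sketch: only the OTLP endpoint changes when you swap back ends;
# the application code and instrumentation stay exactly the same.
import os
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Standard OpenTelemetry environment variable selects the destination
# (local Collector, Splunk, or any other OTLP-compatible back end).
endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
exporter = OTLPSpanExporter(endpoint=endpoint)
```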
Overall, I rate the solution a 9 out of 10.