Name: Splunk Observability Cloud
Brand: Splunk
Rating: 4.1 (86 reviews)

Ernesto Gutierrez

Solutions Architect at Ikusi

Sep 13, 2025

Deployment optimized and demos delivered faster for the retail sector thanks to customizable dashboards

Pros and Cons

"The feature of Splunk Observability Cloud that I prefer most is the easy deployment on the cloud."

"In terms of additional features I would want to see in future releases, since Cisco acquired Splunk, more Cisco integration could be beneficial."

What is our primary use case?

For the retail sector, we are building a solution for customer stores in order to know how the products are sold.

What is most valuable?

The feature of Splunk Observability Cloud that I prefer most is the easy deployment on the cloud. The benefit of that feature for my organization is that it streamlines deployment and implementation and speeds up our response to customers, letting us put a demo together quickly. Splunk Observability Cloud has helped improve our operational performance, especially for our customers.

My experience with the out-of-the-box customizable dashboards provided by Splunk Observability Cloud is that they are effective in showcasing IT performance to business leaders. For the initial point of contact, they help and work nicely as a starting point. Then, you have the basics and use them as a framework to deploy others, so they are very helpful.

What needs improvement?

Splunk Observability Cloud can be improved. In terms of additional features I would want to see in future releases, since Cisco acquired Splunk, more Cisco integration could be beneficial.

For how long have I used the solution?

I have been using Splunk Observability Cloud for the last two years.

Buyer's Guide

Splunk Observability Cloud

April 2026

Free Report: Splunk Observability Cloud Reviews and More

Learn what your peers think about Splunk Observability Cloud. Get advice and tips from experienced pros sharing their opinions. Updated: April 2026.

893,244 professionals have used our research since 2012.

What do I think about the stability of the solution?

I have not experienced any downtime, crashes, or performance issues.

What do I think about the scalability of the solution?

Splunk Observability Cloud scales very well with the growing needs of my organization, as we just need to add a license or data ingestion.

How are customer service and support?

I would evaluate customer service and technical support for Splunk Observability Cloud as good. They respond effectively and in time.

Which solution did I use previously and why did I switch?

Prior to adopting Splunk Observability Cloud, we used other solutions to address similar needs, such as Dynatrace and ElasticSearch.

How was the initial setup?

It is easy to deploy on the cloud.

What was our ROI?

I have not seen a return on investment with Splunk Observability Cloud yet, as we are relatively new to it.

What's my experience with pricing, setup cost, and licensing?

My experience with pricing, setup cost, and licensing of Splunk Observability Cloud is that it is somewhat expensive, considering I am from Mexico and the market in Mexico is very different from the market in the USA. It is expensive, especially when there are other vendors that offer something similar for much cheaper.

Which other solutions did I evaluate?

The factors that led me to consider the change to Splunk Observability Cloud include performance and cost, and it depends on the customer. If the customer is a network user or partner with all Cisco solutions, Splunk Observability Cloud fits perfectly.

However, if we have a new customer that doesn't have any Cisco products, it might be better for them to use another solution that is easier to deploy and not as complete as Splunk Observability Cloud, especially if they only need one or two features.

What other advice do I have?

My advice to other organizations considering using Splunk Observability Cloud is that if you want a comprehensive, consistent tool or solution, it is one of the leaders in the market because it integrates with the network side of their organization, including Cisco solutions. Regarding customers who don't come from the Cisco world, it is a good choice, depending on their use. However, for small customers or those that are not large companies, Splunk Observability Cloud may not be the best fit, as it is a comprehensive tool. In Mexico, we observe that customers claim they only need APM or infrastructure monitoring, a very basic requirement, and don't require the entire Splunk portfolio.

On a scale of one to ten, I rate Splunk Observability Cloud a nine.

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: My company has a business relationship with this vendor other than being a customer: Partner

Last updated: Sep 13, 2025

Juan Baez

Software Engineer at UKG

Sep 11, 2025

Dashboards have provided a central place to visualize and manage large volumes of log data

Pros and Cons

"The dashboards are the features of Splunk Observability Cloud that I appreciate the most, providing visual representation of all data and text, which has benefited my organization by speeding up people's jobs and allowing a place to monitor all logs, as there are usually thousands of entries coming in which can become very disorderly."

"The main improvement I would suggest for Splunk Observability Cloud would be offering the ability to implement custom apps, specifically allowing Python scripts that Splunk Cloud could host."

What is our primary use case?

My main use cases for Splunk Observability Cloud are indexing, dashboards, alerts, and reports.

What is most valuable?

The dashboards are the features of Splunk Observability Cloud that I appreciate the most, providing visual representation of all data and text. These features have benefited my organization by speeding up people's jobs, allowing a place to monitor all logs, as there are usually thousands of entries coming in which can become very disorderly. Users can monitor everything and write queries to organize the data and build dashboards to visualize it. This creates one-stop shops to get answers on how products and applications are performing, as opposed to having to jump onto servers and look through numerous logs.

What needs improvement?

The main improvement I would suggest for Splunk Observability Cloud would be offering the ability to implement custom apps, specifically allowing Python scripts that Splunk Cloud could host. Currently, we cannot create custom apps through Splunk Cloud. Additionally, continuous performance improvements for faster searching and indexing would be beneficial.

For how long have I used the solution?

I have been using Splunk Observability Cloud for over a year.

What do I think about the stability of the solution?

I would assess the stability and reliability of Splunk Observability Cloud as good. There have been some performance issues, though not necessarily crashes, occurring approximately 20% of the time or less.

What do I think about the scalability of the solution?

Splunk Observability Cloud scales smoothly with the growing needs of my organization. There have been some cases of performance loss due to rapid onboarding. We are handling multiple terabytes of data daily, so we expect some hiccups, but otherwise, it has scaled effectively for our fast-paced migration.

How are customer service and support?

My experience with customer service and technical support has been very present and super responsive. When we submit a case on Splunk support, they usually reach out within the same day or next day. They have consistently helped us resolve any issues we've encountered.

How would you rate customer service and support?

Positive

Which solution did I use previously and why did I switch?

I used Splunk Enterprise before adopting Splunk Observability Cloud. While other parts of the company were leveraging different logging tools, we primarily revolved around Splunk. When Splunk Cloud became available as the next option, we were ready to migrate.

How was the initial setup?

I haven't had personal experience with pricing, setup cost, and licensing as it's managed by our managerial side.

What was our ROI?

I have seen a return on investment with Splunk Observability Cloud through faster debugging and troubleshooting capabilities with enhanced observability. A significant return on investment comes from not having to host Splunk Enterprise ourselves. Having servers on Splunk's end allows us to focus more on development, monitoring, and our products, rather than maintaining our own local version of Splunk.

What other advice do I have?

I would rate Splunk Observability Cloud overall as a solution 9 out of 10.

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Google

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Last updated: Sep 11, 2025

Abdelmonam LABBOUZ

Splunk Observability Expert

Apr 26, 2025

Adopted global standards enhance data collection and simplify monitoring

Pros and Cons

"It's beneficial for monitoring performance and infrastructure, especially when deploying applications with multiple versions with Git."
"The solution overall is very valuable for me."

"Regarding dashboard customization, while Splunk has many dashboard building options, customers sometimes need to create specific dashboards, particularly for applicative metrics such as Java and process terms. These categories of dashboards would be very helpful for customers."
"I would rate Splunk technical support at six out of ten. When we have a problem and need to create a case, the response isn't quick."

What is our primary use case?

The solution involves observability in general, such as Application Performance Monitoring, and generally addresses digital applications, web applications, sites, and mobile applications. I worked with it in two companies: one in the energy sector and one in the hotel sector.

The Splunk teams helped us with data collection, instrumentation, and many other options.

How has it helped my organization?

The testing and monitoring of infrastructure is useful. We also use it for many metrics and can use it effectively for troubleshooting and for detection. It's very helpful.

What is most valuable?

With Splunk Observability Cloud, I appreciate working with OpenTelemetry. The OpenTelemetry standards are especially useful for collecting data such as traces, metrics, and logs. Splunk respects the OpenTelemetry standards, which is beneficial. Many clients work with AWS and the cloud in general with multiple solutions such as Datadog, Dynatrace, and Splunk, so working with the OpenTelemetry standard is very advantageous. Splunk Observability Cloud is very simple for users in general, including developers, DevOps, and data teams. It's more straightforward compared to Dynatrace.

There are many out-of-the-box solutions proposed by Splunk, such as dashboards for AWS instances, EC2, Fargate, and Lambda. It's very helpful for beginning, especially for monitoring, and the detectors for alerting help understand how the platforms work.

The no-sample feature is great. It eliminates blind spots.

After completing the instrumentations, we have many dashboards and tests for monitoring infrastructure, particularly CPU and memory. We also use applicative metrics such as JVM, Java Runtime, and many other applicative metrics and testing. For troubleshooting, we can detect problems in seconds, which is particularly helpful for digital teams.

AI analytics have the potential for a lot of functionality. The detectors for alerting may prove useful.

When we deploy the instrumentation in the application, we can start using the dashboards immediately. The dashboard building is very helpful for starting work.

It's beneficial for monitoring performance and infrastructure, especially when deploying applications with multiple versions with Git. It's important to detect performance issues, such as CPU consumption or memory consumption, particularly over time in Java and Python.

For other teams, they need help and guidance to use custom metrics. For observability engineers and specialists, it's straightforward, but for others, it can be challenging.

The solution overall is very valuable for me.

The time to value was immediate. Once we deployed, we started to use the dashboard directly and began detecting issues.

Saving time with automation can save us weeks. It's improving our resilience. It helps us detect issues and increase performance.

The solution has been very useful for helping us focus on business-critical initiatives.

What needs improvement?

Regarding dashboard customization, while Splunk has many dashboard building options, customers sometimes need to create specific dashboards, particularly for applicative metrics such as Java and process terms. These categories of dashboards would be very helpful for customers.

For how long have I used the solution?

I started working with Splunk Observability Cloud in 2023.

What do I think about the stability of the solution?

The system is relatively stable. We rarely have problems accessing the dashboard or the page. We encounter problems in the Splunk platform very rarely.

What do I think about the scalability of the solution?

It's very scalable. We haven't experienced any problems with the instrumentation or scalability. On a scale of one to ten, I'd rate it a ten.

We've used the solution across more than 250 people, including engineers.

How are customer service and support?

I would rate Splunk technical support at six out of ten.

When we have a problem and need to create a case, the response isn't quick. They often require multiple questions, with five or six emails to get a response. Problem resolution typically takes between two and five days, which isn't very helpful. However, sometimes we do receive quicker solutions.

How would you rate customer service and support?

Neutral

Which solution did I use previously and why did I switch?

We used legacy solutions such as Grafana and Prometheus. There are several differences between Splunk Observability Cloud and these solutions. We used Grafana as a monitoring solution, however, it's not truly observability. We used OpenSearch for logs, Prometheus for metrics, and Grafana to work with Prometheus. That said, it's not equivalent. Observability is different.

We're also familiar with Datadog and Dynatrace.

How was the initial setup?

The implementation took between two and three weeks.

For cloud deployment, it's straightforward. We can use GitLab and DevOps CI/CD. For on-premise deployment, such as Linux and deployment with satellite, it's easy yet requires some work to configure the configuration files.

Updates are generally needed, especially for the OpenTelemetry version or SDK. However, regarding the platform itself, we don't need to do anything.

What was our ROI?

I worked with my company when they used the solution, so I'm not certain about the history of how long it took to detect problems. However, for mean time to detect, and mean time to respond, I'm sure it's very helpful, and we can estimate a minimum improvement of 20%.

What other advice do I have?

We're a customer and end-user.

Currently, in France, we cannot use the artificial intelligence option. While this option is enabled for the United States and many countries, it's not yet available in France. However, the solution with detectors, especially for alerting, is important for us.

I recommend it, especially for teams using legacy monitoring.

I would rate Splunk Observability Cloud nine to ten out of ten.

Which deployment model are you using for this solution?

On-premises

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Ashutosh Parmar

Dev Ops Engineer at Veefin

Apr 19, 2026

AI-driven observability has reduced resolution times and improved real-time monitoring

Pros and Cons

"Splunk Observability Cloud is highly effective in improving digital resilience, as real-time visibility, proactive alerting, fast root cause analysis, distributed tracing, and AI-driven insights enable anomaly detection, which allows us to quickly understand failures, recover faster, maintain system availability, and handle failures in complex distributed environments by seeing how services interact and where breakdowns occur."

"I would say that it is quite helpful, but for different kinds of applications, it could be improved because sometimes it might provide a clouded judgment in its root cause analysis."

What is our primary use case?

I mostly work with the performance metrics of the CPU, or host metrics, as well as application metrics and traces. Overall, I use these mostly for real-time monitoring based on the application to track application performance.

For the monitoring of infrastructure, it is quite insightful because in-depth, I can see what is going on in the infrastructure. If something goes down or some crons fail inside the infrastructure, the alerts are quite helpful for more visibility on the cloud-native side.

This is quite helpful for improving the application observability and the infrastructure side as well. I would rate observability above an eight.

I am not very involved in the business side because I work as a DevOps engineer, so I do not know how much it helps on that front. However, it tracks traces and metrics quite well and helps us improve the application side for more reliability on the business side.

What is most valuable?

The AI-powered analytics are very helpful. They help us troubleshoot the application and give us more insightful information when investigating application error rates.

AI-powered guidance is really helpful because it provides more actionable insights and highlights anomalies automatically. I do not need to go through it manually, and it also helps us with smart alerting and recommendations.

It has helped operationally because the insights it provides into our applications show us where to enhance them further. It detects anomalies and correlates data while guiding us to the root causes, so we can enhance our application accordingly.

I have seen mean time to resolution reduced by around 30 to 50 percent. The main reason for this is the combination of real-time monitoring, AI-powered anomaly detection, and distributed tracing. Instead of manually checking the logs and metrics across multiple tools, the platform quickly highlights the issues, correlates data, and points us towards the root cause.
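To put the 30 to 50 percent figure in concrete terms, here is a minimal arithmetic sketch. The 120-minute baseline is a hypothetical number chosen only for illustration; the review does not state the actual baseline, and the function name `reduced_mttr` is likewise an assumption.

```python
def reduced_mttr(baseline_minutes: float, reduction_pct: float) -> float:
    """Return the mean time to resolution after a percentage reduction."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    # Multiply before dividing so whole-number inputs stay exact.
    return baseline_minutes * (100 - reduction_pct) / 100


# Hypothetical baseline: incidents used to take 120 minutes to resolve.
baseline = 120.0
after_30 = reduced_mttr(baseline, 30)  # 84.0 minutes
after_50 = reduced_mttr(baseline, 50)  # 60.0 minutes
print(f"MTTR after a 30-50% reduction: {after_50:.0f} to {after_30:.0f} minutes")
```

Under that assumed baseline, the reported range would correspond to resolving an incident in roughly 60 to 84 minutes instead of two hours.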

After implementing Splunk Observability Cloud, there was a steep learning curve for the new tool. It took one or two months to get proper insights from it. After configuring it, I have seen that it is very useful for tracking traces and metrics of our application, servers, and clusters. Adoption usually takes from a few weeks to about two months after getting Splunk Observability Cloud.

Splunk Observability Cloud is highly effective in improving digital resilience. Real-time visibility, proactive alerting, fast root cause analysis, distributed tracing, and AI-driven insights enable anomaly detection, which allows us to quickly understand failures and recover faster. This is critical for maintaining system availability and helps us handle failures in complex distributed environments since we can see how services interact and where breakdowns occur.

What needs improvement?

Regarding features, it helps us for better understanding of how the application works and in-depth tracking of application monitoring.

It could be enhanced further with additional AI capabilities. AI-driven guidance is increasingly useful, and more of it would improve reliability. There is room to improve on the AI side because it would help us reduce manual intervention with the system, and root cause analysis would be much better with AI than with human analysis alone.

I would say that it is quite helpful, but for different kinds of applications, it could be improved because sometimes it might provide a clouded judgment in its root cause analysis. In those cases I need a dedicated person to intervene manually to properly understand the root cause. This is how the agentic side can be improved.

For how long have I used the solution?

I have been working with Splunk Observability Cloud for around a year.

What do I think about the scalability of the solution?

It is quite scalable. Right now, it is providing much better insights and can be more enhanced over several aspects. I would rate scalability an eight to eight point five.

Which solution did I use previously and why did I switch?

I have tried other solutions, but they were not that great in terms of functionalities and overall performance. Splunk Observability Cloud is much better than the others because it provides AI alongside the solution. This is very helpful due to the AI-driven solutions and guidance for root cause analysis. Splunk Observability Cloud goes through the details of application traces and metrics in depth, so I get better observability over the application. This is why I have preferred Splunk Observability Cloud over other monitoring tools.

I have tried SignalFx, but it was not very insightful. I chose Splunk Observability Cloud over SignalFx.

What other advice do I have?

Splunk Observability Cloud is quite insightful and helpful for improving the observability side. I provide this solution an overall rating of eight.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Last updated: Apr 19, 2026

HrishikeshNavkar

Senior Software Engineer at WorldPay US

Feb 4, 2026

Metric-based monitoring has simplified alerting and currently supports our cloud migration

Pros and Cons

"Compared to Splunk Cloud or any other solution, the most valuable feature of Splunk Observability Cloud is that it is entirely based on metrics."

"We were facing some challenges with the stability of Splunk Observability Cloud regarding the login page. It was not working several times and was not accepting SSO authentication."

What is our primary use case?

Currently, we are in the process of migrating from on-premises to Splunk Cloud as well as Observability. Metric-based monitoring can be done via Observability, so we are migrating it there. We are setting up private locations to run synthetic tests, such as ping checks, port checks, and URL monitoring. The rest is metric-based monitoring, which is done with Splunk OTel, an OpenTelemetry agent for Observability. This agent brings metrics from end devices to Observability. Based on these metrics, we set detectors and rules to trigger alerts.
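The idea of setting a detector on incoming metrics can be illustrated with a small, generic sketch. This is not Splunk's detector API or rule syntax; the `Datapoint` type, the `detect_breaches` function, and the sample CPU values are all hypothetical, and the logic is just one common pattern: alert when a metric stays above a threshold for several consecutive datapoints.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Datapoint:
    timestamp: int  # illustrative time index
    value: float    # e.g. CPU utilization in percent


def detect_breaches(points: Iterable[Datapoint], threshold: float,
                    min_consecutive: int) -> List[int]:
    """Return the timestamps at which an alert would fire.

    An alert fires once when the value has been above `threshold` for
    `min_consecutive` datapoints in a row. It can fire again only after
    the value has dropped back to or below the threshold.
    """
    alerts: List[int] = []
    streak = 0
    for point in points:
        if point.value > threshold:
            streak += 1
            if streak == min_consecutive:
                alerts.append(point.timestamp)
        else:
            streak = 0
    return alerts


# Hypothetical CPU readings from one host.
cpu = [Datapoint(t, v) for t, v in enumerate([40, 92, 95, 97, 50, 91, 93, 96, 99])]
fired = detect_breaches(cpu, threshold=90.0, min_consecutive=3)
print(fired)  # [3, 7]
```

Requiring several consecutive breaches rather than a single one is a common way to avoid alerting on short spikes.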

Our observability is not yet live in production with Splunk Observability Cloud. It is currently being built, and we are adding new components, but it is not yet fully ready.

What is most valuable?

Compared to Splunk Cloud or any other solution, the most valuable feature of Splunk Observability Cloud is that it is entirely based on metrics. The agent is also very lightweight compared to Splunk UF and does not consume much compute resources on the end server or host from which we are pulling data. However, it can only monitor metrics and cannot monitor logs.

Regarding how Splunk Observability Cloud has benefited our organization, we are yet to go live, but most of the configuration that requires conditions and triggers on Splunk Cloud involves writing queries. With Splunk Observability Cloud, the process is quite simple. We can directly get metrics flowing, set thresholds, and everything is UI-based. This requires less time to set up and use. I do not have that much visibility with Splunk Observability Cloud at this time as I am working as an administrator. It has helped us create dashboards for visualization purposes.

What needs improvement?

There is one thing that could be improved in Splunk Observability Cloud. In Splunk, we can connect to Splunk agents such as Splunk forwarders from a deployment server and update the end agents and forwarders using server classes. We can push and update configurations from our own hosted servers without needing to access the end device. In Splunk Observability, the OTel agent cannot be updated from our end. Every time we need to update, we have to reach out to users or gain access to the host to update the configurations. There should be a way to update OTel agents from Splunk Observability Cloud itself.

For how long have I used the solution?

I have been working with Splunk Observability Cloud for approximately five to six months.

What do I think about the stability of the solution?

Splunk Observability Cloud is reliable based on my experience with stability and reliability so far.

We were facing some challenges with the stability of Splunk Observability Cloud regarding the login page. It was not working several times and was not accepting SSO authentication. The observability team found a solution for this issue, though I am not fully aware of the details. There were several times when opening the page did not directly log in and showed some errors.

What do I think about the scalability of the solution?

I have not encountered any scenarios regarding the scalability of Splunk Observability Cloud. It should be good because it is cloud-based. I am not aware of the licensing model and how it scales or what the rules are for scaling.

How are customer service and support?

I was not directly involved with technical support for Splunk Observability Cloud, but I am aware that my teammates reached out to support. They were running into issues with the configuration, installation, and deployment of Observability for specific components. Since Observability is cloud-based and hosted by Splunk, the components we own on-premises are the OTel gateways, agents, and private locations. They reached out to the vendor regarding these components, and the support was quite smooth. They have also raised some bugs for the vendor to fix. I would rate the technical support from Splunk an eight out of ten.

How would you rate customer service and support?

Positive

How was the initial setup?

Since it is cloud-based, Splunk Observability Cloud was ready to use upon deployment. The OTel gateways were built by our team and required configuration. I was not part of that process but am aware that we needed to configure the OTel gateways so that data is routed to them as an endpoint and from there ingested or forwarded to Observability. There were no significant issues with this process and it was quite smooth. However, private locations on a few gateways were quite difficult to set up and maintain because Docker was going down at times. Some issues were discussed with the Splunk vendor, who provided guidance on how to fix them.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Last updated: Feb 4, 2026

Ken Fillinger

Assistant Vice President & Software Engineer & Infraoperations- & Operations Developer at a financial services firm with 201-500 employees

Sep 11, 2025

RUM data has improved visibility into user paths and strengthened operational performance

Pros and Cons

"Splunk Observability Cloud has helped improve our operational performance and company's resilience, which was their initial offering."

"I've only experienced downtime, crashes, or performance issues when it isn't configured correctly."

What is our primary use case?

Our main use cases include synthetic monitoring, APM, RUM, alerting, detectors, dashboards, and all related functionality.

What is most valuable?

My favorite feature of Splunk Observability Cloud is the RUM because I like the RUM data. Splunk Observability Cloud has helped improve our operational performance and company's resilience, which was their initial offering. Without it, we wouldn't have any exposure and would be looking at raw data.

What needs improvement?

It can be improved through the integration of AI, which is either coming or already available.

For how long have I used the solution?

I've been using Splunk Observability Cloud for two and a half years.

What do I think about the stability of the solution?

I've only experienced downtime, crashes, or performance issues when it isn't configured correctly.

What do I think about the scalability of the solution?

Splunk Observability Cloud scales effectively with the growing needs of our organization; we simply need to pay more and ingest more data.

Which solution did I use previously and why did I switch?

Prior to this, I wasn't using another solution to address similar needs.

What was our ROI?

I've seen ROI with Splunk Observability Cloud, though I cannot specify the exact amount. We would be unable to track user access issues without it, which would result in significant losses.

What other advice do I have?

I use the cloud almost exclusively and am still learning some features. I handle synthetic monitoring, but don't manage all integrations or usage aspects. I need to explore the AI-powered analytics and guidance, as we haven't implemented it yet. The out-of-the-box customizable dashboards are effective because they contain all the necessary base components.

On a scale of one to 10, I would rate Splunk Observability Cloud as an eight. I appreciate the cloud because it provides more visibility into the user's path. It's quite good, though the observability aspect is somewhat complicated, primarily due to my limited experience with it.

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Last updated: Sep 11, 2025

Lakshmi Padaga

Java Developer at U.S. Bank

Sep 11, 2024

Correlates performance metrics with log data to pinpoint the exact cause of issues and offers error detection

Pros and Cons

"It supports proactive management, enhances security, and improves operational efficiency."

"There are always areas for potential improvement to enhance its functionality and user experience."

What is our primary use case?

We use Splunk in APM to monitor our applications. So, we integrated it into our systems to enhance our monitoring and observing capabilities, especially for our microservices.

So, we have used Splunk APM for this.

How has it helped my organization?

APM integrates well with Splunk’s other observability solutions. Combining logs with application performance monitoring benefits our business in several ways, such as troubleshooting and root cause analysis.

Using these logs with APM, we can correlate performance metrics with log data. This allows us to pinpoint the exact cause of issues, such as identifying specific errors in the logs. Having detailed logs alongside performance metrics enables quicker diagnosis and resolution of problems. It also helps us minimize downtime and improve system reliability.

Additionally, it improves our performance optimization by giving detailed insights from historical log data analyzed along with APM metrics. This allows us to understand long-term trends and make informed decisions about performance improvements and better user experience, such as error reduction and proactive monitoring.

Splunk has reduced our mean time to resolution by 30%.

If there is an issue in Splunk itself, we first identify it by looking for error messages, such as alerts in the Splunk user interface or entries in the logs that might indicate what the issue is, and then determine which part of Splunk is affected.

Then, we refer to the official Splunk documentation, check the system's health, and review the logs. By following these steps, we can resolve issues with Splunk and ensure that our monitoring and analytics capabilities remain effective.

What is most valuable?

Mainly, I like Splunk APM because it shows errors more clearly than other tools. We use the dashboards to monitor our applications. They tell us about the errors, and we can solve them quickly.

I have used APM but haven’t used Trace Analyzer, though I have some knowledge of it. We are able to implement it. We have some Trace Log Points in Splunk APM to catch the errors. We have a special graph for it where we can see the red points.

We use OpenTelemetry. OpenTelemetry and Splunk APM are similar in terms of observability and monitoring. We use it for observability standardization, which allows us to collect traces and metrics, making it easier to work with different monitoring tools, including Splunk APM. It is more flexible because it allows us to instrument our applications without being locked into a specific monitoring vendor.

It supports collecting traces, metrics, and logs from our applications, providing a comprehensive view of our performance and health endpoints. This data can be fed into Splunk APM, giving us in-depth analysis and insights about our application.

What needs improvement?

Splunk APM is a robust tool with many capabilities. There are always areas for potential improvement to enhance its functionality and user experience.

For Splunk APM, navigation could be simplified. Streamlining the user interface to make navigation more intuitive, especially for users new to APM, would enhance usability. Splunk could also provide more customization options for dashboards and visualizations to help users tailor the platform to their specific needs.

More integration capabilities with a wider range of third-party tools and platforms would also be beneficial. By focusing on these areas, Splunk APM can enhance its value proposition, improve user satisfaction, and better meet the evolving needs of organizations monitoring their application performance.

For how long have I used the solution?

I have been using it for a year.

What do I think about the stability of the solution?

I never had an issue with the stability. It worked fine.

Which solution did I use previously and why did I switch?

My team has used alternatives to Splunk APM, like Datadog and New Relic.

How was the initial setup?

The initial setup was easy. To fully deploy it, we had to add some SignalFx instrumentation to our applications and then deploy them. It took about 20 minutes. That’s it.

What about the implementation team?

We got help from our own team, my senior manager, and other teams across our company. We connected and did all this together.

One person can actually do the deployment, but as we are junior developers, we got help from our senior manager, so about three to four people were involved.

Splunk is good as it is now. I don’t think any changes are required, but there are regular updates and upgrades of Splunk APM, such as software updates and version upgrades.

These provide more powerful monitoring capabilities and help ensure the system remains reliable, secure, and aligned with organizational needs. Regular updates, performance tuning, and proactive management help in maximizing these benefits of the Splunk solution.

What was our ROI?

We’ll see the results after the deployment. It’s not that late, and that’s the reason we are using Splunk APM.

Splunk has made our job easier. It gives us the data points we need when we use any dashboard, and there are no delays or performance problems. It reports errors very clearly and monitors 24/7. It is very effective at showing issues, pinpointing their exact cause, and helping us troubleshoot them very fast.

It benefits IT staff in other teams, such as operations, by improving efficiency and helping them manage IT environments more effectively. Because it centralizes logs and search analytics, its powerful capabilities allow IT teams to perform in-depth troubleshooting, identify root causes, and analyze complex issues with ease.

Splunk also provides real-time visibility into IT infrastructure, and we have connected with cross-functional teams around our team to work with Splunk APM. It supports proactive management, enhances security, and improves operational efficiency. It facilitates better collaboration across the team.

What's my experience with pricing, setup cost, and licensing?

The pricing is based on several factors, including the scale of deployment. The pricing model typically includes considerations like the number of hosts, features, and capabilities.

What other advice do I have?

Overall, I would rate the solution a nine out of ten.

Which deployment model are you using for this solution?

Public Cloud

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Evan Torrie

Principal software architect at Verizon

Jul 9, 2024

Useful to find statistical similarities between different traces

Pros and Cons

"The product's deployment phase is good and very easy because it is done with OpenTelemetry for most of the parts."

"Once you see the issues related to the scalability part, you need to understand that it is a warning triangle. After seeing the warning triangle, you need to realize that you cannot trust any of the numbers you see in the chart because it is not a complete, full data set."

What is our primary use case?

I use the solution in my company primarily for distributed tracing and metrics troubleshooting. I use the tool to troubleshoot incidents and find the root cause of errors when something goes wrong. I also personally use it to get a developer's understanding of what is going on in my application. Sometimes you might add a library, or a new version of a library, to your application, and that library also makes calls somewhere. Splunk APM's monitoring can show you that there is a call you are making now that you never used to make with the prior version of the library. In these cases, which you may not notice just by looking at the application code from the outside, tracing captures everything, down to the lowest-level calls.

How has it helped my organization?

The main benefit of the tool I have noticed is reduced time for the resolution of incidents. It improves the mean time to resolve by helping pinpoint the root causes of issues, because you see the connections on the graph in Tag Spotlight. It is easier to pinpoint who is responsible for the incident, especially in a larger organization where teams write services that need to talk to services from other teams. Rather than having to hand off incident resolution from one team to another, you can often find the cause much more quickly from the first instance of the problem occurring with the product in place. The tool specifically helps your sites stay up more of the time, and when a site does go down, the solution helps find the root cause and get it back up as fast as possible.

What is most valuable?

The most valuable feature of the solution, and my favorite, is always Tag Spotlight, especially considering the way they slice and dice all of Splunk APM's traces by span attributes.

I like the tool because it looks at a whole set of traces in aggregate, which means that it can find statistical similarities between different traces. Often, you will find that some traces that show an error also share some other common attribute, which is much more apparent when you look at Tag Spotlight than when you just look at an overall metric. I like Tag Spotlight as it is one of the simplest features to use.
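The idea of slicing a set of traces by a span attribute to find what the failing ones have in common can be illustrated with a minimal sketch. This is not Splunk's implementation or API; the trace structure, the `tags` and `error` field names, and the sample data are all hypothetical and exist only to show the aggregation.

```python
from collections import Counter

def error_rate_by_tag(traces, tag):
    """Group traces by the value of one tag and return the error rate per value.

    Each trace is assumed (hypothetically) to be a dict with a "tags" dict
    and a boolean "error" flag.
    """
    totals = Counter()
    errors = Counter()
    for trace in traces:
        value = trace["tags"].get(tag, "<missing>")
        totals[value] += 1
        if trace["error"]:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}

# Hypothetical sample: errors only occur on version 1.5.
traces = [
    {"tags": {"version": "1.4", "region": "us-east"}, "error": False},
    {"tags": {"version": "1.5", "region": "us-east"}, "error": True},
    {"tags": {"version": "1.5", "region": "us-west"}, "error": True},
    {"tags": {"version": "1.4", "region": "us-west"}, "error": False},
    {"tags": {"version": "1.5", "region": "us-east"}, "error": False},
]

by_version = error_rate_by_tag(traces, "version")
by_region = error_rate_by_tag(traces, "region")
```

In this made-up data, the overall error rate alone says little, while grouping by `version` shows that every error comes from one version, which is the kind of commonality that is hard to see one trace at a time.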

The mean time to resolve, or MTTR, improves because the tool helps pinpoint the root causes of issues when you see the connections on the graph in Tag Spotlight. I don't personally have metrics associated with MTTR. I am more the implementer who makes certain that all the data is going in, and I look at the debugging part. I am not one of the people who keep track of MTTR.

In our company's case, we have reasonably good metrics related to the mean time to detect. I can't give even a rough number for the mean time to detect, so I don't know for sure. My guess is that we often detect problems reasonably well. Our company figures out that there is some problem, but we just don't know where it is, so I feel that if there is an improvement, it is mostly in the area of mean time to resolve. When it comes to the mean time to detect, I think our existing metrics are probably sufficient, and adding Splunk APM mainly makes it much easier to shorten the resolution time.

The tool has improved our organization's business resilience. In terms of resilience, the goal is to avoid downtime and keep things up and running. The faster you get web pages working again, the more people can actually do the things they want to do, such as trade players on their NFL Fantasy teams. In general, it gives a better business result.

What needs improvement?

In our company's case, we have some very high throughput services, so they might be getting 10,000 requests per second. Currently, Splunk APM and Splunk Observability want you to send every single span for every single one of those 10,000 requests per second. The process may give you all the data in the back end, but sending that much data to Splunk involves significant CPU, memory, and network costs. My feeling is that it would be nice if there were an easier way to send only a sample of my traces, say 10 percent or 5 percent, and then have Splunk extrapolate on the back end. It is obvious that with 10 percent of traces, the real metrics are something like ten times the sampled values, with a plus or minus margin of error. I am okay with that margin of error because I think when you have a high enough request rate, you will see such problems appear even in a smaller sample. The process is like political polling. You don't call all 150,000,000 people in the US and ask them who they are going to vote for. It is better to take a sample of maybe 10,000 and then extrapolate your findings to the rest. I feel the same should be applicable to tracing in Splunk APM.
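The arithmetic behind this suggestion can be sketched as follows. This is only an illustration of extrapolating from a uniform random sample, not a Splunk feature or API; the function name, its parameters, and the example numbers (1,000 sampled traces with 20 errors) are assumptions made for the example. It uses the standard normal-approximation margin of error for a proportion, the same one used in polling.

```python
import math

def extrapolate_from_sample(sampled_total, sampled_errors, sample_rate, z=1.96):
    """Estimate full-population figures from a uniform random sample of traces.

    sample_rate is the fraction of traces kept (0.10 for 10 percent).
    z=1.96 corresponds to an approximate 95 percent confidence level.
    """
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    if sampled_total <= 0:
        raise ValueError("sampled_total must be positive")
    error_rate = sampled_errors / sampled_total
    # Normal-approximation margin of error for the error rate.
    margin = z * math.sqrt(error_rate * (1 - error_rate) / sampled_total)
    estimated_total = sampled_total / sample_rate
    return {
        "estimated_total": estimated_total,
        "estimated_errors": estimated_total * error_rate,
        "error_rate": error_rate,
        "margin_of_error": margin,
    }

# Hypothetical second of traffic: 10,000 requests, 10 percent sampled,
# 20 of the 1,000 sampled traces contain an error.
result = extrapolate_from_sample(sampled_total=1000, sampled_errors=20, sample_rate=0.10)
```

With these assumed numbers, the estimate is about 10,000 requests and about 200 errors, with an error rate of 2 percent plus or minus roughly 0.9 percentage points. The margin shrinks as more sampled traces accumulate, which matches the point that high request rates make small samples informative.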

For how long have I used the solution?

I have been using Splunk APM for two years.

What do I think about the stability of the solution?

I really haven't noticed anything going wrong with the tool's stability, and I haven't seen any downtime. I don't know if my company is necessarily measuring the stability part by ourselves, but at least for me, it is a pretty growing and solid solution.

What do I think about the scalability of the solution?

There is one issue with the tool's scalability. In our company, we are fairly big in terms of the number of containers we have, especially since we can run very large clusters. When you look at some of the charts, they will say that the limit of 30,000 time series has been reached and no more can be shown, or that the data may not be complete. For me, it is a problem that I would like to see fixed. I have spoken to Splunk's team about it, and they have told me that they do recognize the issue and that other people have also mentioned the same problem. When you hit this scalability limit, the chart shows a warning triangle. After seeing the warning triangle, you need to realize that you cannot trust any of the numbers you see in the chart because it is not a complete, full data set. I want the tool to either tell me that it can't show me the numbers or find some way to show all the numbers in a more summarized view. The tool asks you to filter things down more, but it would be nice if it offered specific suggestions as to what you could filter on to get to a more reasonable number. In some cases, my company just has to have a number, considering that we have around 100,000 containers. If I want to know how many containers are running, the backend currently needs to know how many different time series there are, and then it just says that the 30,000 limit has been reached. When that happens, I don't know whether there are 100,000 containers, 120,000 containers, or 80,000 containers.

How are customer service and support?

The technical support team for the solution is good for our company. My company has a weekly meeting with Splunk's sales support team, and if there are any issues, we bring them up for discussion. I have seen that the technical support team is super responsive.

Which solution did I use previously and why did I switch?

My company has its own internal solution, which was built ten to fifteen years ago and has progressed over time, but it has only ever been used to support metrics and events, not tracing. In short, it is not used for APM-related work, so adding tracing with Splunk APM is a big change that makes a difference for us.

How was the initial setup?

The product's deployment phase is good and very easy because it is done with OpenTelemetry for the most part. The product's deployment is not some custom thing where you have to deploy a particular agent that belongs to a particular company and put it on every single host. It is very easy to follow OpenTelemetry's models for the most part. Splunk is a very big contributor to OpenTelemetry, and I value that. It is one of the reasons I recommend using Splunk as a backend provider. In my company, we would rather be an OpenTelemetry-compliant organization than be tied to other vendors.

What was our ROI?

I can't speak about the tool's ROI since I get paid, but I don't have to spend money on the product.

What's my experience with pricing, setup cost, and licensing?

I don't have much insight into the costs and licensing area attached to the tool. I am the engineer and developer, not the person who writes the checks in the company. I know that my company has a Splunk Enterprise Security license which is used for logging and even for Splunk Observability.

What other advice do I have?

I think the tool has the best trace aggregation features compared to what I have seen in different products, and I feel Tag Spotlight is a good example of it. A lot of the other products support tracing, but when you look at them, you see that they show one trace at a time. I can deep dive into one trace at a time, but what I want to find is commonality across the traces. I would give the tool a high grade for all its features. I also rate the tool highly because it offers a very good Kubernetes integration. With a lot of data, you can see which Kubernetes host something is running on, switch between hosts, and see the application metrics and the actual infrastructure metrics. Seeing it all together can be very useful.

I rate the tool a nine out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.