What is our primary use case?
My organization has 150 to 160 applications per year built on different frameworks, including .NET, Java, and Python-based applications. All of them are hosted on different types of servers such as Windows, Linux, ECS, and EKS. For these deployments, we integrated Splunk Observability Cloud. Previously, we used Prometheus and Grafana; my organization considered Splunk Observability Cloud the premium tier of observability, so they switched from our previous solution.
We use the tracing feature in Splunk Observability Cloud.
What is most valuable?
I appreciate the service map and APM in Splunk Observability Cloud the most; this is the main feature I value. The interface is completely UI-based, so I can see the complete service map, observe any latency present, and view complete metadata for a particular service or any database-related service. The service map gives a 3D view of the complete application architecture.
With respect to the effectiveness of Splunk Observability Cloud in improving digital resilience within the organization, it is quite similar to other third-party tools. The main distinction is that it has somewhat improved security. We use SignalFlow queries for alerting and dashboarding. I can say it provides efficiency with improved security compared to other third-party tools, but in terms of day-to-day usage, it is quite similar to Prometheus and Grafana.
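As a rough illustration of what those SignalFlow chart queries look like, here is a minimal sketch; the metric and dimension names (cpu.utilization, host) are placeholders and will differ per environment:

```python
# Minimal SignalFlow sketch for a dashboard chart (metric and dimension names are placeholders).
# Pull a metric stream, average it per host, and publish it so the chart can render it.
cpu = data('cpu.utilization', filter=filter('host', 'web-01')).mean(by=['host'])
cpu.publish(label='CPU utilization per host')
```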
What needs improvement?
I want to address a disadvantage: the service map shows misinformation with respect to latency, which comes down to the reliability of data pulled from the AWS cloud or on-premises servers. We saw issues with latency because the Splunk APM application shows different data than Prometheus and Grafana. We engaged premium support and on-call support with Splunk, and they were helpful in troubleshooting, but they ended up with no solution.
Performance with Splunk Observability Cloud is acceptable to me, but the modifications required of users are problematic. I had to build the complete alerting and monitoring system, and then it had to be changed. The way they designed this is not optimal. If I compare with Prometheus, we can import and export dashboards there, but here we face errors in the dialog boxes. We raised this on technical support calls, but they were unable to solve it, so I do not understand why exports and imports are not functioning.
My overall impression of the NoSample tracing feature in Splunk Observability Cloud, specifically in terms of eliminating blind spots in data collection, is that it needs improvement because the data is not adequate compared to other third-party tools. We get disturbances in the dashboards and charts while trying to correlate data. The mechanism behaves differently when configured manually than it does with a SignalFlow query, and both should be equal. We are unable to replicate the manual process in the automated method, which is the issue.
The SignalFlow query feature in Splunk Observability Cloud needs improvement because it should behave the same as the manual process. When we configure queries manually and then configure the same queries via SignalFlow, they give different outputs. We raised this with on-call support, but they were unable to address it, which indicates there is a bug in the queries that needs to be fixed.
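For context, this is the kind of SignalFlow detector we configure when we move an alert from the manual UI workflow into a query; the metric name, service filter, and threshold are placeholders rather than our production values:

```python
# Sketch of a SignalFlow detector equivalent to a manually configured alert (placeholder names and values).
latency = data('service.request.duration.ns.p99', filter=filter('sf_service', 'checkout-api')).mean()
latency.publish(label='p99 latency (ns)')

# Fire when the p99 latency stays above 500 ms (expressed in nanoseconds) for 5 minutes.
detect(when(latency > 500000000, lasting='5m')).publish('High p99 latency')
```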
For enhancements, I would like to see improvements in the OTel agents, OTel collectors, and related features in Splunk Observability Cloud. The guidelines in the official documentation often do not work, so we have to figure out our own deployment process. The documentation works in only about 60 percent of the conditions, leaving the remaining 40 percent problematic and in need of improvement.
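To give a concrete sense of what we end up deploying ourselves, here is a trimmed sketch of an OpenTelemetry Collector configuration sending metrics and traces to Splunk Observability Cloud; the realm, token variable, and exporter selection are assumptions and vary by collector distribution and version:

```yaml
# Sketch of an OpenTelemetry Collector config for Splunk Observability Cloud.
# Realm, access token variable, and exporter choice are placeholders/assumptions.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  signalfx:                      # metrics to Splunk Infrastructure Monitoring
    access_token: "${SPLUNK_ACCESS_TOKEN}"
    realm: "us0"
  sapm:                          # traces to Splunk APM
    access_token: "${SPLUNK_ACCESS_TOKEN}"
    endpoint: "https://ingest.us0.signalfx.com/v2/trace"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [signalfx]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [sapm]
```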
For how long have I used the solution?
I have used Splunk Observability Cloud for nearly one to one and a half years.
What do I think about the stability of the solution?
I experienced downtime with Splunk Observability Cloud one time. We were unable to access it for nearly one day, and it took a long time to resolve. Other tools normally do not take that long, and I do not understand why Splunk did. From the vendor's end, they should address such issues in a much shorter timeframe. When downtime occurs, it raises concerns about how we measure and receive alerts, as everything needs to be in place.
What do I think about the scalability of the solution?
In terms of lowering the cost of unplanned digital downtime with Splunk Observability Cloud, many users report that it is expensive, especially at large scale, which can be a concern for organizations with tight budgets. At large scale it is good, but for start-ups and some mid-sized companies it is expensive and they cannot afford it, especially as the cost increases with data volume and retention needs.
How are customer service and support?
Support-wise, there are a few kinds of support for Splunk Observability Cloud: bi-weekly calls, on-call support, and premium support. They need to decrease the price of premium on-call support because, as employees, we require credits to get premium support, and our organization does not have many credits. That is a point where it lagged, but with respect to the bi-weekly calls and on-call support, it was acceptable. Out of five, I can give three for normal support and four for premium call support.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
Previously, we used Prometheus and Grafana.
Which other solutions did I evaluate?
In comparing Splunk Observability Cloud to other observability platforms I have worked with, I find no key differences in either the pros or the cons. The integration process is the same across the board, and I do not feel there is a real differentiator, as everything is similar in terms of custom dashboards and APM features.
What other advice do I have?
We miss the synthetic monitoring and AI-related features in Splunk Observability Cloud; synthetic monitoring, as I understand it, covers front-end monitoring. We use only the core AWS monitoring, the service map, and APM.
Regarding the ability to enrich data with custom metrics in Splunk Observability Cloud, we configured our breach thresholds based on application performance only. Every application has different SLAs and SLOs, and for each application we configured alerts against baselines that trigger accordingly. We correlate this with multiple factors, such as Java memory leaks or garbage collection, and we generate custom metrics with alerts for notification purposes, using the webhook URLs of Microsoft Teams and Outlook.
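For the notification step, the pattern is a simple HTTP POST to a Teams incoming webhook; this is a minimal sketch with a hypothetical webhook URL and illustrative alert fields:

```python
import requests

# Hypothetical webhook URL; in practice this comes from the Teams channel's incoming-webhook connector.
TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/webhookb2/placeholder"

def notify_teams(service: str, metric: str, value: float, threshold: float) -> None:
    """Post a plain-text alert message to a Microsoft Teams incoming webhook."""
    payload = {
        "text": f"Alert: {metric} on {service} is {value:.2f}, breaching the SLO threshold of {threshold:.2f}."
    }
    response = requests.post(TEAMS_WEBHOOK_URL, json=payload, timeout=10)
    response.raise_for_status()

# Example: a p99 latency breach on a Java service (illustrative values only).
notify_teams("payments-api", "p99 latency (ms)", 820.0, 500.0)
```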
The out-of-the-box customizable dashboards provided by Splunk Observability Cloud are effective in showcasing IT performance to business leaders. They work nicely: when we correlate different charts, I get many x-axis and y-axis options, and we can correlate with other metrics. We have formulas there to compute ratios and averages, which was a nice experience with so many options. We use the f(x) functions for maximum, minimum, and averages, which are quite good.
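As an example of the kind of ratio formula we build in charts, here is a sketch in SignalFlow; the metric and dimension names are placeholders:

```python
# Sketch of a chart formula: error rate as a percentage (placeholder metric and dimension names).
errors = data('request.count', filter=filter('status', 'error')).sum()
total = data('request.count').sum()
(errors / total * 100).publish(label='Error rate (%)')
```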
On a scale of one to ten where ten is the best, I would rate Splunk Observability Cloud differently by area. For the UI, I would rate it an eight, but for configuration, I would rate it a three or four, as the configuration and integration aspects are not good at all. Overall, I would rate Splunk Observability Cloud a three out of ten.
Which deployment model are you using for this solution?
On-premises
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.