What is our primary use case?
The primary reason I am using Gremlin Reliability Management Platform is to proactively test failures, identify weaknesses in my system, and fix them before real incidents actually occur.
Proactive failure testing with Gremlin Reliability Management Platform helped me get past the fear of breaking things in production. My teams no longer hesitate to push code into production, given the safety nets we have in place. Additionally, our test program keeps the blast radius tightly controlled, which minimizes the impact on users, and it has increased the observability of our platform as well.
Among the platform's features, I rely on dependency mapping and dependency testing most often in my day-to-day work. Dependency mapping automatically discovers service dependencies, and dependency testing covers scenarios such as dependency failures and increased latency. In a microservices architecture like mine, these failures cascade very easily. Using Gremlin Reliability Management Platform, I was able to ensure that cascading failures across microservices are contained through dependency mapping and dependency testing.
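To illustrate why dependency failures cascade, and what containing them means, here is a minimal sketch in plain Python. It does not use Gremlin's API; the service names, the `DEPENDENCIES` graph, and the `contained` parameter are all hypothetical and exist only for illustration.

```python
# Hypothetical service graph: each service lists the services it calls.
DEPENDENCIES = {
    "checkout": ["payments", "inventory"],
    "payments": ["fraud-check"],
    "inventory": [],
    "fraud-check": [],
}


def impacted_services(failed, graph=DEPENDENCIES, contained=frozenset()):
    """Return the services that fail, directly or transitively, when `failed` fails.

    Services listed in `contained` are assumed to degrade gracefully
    (for example through a timeout or fallback), so they neither fail
    nor pass the failure on to their callers.
    """
    impacted = set()
    changed = True
    while changed:
        changed = False
        for service, deps in graph.items():
            if service in impacted or service in contained:
                continue
            # A service is impacted if it calls the failed service
            # or any service that is already impacted.
            if failed in deps or impacted.intersection(deps):
                impacted.add(service)
                changed = True
    return impacted


# Without containment, the failure reaches every upstream caller.
print(sorted(impacted_services("fraud-check")))
# With a fallback in "payments", the cascade stops there.
print(sorted(impacted_services("fraud-check", contained={"payments"})))
```

In this toy model, a dependency test amounts to checking that the second call returns an empty set, meaning the fallback actually stops the cascade.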
What is most valuable?
The features that Gremlin Reliability Management Platform offers are comprehensive. The top five I can identify are the standardized reliability test suite, automated testing and scheduling, the reliability score (the key differentiator), automatic risk detection, and dependency mapping and testing. In my view, these are the greatest unique selling propositions of Gremlin Reliability Management Platform.
One more feature of Gremlin Reliability Management Platform worth adding is safe fault injection. A core principle of chaos engineering is injecting failures safely into a controlled environment and verifying that the systems keep working without issues. Gremlin Reliability Management Platform automatically tests these conditions and checks system health, which is an excellent way to put the core chaos engineering principles into practice.
Gremlin Reliability Management Platform has positively impacted my organization by giving confidence to our DevOps engineers and SRE team. Whatever code the development teams push has been rigorously tested before it goes into production. Team members can rely on our code check-ins, and the DevOps engineers can push code into production with the safety nets we have in place. We now have continuous automation, continuous validation, automated test execution, and traceability of changes, along with measurable reliability evidenced by dashboards and scores. More than anything, we fix failures before they occur, which is proactive risk detection and risk mitigation. These are the key benefits we were able to offer to our customers, development partners, and testing partners.
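The idea of safe fault injection can be sketched as a loop that raises the fault magnitude step by step and halts as soon as a health check fails. This is a plain Python illustration under stated assumptions, not Gremlin's API: `run_experiment`, `toy_error_rate`, and the 5% error-rate threshold are all hypothetical.

```python
def run_experiment(latencies_ms, observe_error_rate, max_error_rate=0.05):
    """Inject increasing latency; halt as soon as the health check fails.

    `observe_error_rate` stands in for a real monitoring query: it takes
    the injected latency in milliseconds and returns the observed error rate.
    """
    completed = []
    for latency in latencies_ms:
        error_rate = observe_error_rate(latency)  # inject, then observe
        if error_rate > max_error_rate:
            # Health check failed: stop here so the fault can be rolled back
            # instead of escalating to the next, larger step.
            return {"halted_at": latency, "completed": completed}
        completed.append(latency)
    return {"halted_at": None, "completed": completed}


def toy_error_rate(latency_ms):
    """Toy model (an assumption): error rate grows linearly with latency."""
    return latency_ms / 10_000


result = run_experiment([100, 300, 600, 1000], toy_error_rate)
print(result)
```

With this toy model, the 600 ms step pushes the error rate above the threshold, so the experiment stops there and the 1000 ms step is never injected.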
What needs improvement?
There are certain areas where I think Gremlin Reliability Management Platform can improve. I would add AI and GenAI features for recommendations. While dependency identification works seamlessly, deeper dependency intelligence is lacking: when dependencies run several levels deep, the platform can struggle to identify and analyze them. The ability to analyze dependencies in depth would be very helpful. The reliability scores are good, but I would recommend making them more actionable.
When I started with Gremlin, I completed two of its certifications: one professional and one practitioner. While those certifications helped me, onboarding could be easier and the learning curve gentler for people joining the program. That is one current gap I see.
Cost-benefit visibility is not very evident in Gremlin Reliability Management Platform. I would like the platform to make the cost versus the benefit visible, and this gap is the reason I give a score of less than ten. When it comes to simulating complex real-world failures, I believe there is still a gap as well. Currently, Gremlin Reliability Management Platform mainly focuses on infrastructure-level failures and does not really simulate business logic failures, data corruption scenarios, or potential failures across regions.
Another key area needing improvement is collaboration: in my organization, I see only limited integration between Gremlin Reliability Management Platform and team workflows.
For how long have I used the solution?
I have been using Gremlin Reliability Management Platform for over twelve months.
What do I think about the stability of the solution?
Gremlin Reliability Management Platform is stable.
What do I think about the scalability of the solution?
Most of our customers typically use AWS or GCP when using Gremlin Reliability Management Platform.
How are customer service and support?
The customer support for Gremlin Reliability Management Platform is good. There is a wealth of online material available for us to reference, and we primarily rely on raising tickets for additional support. Gremlin offers different subscription models, and I would highlight the expert partnership model as a significant strength.
Which solution did I use previously and why did I switch?
Some of my customers previously used different tools for observability along with their own custom in-house chaos engineering platforms. Those were not very effective, and the customers were not satisfied with their in-house solutions. That is why they awarded the project to Coforge to help them transition to Gremlin Reliability Management Platform, which has delivered substantial benefits. One of our customers was not using any existing system at all, so we proposed Gremlin Reliability Management Platform.
What about the implementation team?
Most of these reliability management platforms are purchased directly by our customers' organizations. As a service provider, Coforge did not procure any licenses for Gremlin Reliability Management Platform.
What's my experience with pricing, setup cost, and licensing?
In terms of pricing, I would say Gremlin Reliability Management Platform is a bit expensive, but the expense is worth it given the benefits it offers. The challenge lies in convincing customers about the cost and benefit, which is not clearly presented in the system. This lack of visibility in the form of dashboards is a key challenge for me.
If we consider cost versus perceived value, Gremlin Reliability Management Platform is certainly useful for large-scale systems. Companies bound by uptime or downtime SLAs, in particular, cannot afford to lose time to infrastructure failures. Such companies would not mind investing, because the perceived value relative to cost is very high for them.
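To make the SLA point concrete, the downtime an availability target allows can be worked out with simple arithmetic. The sketch below is only an illustration: the availability targets and the 30-day month are assumptions, not figures from my customers.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Minutes of downtime allowed in the period for a given availability target.

    `availability` is a fraction, e.g. 0.999 for "three nines".
    """
    return round((1 - availability) * period_minutes, 2)


three_nines = downtime_budget_minutes(0.999)
four_nines = downtime_budget_minutes(0.9999)
print(three_nines, four_nines)

# A single 60-minute outage already exceeds a 99.9% monthly budget.
print(60 > three_nines)
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per month and a 99.99% target leaves roughly 4, which is why a single avoided outage can weigh heavily in the cost-versus-value discussion.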
Which other solutions did I evaluate?
I did not evaluate any other options before choosing Gremlin Reliability Management Platform because I already hold two certifications from Gremlin. I considered Gremlin Reliability Management Platform my first choice and did not compare it with any other tools.
What other advice do I have?
I would recommend Gremlin Reliability Management Platform to others without a second thought. However, I would not recommend jumping straight into production chaos. My advice is to start small, define clear reliability goals first, and follow the right protocols. Invest in observability first, then assign clear ownership to each owner through a RACI matrix. Finally, integrate the platform into your workflows. That is the best advice I can offer to newcomers entering a reliability management program.
As a final takeaway, using Gremlin Reliability Management Platform successfully means shifting your mindset from fixing problems when they happen to continuously proving that your systems can handle various types of failures.
I would rate my overall experience with Gremlin Reliability Management Platform a nine out of ten.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Other