Gremlin Reliability Management Platform Reviews

Name: Gremlin Reliability Management Platform
Brand: Gremlin
Rating: 4.3 (8 reviews)

Vendor: Gremlin

87% willing to recommend

What is Gremlin Reliability Management Platform?

Gremlin Reliability Management Platform empowers organizations to proactively identify and mitigate potential failures. It enhances system resilience through controlled chaos engineering, aiding tech teams in delivering reliable services.

Featured Gremlin Reliability Management Platform reviews

Varun Lellapalli

Senior Software Engineer at a sports company with 10,001+ employees

The best features of Gremlin Reliability Management Platform are safe failure injection and the controlled blast radius. Safe failure injection is crucial because we can simulate failures knowing that they are only tests and not actual issues, whether that is a CPU spike, memory exhaustion, network latency, or a server shutdown. Server shutdown is one of my favorite features in Gremlin Reliability Management Platform.

The controlled blast radius is another standout feature. It helped my team because we wanted to target only one specific container among the Docker containers that we deployed. It let us conduct tests in a very specific, isolated manner instead of launching a larger test against hundreds of servers at a time, so the impact stayed very limited. Since ours is a very small team, we do not want to impact other servers. The controlled blast radius let us focus only on our own servers without affecting any other team.

Gremlin Reliability Management Platform has positively impacted my organization because, before we had it, we did not know how to conduct chaos engineering tests. We had heard about them, but we had no idea how to do something of that nature. If there were ten systems in our architecture and one suddenly went down, nobody knew what would happen next, and we did not know how to simulate that kind of failure. Gremlin Reliability Management Platform has addressed this lack of confidence. Now we can confidently test and see which system is the most critical. If this goes down, what happens? How much business value are we going to impact? How much loss are we going to incur? All of this is now clearly visible and transparent. Since using Gremlin Reliability Management Platform, we have reduced incidents by six percent after conducting our limited experiments. We also increased uptime from ninety-eight to ninety-nine percent, a gain of one percentage point.

Ravi Konduru

VP Global at a tech vendor with 10,001+ employees

There are certain areas where I think Gremlin Reliability Management Platform can improve. I would add AI and GenAI features for recommendations. While dependency identification works seamlessly, deeper dependency intelligence is lacking: when dependencies run several layers deep, the platform can struggle to identify them, so the ability to analyze dependencies in depth would be very helpful. The reliability scores are good, but I would recommend making them more actionable.

When I started with the Gremlin certification programs, I completed two certifications, one professional and one practitioner. While the certifications helped me, onboarding and the learning curve could be made easier for people new to the program. That is one current gap I see.

Cost-benefit visibility is not very evident when using Gremlin Reliability Management Platform. It would help if the platform made the cost versus the benefit visible; that is the reason I give a score of less than ten. In terms of simulating complex real-world failures, I believe there is still a gap. Currently, Gremlin Reliability Management Platform mainly focuses on infrastructure-level failures and does not really simulate business logic failures, data corruption scenarios, or potential failures across regions.

Another key area for improvement is the limited team workflow integration I see in my organization. Collaboration and team workflow integration are the areas I would highlight.

Vinaykumar Vishwakarma

DEVOPS specialist at a media company with 10,001+ employees

I think Gremlin Reliability Management Platform can be improved by integrating with more AWS and GCP services. We need more services and more prebuilt plugins, especially for stress testing. I would also like to see it integrated with machine learning, particularly natural language processing, so that it is easier for non-technical people to use. With a natural-language interface, could we talk to Gremlin Reliability Management Platform and have it configure some of the basic settings? Even a QA person should be able to set it up without needing any DevOps or cloud expertise.

Gremlin Reliability Management Platform mindshare

As of June 2026, the mindshare of Gremlin Reliability Management Platform in the Application Performance Monitoring (APM) and Observability category stands at 0.2%, according to calculations based on PeerSpot user engagement data.

Application Performance Monitoring (APM) and Observability Mindshare Distribution
Product	Mindshare (%)
Gremlin Reliability Management Platform	0.2%
Dynatrace	5.3%
Datadog	4.6%
Other	89.9%

PeerResearch reports based on Gremlin Reliability Management Platform reviews

Type	Title	Date
Category	Application Performance Monitoring (APM) and Observability	Jun 27, 2026
Product	Reviews, tips, and advice from real users	Jun 27, 2026
Comparison	Gremlin Reliability Management Platform vs Datadog	Jun 27, 2026
Comparison	Gremlin Reliability Management Platform vs Dynatrace	Jun 27, 2026
Comparison	Gremlin Reliability Management Platform vs Splunk AppDynamics	Jun 27, 2026

Key learnings from peers

Last updated May 17, 2026

Valuable Features

The most valuable features of Gremlin Reliability Management Platform are standardized reliability tests, automated testing and scheduling, safe fault injection, automatic risk detection, and dependency mapping. Users highlight chaos engineering, failure simulation, and controlled blast radius as key strengths. The platform enhances team confidence, reduces downtime, and increases efficiency and reliability. Offering comprehensive dashboards and flexibility, it helps users test and improve system performance, identify weaknesses, and proactively mitigate risks.

"Gremlin Reliability Management Platform has positively impacted our organization by making outages less frequent and improving recovery time significantly, resulting in fewer complaints on the customer success side and overall optimization of our DevOps process."
"More than anything, we fix failures even before they occur, which is basically proactive risk detection and risk mitigation."
"Gremlin Reliability Management Platform has impacted my organization positively as it helped a lot and reduced our failures, allowing us to find critical pinpoints in our application that had existed for three to ten months and led to too many improvements, reduced downtime, and a smoother experience for our application on AWS."

Room for Improvement

Gremlin Reliability Management Platform could enhance AI-driven dependency analysis, improve cost-benefit visibility, and integrate better with AWS, GCP, and machine learning. Users find the onboarding process challenging and desire improved team collaboration features. The platform could benefit from free learning resources and pay-as-you-go pricing. Expanding simulation capabilities beyond infrastructure failures and adding open-source features may attract more users. Enhancing the UI for developers and introducing NLP integration for non-technical users would be beneficial.

"I rate it an eight because we are still using it on a trial and error basis, and the pricing could be optimized for better cost visibility and ROI tracking."
"If you really look at the cost-benefit visibility, it is not very evident by using Gremlin Reliability Management Platform."
"Gremlin Reliability Management Platform can be improved as the pricing is a bit expensive and the learning curve for beginners is a bit difficult."

ROI

Users reported a return on investment with Gremlin Reliability Management Platform: fewer employees were needed to conduct tests, significant time was saved in identifying errors, and on-call and CX operations costs fell because of fewer complaints. Downtime and failure rates decreased notably, leading to fewer production issues. During chaos engineering experiments, users saved time because they could experiment directly in a production environment without extensive metric analysis.

Pricing

Enterprise buyers find Gremlin Reliability Management Platform pricing slightly high, yet justified by its benefits, particularly for large-scale systems with essential SLAs. The lack of visibility via dashboards can complicate cost-benefit comprehension. Although not cheap, its robust features and value are recognized, especially when companies take ownership of setup costs. Cost experiences vary, with some companies benefiting from non-financially restrictive licenses.

Popular Use Cases

Organizations use Gremlin Reliability Management Platform for chaos engineering to test system resilience under stress, simulate failures, and identify infrastructure weaknesses. It supports dependency mapping, CPU spike simulation, network latency testing, and Kubernetes failure analysis. Users conduct experiments on AWS, GCP, and identify auto-scaling issues. It assists in monitoring, improving reliability, and ensuring application robustness. Organizations gain insights into failure mechanisms and receive reliability scores, enhancing system performance before incidents occur.

Service and Support

Gremlin Reliability Management Platform's customer service is praised for being responsive and helpful. Users appreciate the extensive online resources and the effectiveness of Zoom meetings for troubleshooting. Many find the platform stable enough that direct interaction with support is rare. Different subscription models are available, and the expert partnership model stands out. Ratings consistently place the support team highly, with scores of eight to ten out of ten indicating satisfaction with the service provided.

Scalability

Users find Gremlin Reliability Management Platform scalable, praising its seamless workload management and effective handling of larger teams and services. While its scalability relies on the infrastructure, experiences have been positive. It scales smoothly for chaos experiments, supports sizable workloads, and equips teams with governance mechanisms essential for chaos engineering. This ease contributes to a strong culture within developer operations. With AWS or GCP, the platform expands efficiently across interconnected service dependencies.

Stability

Gremlin Reliability Management Platform demonstrates consistent stability without downtime or performance issues. Its availability is commendable, maintaining reliability across operations. Users widely affirm its stable behavior, with no negative occurrences reported.

These insights are based on the in-depth reviews provided by peers to help you make a better buying decision.

Review data by company size

By reviewers
Company Size	Count
Small Business	3
Large Enterprise	5

By visitors reading reviews
Company Size	Count
Small Business	45
Midsize Enterprise	17
Large Enterprise	30

Top industries

By visitors reading reviews
Industry	Share
Construction Company	13%
Printing Company	10%

Other industries represented (shares not shown): Financial Services Firm, Sports Company, Media Company, Real Estate/Law Firm, Healthcare Company, Outsourcing Company, Comms Service Provider, Manufacturing Company, University, Educational Organization, Computer Software Company, Government, Insurance Company, Wholesaler/Distributor, Marketing Services Firm, Performing Arts, Recreational Facilities/Services Company, Hospitality Company, Consumer Goods Company.

Learn more about Gremlin Reliability Management Platform

Designed for tech-savvy users, Gremlin enables teams to implement chaos engineering effectively to ensure system reliability. It offers precise control over variables, allowing teams to simulate real-world scenarios and fortify system operations. Gremlin plays a strategic role in preventing downtime and maintaining optimal service delivery through a suite of advanced tools tailored for IT infrastructure.

What are the most important features of Gremlin?

Attack Library: Offers diverse failure scenarios for comprehensive testing.
Security Control: Ensures safe execution of tests with access restrictions.
Detailed Reporting: Provides insights into system weaknesses and improvements.
API Access: Facilitates automation and integration with existing systems.

What benefits should users look for in reviews?

Increased Uptime: Improved system availability through proactive testing.
Cost Efficiency: Reduced need for corrective measures post-failure.
Team Collaboration: Enhances coordination among IT and operations teams.
Product Reliability: More robust and reliable service delivery to clients.

In industries such as e-commerce, finance, and healthcare, Gremlin helps maintain service reliability by identifying vulnerabilities before they affect operations. IT teams can simulate stress tests specific to their industry, ensuring systems are resilient against potential threats, enhancing customer satisfaction, and securing business continuity.

Product Categories

Application Performance Monitoring (APM) and Observability

IT Infrastructure Monitoring

DevSecOps

Gremlin Reliability Management Platform Reviews Summary
Author info	Rating	Review Summary
Senior Software Engineer at a sports company with 10,001+ employees	4.5	I use Gremlin for chaos engineering, significantly boosting confidence, increasing uptime, and providing strong ROI through failure injection. Despite its expense and learning curve, the stable platform is impactful, making it a valuable 9/10 tool.
VP Global at a tech vendor with 10,001+ employees	4.5	I use Gremlin to proactively test failures and improve system reliability, especially with dependency mapping and safe fault injection. While it boosts confidence and offers great features, better cost-benefit visibility and deeper dependency intelligence are needed.
DEVOPS specialist at a media company with 10,001+ employees	4.0	I use Gremlin for chaos engineering on Kubernetes, valuing its prebuilt tests and automated scheduling. It boosted our production confidence, cutting issues by 30%. I seek more cloud integrations and AI for broader usability, giving it an 8/10.
Dev Ops To Development (IT) at a non-tech company with self employed	4.5	I use Gremlin for chaos testing, boosting infrastructure reliability over 50%. I value its flexibility and dashboard but desire more free learning resources and better integration with observability platforms like Splunk.
DevOps & Mlops Engineer at a printing company with 1-10 employees	5.0	I use Gremlin for Chaos Engineering, leveraging its built-in experiments to get reliability scores and insights for my web services. It's stable, scalable, and helps save time, though I wish it had open-source features. The support is great.
Documentation Engineer at a tech vendor with 1,001-5,000 employees	4.0	I find Gremlin excellent for simulating extreme stress to test application resilience, identify weaknesses, and reduce outages. While it significantly improved our recovery time and reduced downtime, I believe the UI and pricing could be optimized.
Site Reliability Engineer at a tech services company with 10,001+ employees	3.0	I use the Enterprise Reliability Platform to maintain reliability, finding it significantly increased efficiency and reliability. My organization measured improvements in latency and SLOs, and I have no recommendations for improvement after one year of use.
Performance Test Engineer at an educational organization with 51-200 employees	5.0	I use Gremlin for chaos engineering on AWS Kubernetes, finding critical failures and reducing downtime. Its flexibility, ease of use, and templates delivered significant ROI, proving more effective than Amazon FIS. I highly recommend it.

Varun Lellapalli

Senior Software Engineer at a sports company with 10,001+ employees

Mar 11, 2026

Chaos experiments have revealed weak points and now provide controlled cost-saving tests

What is our primary use case?

My main use case for Gremlin Reliability Management Platform is chaos engineering; Gremlin helped us a lot in orchestrating those tests better.

One specific example of a chaos engineering test I have run is simulating a CPU spike on one of our servers. It was hard for us to produce a CPU spike on demand in production, and Gremlin let us do that.

Many other experiments have also helped us. Auto-scaling was one thing we wanted to observe. It was difficult for us to experiment with different auto-scaling strategies based on CPU utilization and to confirm that they would also scale down automatically. We wanted to see it happen live because it correlates directly with the cost of our cloud services. Using Gremlin Reliability Management Platform, we launched CPU spikes and then intentionally reduced the utilization of an API, and we were able to watch the auto-scaling go up and down. It helped us save a lot of cost and select the right instances.

What is most valuable?

The controlled blast radius feature has helped my team in that we actually wanted to target only one specific container, our Docker containers that we deployed. It helped us to conduct tests in a very specific, isolated manner instead of launching a larger test or focusing on hundreds of servers at a time, resulting in very limited impact. Since ours is a very small team, we do not want to impact other servers. This controlled blast radius helped us to only focus on our servers and not impact any other team.
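The idea of a controlled blast radius can be illustrated with a minimal sketch. This is not Gremlin's API; the `Container` model, the label names, and the `select_targets` helper are all hypothetical, and the only assumption is that targets are chosen by matching labels with a hard cap on how many may be hit.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A running container and the labels attached to it (hypothetical model)."""
    name: str
    labels: dict[str, str] = field(default_factory=dict)


def select_targets(containers, required_labels, max_targets=1):
    """Return containers matching every required label, refusing to exceed max_targets."""
    matches = [
        c for c in containers
        if all(c.labels.get(key) == value for key, value in required_labels.items())
    ]
    # Fail loudly instead of silently truncating, so a selector that is too
    # broad never reaches more containers than intended.
    if len(matches) > max_targets:
        raise ValueError(
            f"{len(matches)} containers match, which exceeds the limit of {max_targets}"
        )
    return matches


# Illustrative fleet; names and labels are made up for this example.
fleet = [
    Container("scores-api-1", {"team": "scores", "service": "api"}),
    Container("scores-worker-1", {"team": "scores", "service": "worker"}),
    Container("billing-api-1", {"team": "billing", "service": "api"}),
]

targets = select_targets(fleet, {"team": "scores", "service": "api"})
print([c.name for c in targets])
```

Raising an error when the selector is too broad, rather than picking an arbitrary subset, mirrors the reviewer's concern about not affecting other teams' servers.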

Gremlin Reliability Management Platform has positively impacted my organization because before Gremlin Reliability Management Platform, we did not even know how to conduct these chaos engineering tests. We heard about it, but we had no idea of how to do something of that nature. If there are ten servers, ten systems in our architecture and if suddenly something goes down, nobody knew what would happen next. We did not even know how to simulate these types of tests. This lack of confidence has been mitigated by using Gremlin Reliability Management Platform. Now we can confidently test and see which system is the most critical. If this goes down, what happens? How much business valuation are we going to impact? How much loss are we going to incur? All of this is now clearly visible and transparent.

Since using Gremlin Reliability Management Platform, we have reduced incidents by six percent after conducting our limited experiments. We also increased uptime from ninety-eight to ninety-nine percent, a gain of one percentage point.
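To put the uptime figures in perspective, the short calculation below converts an uptime percentage into hours of downtime per year. It assumes a 365-day year and that uptime is measured over the whole year; the review does not state the measurement window, so treat the numbers as illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def annual_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a service is unavailable at the given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


before = annual_downtime_hours(98.0)
after = annual_downtime_hours(99.0)

print(f"98% uptime: {before:.1f} hours of downtime per year")
print(f"99% uptime: {after:.1f} hours of downtime per year")
print(f"Difference: {before - after:.1f} hours per year")
```

Under these assumptions, moving from 98% to 99% uptime halves the expected annual downtime, even though the headline change is a single percentage point.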

What needs improvement?

Gremlin Reliability Management Platform can be improved because the pricing is a bit expensive and the learning curve for beginners is a bit steep. It is not easy to get started with, and we needed a good amount of time to grasp the concepts before we could use it. The infrastructure also needs to be very mature; it has to be set up properly, and that takes a lot of time for compliance and regulation.

For how long have I used the solution?

I have been using Gremlin Reliability Management Platform for a couple of years now. I think it has been two years since we started using it.

What do I think about the stability of the solution?

Gremlin Reliability Management Platform is quite stable, and I have not seen any downtime or issues with its behavior or performance.

What do I think about the scalability of the solution?

The scalability of Gremlin Reliability Management Platform depends on the scalability of the underlying infrastructure that we are hosting it on. So far for us, it has been pretty good and clean with no issues.

How are customer service and support?

I have not interacted with customer support often, as we never needed their help. The platform was quite stable.

How would you rate customer service and support?

Negative

Which solution did I use previously and why did I switch?

I did not use any different solution before Gremlin Reliability Management Platform. That is the only reliability management platform I have used, and it is pretty good.

How was the initial setup?

My experience with pricing, setup cost, and licensing is that it was a bit expensive, but most of it is handled by our team. I was not involved in the payment of it, as it was handled by the payments team.

What was our ROI?

I have seen a return on investment since using Gremlin Reliability Management Platform because fewer employees are needed now to conduct more reliable tests. If we needed ten people to do tests once upon a time, now, using Gremlin Reliability Management Platform, we can do it with a fifty percent reduction in employees. Only five people with Gremlin Reliability Management Platform can conduct much more reliable tests.

Which other solutions did I evaluate?

I did not evaluate any other platforms before choosing Gremlin Reliability Management Platform; we directly went to Gremlin Reliability Management Platform.

What other advice do I have?

There were a lot of good examples and great documentation for Gremlin Reliability Management Platform, which is something that I appreciate. It helped us a lot.

My advice for others looking into using Gremlin Reliability Management Platform is that in the starting stages, it will take some time to understand its capabilities, what it can do, and what it cannot do. The learning curve is a bit difficult, but once you understand it, it is a pretty great product to use. I rate Gremlin Reliability Management Platform a nine out of ten because, as I mentioned, the learning curve and the pricing made me reduce that one point.

Which deployment model are you using for this solution?

Private Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Ravi Konduru

VP Global at a tech vendor with 10,001+ employees

Apr 22, 2026

Proactive failure testing has increased confidence and reliability across complex microservices

What is our primary use case?

The primary reason I am using Gremlin Reliability Management Platform is to proactively test failures, identify weaknesses in my system, and fix them before real incidents actually occur.

From a proactive failure-testing standpoint, Gremlin Reliability Management Platform helped me break the ice around the fear of breaking things in production. I was able to make sure my teams do not hesitate to push code into production, given all the safety nets we have. Additionally, our test program keeps the blast radius tightly controlled, which minimizes the impact on users, and it has increased the observability of our platform as well.

Of its features, I rely on dependency mapping and testing very often in my day-to-day work. Dependency mapping automatically discovers service dependencies and tests scenarios such as dependency failures and increased latency. In a microservices architecture like mine, these failures cascade very easily. Using Gremlin Reliability Management Platform, I was able to contain cascading failures across microservices through dependency mapping and dependency testing.
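A rough sketch of why dependency failures cascade: given a map of which service calls which, a breadth-first walk over the reversed edges lists every service that could be affected when one dependency fails. The service names and the `impacted_by` helper are hypothetical, and this is not how Gremlin discovers or tests dependencies; it only illustrates the reasoning.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": ["ledger"],
    "search": ["inventory"],
    "ledger": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers: dict[str, set[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            callers.setdefault(dependency, set()).add(service)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)

    impacted.discard(failed)  # only relevant if the graph contains a cycle
    return impacted


print(sorted(impacted_by("ledger", DEPENDS_ON)))
print(sorted(impacted_by("inventory", DEPENDS_ON)))
```

In this made-up graph, a failure in `ledger` reaches every other service, while a failure in `inventory` reaches only `checkout` and `search`, which is the kind of difference a dependency test is meant to surface.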

What is most valuable?

The features that Gremlin Reliability Management Platform offers are comprehensive. The top five features that I can certainly identify are standardized reliability test suite, automated testing and scheduling, reliability score as the key differentiator, automatic risk detection, and dependency mapping and testing. These are the top five use cases I can think of regarding the greatest unique selling propositions for Gremlin Reliability Management Platform.

I want to add something else about the features of Gremlin Reliability Management Platform, which is what you call safe fault injection. From a chaos engineering perspective, one of the core principles of chaos engineering is injecting failures safely into a controlled environment and making sure the systems actually work dynamically without having any issues. Gremlin Reliability Management Platform automatically tests these conditions and checks their health, which is really an awesome feature with respect to testing the core chaos engineering principles.

Gremlin Reliability Management Platform has positively impacted my organization by providing confidence to our DevOps engineers and SRE team. Whatever the development teams are actually pushing code has been rigorously tested before it even goes into production. We were able to ensure team members could rely on our code check-ins, and the DevOps engineers can push code into production with the safety nets that we have around. Continuous automation, continuous validation, automation test executions, and traceability of changes are happening, along with measurable reliability evidenced by dashboards and scores. More than anything, we fix failures even before they occur, which is basically proactive risk detection and risk mitigation. These are all the key benefits we were able to offer to our customers, development partners, and testing partners.

What needs improvement?

When I started with the Gremlin certification programs, I completed two certifications, one professional and one practitioner. While the certifications helped me, onboarding and the learning curve could be made easier for people new to the program. That is one current gap I see.

Cost-benefit visibility is not very evident when using Gremlin Reliability Management Platform. It would help if the platform made the cost versus the benefit visible; that is the reason I give a score of less than ten. In terms of simulating complex real-world failures, I believe there is still a gap. Currently, Gremlin Reliability Management Platform mainly focuses on infrastructure-level failures and does not really simulate business logic failures, data corruption scenarios, or potential failures across regions.

Another key area for improvement is the limited team workflow integration I see in my organization. Collaboration and team workflow integration are the areas I would highlight.

For how long have I used the solution?

I have been using Gremlin Reliability Management Platform for over twelve months.

What do I think about the stability of the solution?

Gremlin Reliability Management Platform is stable.

What do I think about the scalability of the solution?

Most of our customers typically use AWS or GCP when using Gremlin Reliability Management Platform.

How are customer service and support?

The customer support for Gremlin Reliability Management Platform is good. There is a wealth of online material available for us to reference, and we primarily rely on raising tickets for additional support. It does offer different subscription models, and the expert partnership model is a significant strength I can suggest for Gremlin Reliability Management Platform.

Which solution did I use previously and why did I switch?

Some of my customers previously used different tools for observability and their own custom in-house chaos engineering platforms. Those were not very effective, and the customers were not satisfied with them. That is why they awarded the project to Coforge to help them transition to Gremlin Reliability Management Platform, which has delivered substantial benefits. One of our customers was not using any existing system at all, so we proposed Gremlin Reliability Management Platform.

What about the implementation team?

Most of these reliability management platforms are actually bought through our customer establishments. As a service provider, Coforge did not procure any licenses for Gremlin Reliability Management Platform.

What's my experience with pricing, setup cost, and licensing?

From a pricing standpoint, I would say Gremlin Reliability Management Platform is a bit expensive, but the expense is worth it given the benefits it offers. The challenge lies in convincing customers of the cost versus the benefit, which the system does not present clearly. This lack of visibility in the form of dashboards is a key challenge for me.

If we consider the cost versus the value perception, Gremlin Reliability Management Platform is certainly useful for large-scale systems. Especially when SLAs or downtime SLAs are in effect, those companies cannot afford to lose time due to infrastructure failures. Companies in this situation would not mind investing because the cost versus value perception is very high on that front.

Which other solutions did I evaluate?

I did not evaluate any other options before choosing Gremlin Reliability Management Platform because I already hold two certifications from Gremlin Reliability Management Platform. I considered Gremlin Reliability Management Platform to be the first choice and did not compare any other tools.

What other advice do I have?

I would certainly suggest others venture into Gremlin Reliability Management Platform, as there is no second thought about it. However, I would not recommend jumping straight into production chaos. My advice is to start small, define clear reliability goals first, and follow the right protocols. Invest in observability first, then assign clear ownership with each one of the owners through a RACI matrix. Finally, integrate it into your workflows. That is probably the best advice I could offer to newcomers entering the reliability management program.

As a final takeaway, using Gremlin Reliability Management Platform successfully means shifting your mindset from fixing problems when they happen to continuously proving that your systems can handle various types of failures.

I would rate my overall experience with Gremlin Reliability Management Platform a nine out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other

Vinaykumar Vishwakarma

DEVOPS specialist at a media company with 10,001+ employees

Feb 27, 2026

Chaos testing has increased confidence in Kubernetes reliability and reduced production issues

What is our primary use case?

My main use case for Gremlin Reliability Management Platform is to test. We are running a Kubernetes cluster on GCP, and we want to check our clusters, especially node reliability for the HA use case. The way we used to check the Kubernetes cluster is that we have multiple nodes with multiple tags on nodes, and we are deploying different applications on different nodes to ensure that all the nodes are up. We are using Gremlin Reliability Management Platform for chaos engineering to check those nodes in a pre-prod environment. Sometimes, we also check EC2 instances on Amazon.

What is most valuable?

The best feature that Gremlin Reliability Management Platform offers for me is the prebuilt reliability test; I think that is the best feature along with the automated scheduling. These are the best features that I can mention.

Gremlin Reliability Management Platform has positively impacted our organization by giving us more confidence in production. We are more confident about running chaos experiments in production. With the prebuilt tests, we run scalability tests on the infrastructure side, such as CPU, memory, and disk tests, as I mentioned earlier. If CPU usage goes above seventy-five percent, we make sure that services scale and behave correctly. If memory usage moves past the seventy-five percent threshold toward eighty percent, we take action accordingly. We also conduct the host redundancy tests that I mentioned before. As a result, we have reduced our production issues by thirty percent.
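The threshold checks described in this review can be sketched in a few lines of Python. This is only an illustrative sketch and not part of Gremlin's product or API: the `Sample` type, the `evaluate` function, and the action labels are hypothetical names, and treating seventy-five to eighty percent memory as a "watch" band with action at eighty percent is an assumed reading of the reviewer's description.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the review text; the band interpretation is an assumption.
CPU_SCALE_THRESHOLD = 75.0    # percent
MEMORY_WATCH_THRESHOLD = 75.0  # percent
MEMORY_ACT_THRESHOLD = 80.0    # percent


@dataclass
class Sample:
    """One utilization reading observed during an experiment (hypothetical type)."""
    cpu_percent: float
    memory_percent: float


def evaluate(sample: Sample) -> List[str]:
    """Return the follow-up checks a reading calls for (hypothetical labels)."""
    actions: List[str] = []
    # CPU above the threshold: confirm that the service actually scaled out.
    if sample.cpu_percent > CPU_SCALE_THRESHOLD:
        actions.append("verify-scale-out")
    # Memory at or above the upper bound: act; inside the band: keep watching.
    if sample.memory_percent >= MEMORY_ACT_THRESHOLD:
        actions.append("memory-action")
    elif sample.memory_percent > MEMORY_WATCH_THRESHOLD:
        actions.append("memory-watch")
    return actions
```

For example, a reading of 80 percent CPU and 78 percent memory would call for both a scale-out check and continued memory watching under these assumptions.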

What needs improvement?

For how long have I used the solution?

I have been using Gremlin Reliability Management Platform for two years.

What do I think about the stability of the solution?

Gremlin Reliability Management Platform is quite stable.

What do I think about the scalability of the solution?

The scalability of Gremlin Reliability Management Platform is good; it is scalable.

How are customer service and support?

The customer support for Gremlin Reliability Management Platform is good overall; the documentation is good.

How would you rate customer service and support?

Which solution did I use previously and why did I switch?

I did not previously use a different solution before Gremlin Reliability Management Platform.

What was our ROI?

We are seeing a return on investment from using Gremlin Reliability Management Platform because we have thirty percent fewer production issues, as I mentioned earlier, which makes it a great investment. Now we are free at least on long weekends, knowing what the issues are, and that is a great thing.

Which other solutions did I evaluate?

Before choosing Gremlin Reliability Management Platform, I did not evaluate other options.

What other advice do I have?

I rate Gremlin Reliability Management Platform an eight out of ten. I give it an eight because I want to see improvements on the machine learning side, particularly how it can be integrated with NLP.

I chose eight out of ten for Gremlin Reliability Management Platform because it is one of the best tools in terms of chaos engineering. It also has ready-made templates, and we are more confident about the production environment, which saves our time, especially during long weekends.

I advise others looking into using Gremlin Reliability Management Platform to run it for production-grade applications, specifically on Kubernetes, and run production Kubernetes at scale. That is how we are using it for multi-node clusters, multi-zone deployment, and microservices architecture. We can replicate some of the production issues by ensuring that a node is down, allowing us to deploy without issues while maintaining visibility. Reliability score is the main metric for the enterprise solution, and we have standardized tests and a history of tracking trends.

Elena

Dev Ops To Development (IT) at a non-tech company with self employed

Mar 2, 2026

Chaos testing has uncovered vulnerabilities and now drives stronger, more reliable infrastructures

What is our primary use case?

My main use case for Gremlin Reliability Management Platform is chaos testing. I take my infrastructure and then I sabotage some things to see how they reach the goal. I try network or infrastructure attacks mainly, and I play every code on Gremlin Reliability Management Platform. Regarding a memorable incident, I found a lot of vulnerabilities in some SMTP servers, and I fixed them with Gremlin Reliability Management Platform. It is interesting because Gremlin Reliability Management Platform is not a penetration testing tool, but by disrupting other parts of the infrastructure and then running some other tests, it serves this purpose effectively.

What is most valuable?

The best feature Gremlin Reliability Management Platform offers in my experience is having everything in one dashboard and the ability to perform tests of every kind of infrastructure. The flexibility is one of the main things about Gremlin Reliability Management Platform that I found, and it is really important. It is also important to have the possibility of targeting even specific or wider parts of infrastructure, and it is simple and well-thought-out to isolate things or put it in a more reasonable way.

Using Gremlin Reliability Management Platform has raised infrastructure reliability by more than fifty percent. I do not own a single infrastructure of my own because I am a freelancer, so I work across many customer cases, but the average improvement is very large. Mainly, we notice fewer incidents and less downtime. There are really two pathways: fewer incidents, because with Gremlin Reliability Management Platform we can make every part of the infrastructure more solid, and less downtime, because we can test more architectures and work out things like how to set up high availability clusters. The impact in clients' environments is really significant, and it is one of the special things.

What needs improvement?

As an improvement, I think it would be important to have resources for self-directed study of Gremlin Reliability Management Platform. There is a short, simple certification, but it would be great if they added more ways to learn and get certified for free, because the tool is very powerful. The documentation is very high quality, but I do not think documentation alone covers all the complexity of the tool. Some free learning paths and webinars could help.

I think it would be useful to have some integration with Splunk or other log collectors, or maybe in the future, the ability to link Dynatrace or any other observability platform.

For how long have I used the solution?

I have been using Gremlin Reliability Management Platform for about five years.

What do I think about the stability of the solution?

Gremlin Reliability Management Platform is stable.

What do I think about the scalability of the solution?

More than scalability, I thought about availability, because it is a really important property of architecture tools, but I think it is also scalable with AWS.

How are customer service and support?

The customer support quality is very good. I would rate the customer support an eight on a scale of one to ten.

How would you rate customer service and support?

Positive

Which solution did I use previously and why did I switch?

I was born as a chaos tester with Gremlin Reliability Management Platform, and I think I will die with it professionally.

How was the initial setup?

I purchased Gremlin Reliability Management Platform through the AWS Marketplace.

What was our ROI?

I cannot share relevant metrics because my customers cover it with an NDA. However, it is not a general impression; the numbers are impressive because, as I said previously, reducing downtime and mainly reducing failures is significant.

What's my experience with pricing, setup cost, and licensing?

It is not so cheap, but it has very powerful features. For my experience with pricing, setup cost, and licensing, the value is there.

Which other solutions did I evaluate?

I found Gremlin Reliability Management Platform and discovered the free learning path, so I dove into it before choosing Gremlin Reliability Management Platform.

What other advice do I have?

The main advice I would give to others looking into using Gremlin Reliability Management Platform would be to study it. Do not be shy to fail. Test everything and do lab architectures to test. It is very important to have hands-on experience with tools of this caliber. I would rate this review a nine out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

reviewer2805747

DevOps & Mlops Engineer at a printing company with 1-10 employees

Mar 2, 2026

Chaos experiments have revealed reliability risks and provide clear reliability scores

What is our primary use case?

My main use case for Gremlin Reliability Management Platform is the Chaos Engineering part for software. A quick specific example of how I've used Gremlin Reliability Management Platform for Chaos Engineering in my work is with a web service we have, where we need to know the reliability score of it. We conducted chaos experiments with it, including a network experiment, black hole, CPU, and memory experiments, that create chaos for the service, and then we receive a reliability score reflecting the service's reliability, especially in a production environment.

Gremlin Reliability Management Platform is amazing with the reliability score. There is a built-in Chaos Engineering experiment that can help you to provide this to your service. You run it on your service, and then you receive the reliability score from Gremlin Reliability Management Platform, along with insights on the issues and risks present in your service that you can examine and work on.

What is most valuable?

One of my favorite features of Gremlin Reliability Management Platform is the built-in chaos experiments, which give you the reliability score of your service.

The built-in chaos experiments and reliability scoring have helped me in my day-to-day work by making it easier to run the experiments directly instead of doing them manually one by one. For example, I can run scenarios for my web service; for CPU, it drives the container in Kubernetes from 25% to 75% CPU utilization, giving me more insight into how reliable my system is and making Chaos Engineering with Gremlin Reliability Management Platform easier.

Game Days can help you take a day with your team to experiment with your services in a production or pre-production environment, allowing you to see how reliable your system is, which is a great feature for the team to deep dive into Chaos Engineering.

Gremlin Reliability Management Platform has positively impacted my organization because we had clients come to us to implement Gremlin Reliability Management Platform as a Chaos Engineering platform for their use cases, which has gained us a lot of potential client opportunities as a consulting company. The reliability scores have improved, as built-in experiments give you the reliability scores, along with insights on risks you have, how you can manage and improve them, which is very helpful. In terms of faster incident response, especially in Kubernetes, if you have one container, Gremlin Reliability Management Platform flags the need for an HPA that will increase your reliability score for the service.

What needs improvement?

Gremlin Reliability Management Platform can be improved by introducing open-source features. It currently has a paid version, but introducing open-source features could encourage more people to use and try it.

The user interface is great, the integration is smooth, and Gremlin Reliability Management Platform has a fantastic support team that helps us a lot in many cases.

For how long have I used the solution?

I have been using Gremlin Reliability Management Platform for around two years, and I am certified in Gremlin Reliability Management Platform.

What do I think about the stability of the solution?

Gremlin Reliability Management Platform is stable with good availability and is very reliable.

What do I think about the scalability of the solution?

Gremlin Reliability Management Platform scales smoothly for running more chaos experiments, adding more services, or supporting a larger team. It can easily scale up your experiments for many of your services, and it can provide other experiments for interconnected dependencies.

How are customer service and support?

When I have questions or run into issues with Gremlin Reliability Management Platform, their support team is helpful and responsive. They resolve our problems quickly and provide assistance through Zoom meetings, which has been very effective in troubleshooting.

How would you rate customer service and support?

Negative

Which solution did I use previously and why did I switch?

We did not use any other solutions; we only started with Gremlin Reliability Management Platform.

What was our ROI?

I can see a return on investment because we save a lot of time during our Chaos Engineering experiments. We do not need to look at all the day's metrics on Grafana dashboards; we run our chaos experiments in a production environment to see how reliable our product or service is.

What's my experience with pricing, setup cost, and licensing?

My experience with pricing, setup cost, and licensing depends on the company. My role does not incur costs for us since we have an NFR for Gremlin Reliability Management Platform that we can use in our case.

Which other solutions did I evaluate?

I did not evaluate other options before choosing Gremlin Reliability Management Platform; the company did that, so I do not have an answer for that.

Which deployment model are you using for this solution?

On-premises

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Sayanta Banerjee

Documentation Engineer at a tech vendor with 1,001-5,000 employees

May 16, 2026

Chaos testing has revealed hidden weaknesses and drives stronger resilience in critical services

What is our primary use case?

My main use case for Gremlin Reliability Management Platform is to see how our applications behave under extreme stress and how resilient they are when a simulated server crash occurs alongside increased network latency. We want to see how our applications hold up to that.

Recently, I used Gremlin Reliability Management Platform to create a simulation where we increased the network latency significantly and killed one or two containers or pods in our Kubernetes cluster, testing how resilient our application is in such extreme stress scenarios. The results were impressive because it provided us insight into weaknesses such as poor failure mechanisms or scaling issues.

In addition to reliability testing, I also use Gremlin Reliability Management Platform for automated test scenarios, allowing us to perform dependency loss tests, latency testing, and scalability checks for our applications. It provides great disaster recovery validation checks and security certificate expiration checks.

What is most valuable?

The feature that excels the most in Gremlin Reliability Management Platform is the chaos engineering and failure simulation features because we simulate high server crashes with overwhelming network latency to test how our applications perform in a chaotic environment, providing great scenarios on where we are lacking and where we can improve our safety net.

We had an incident of complex memory exhaustion during a production deployment, and to prevent that from happening again, I used chaos engineering with Gremlin Reliability Management Platform to trigger CPU and memory exhaustion. This led us to understand how our application withstands those situations in a simulated environment and identify hidden weaknesses.

Gremlin Reliability Management Platform has positively impacted our organization by making outages less frequent and improving recovery time significantly, resulting in fewer complaints on the customer success side and overall optimization of our DevOps process. Although I'm not directly involved in business metrics, I heard that our downtime has reduced by at least fifteen percent on a quarter-over-quarter basis.

What needs improvement?

While I have no complaints about Gremlin Reliability Management Platform, I believe the UI can be improved to enhance the developer experience for security engineers and DevOps engineers. Additionally, AI-driven root cause analysis could provide more visibility for SRE teams.

I rate it an eight because we are still using it on a trial and error basis, and the pricing could be optimized for better cost visibility and ROI tracking. Otherwise, I believe it could achieve a ten.

The experience with pricing, setup cost, and licensing is moderately good but can improve through options such as having a pay-as-you-go pricing model.

For how long have I used the solution?

I have been working as a Documentation Engineer for around four years and five months.

What do I think about the stability of the solution?

Gremlin Reliability Management Platform is stable.

What do I think about the scalability of the solution?

Gremlin Reliability Management Platform's workload management capability is good, effectively managing large workloads seamlessly while providing safety mechanisms and governance around chaos engineering. This fosters a good SRE culture in our DevOps company.

How are customer service and support?

The customer support is good, with only occasional hiccups.

Which solution did I use previously and why did I switch?

We did not use any solutions previously; we only monitored our systems through our observability stack with Prometheus or Grafana.

What was our ROI?

The biggest impact we have seen is cutting down costs on on-call settings because complaints on downtimes and availability issues have reduced significantly, thereby lowering our CX operational load.

What other advice do I have?

For others considering Gremlin Reliability Management Platform, it is an excellent tool for organizations facing downtime issues, as it allows for chaos testing without needing to check logs and metrics, enabling extreme environment testing. I rate this product an eight.

Which deployment model are you using for this solution?

Private Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

reviewer2783910

Site Reliability Engineer at a tech services company with 10,001+ employees

Dec 3, 2025

Platform has improved reliability metrics but still raises questions about overall value

What is our primary use case?

The Enterprise Reliability Platform serves as my main use case for the next question.

A quick specific example of how I use The Enterprise Reliability Platform to maintain reliability and efficiency is that we have our own internal system to track and maintain the reliability and efficiency.

What is most valuable?

The Enterprise Reliability Platform has positively impacted my organization as it has significantly increased the efficiency and reliability of our systems.

I measured that increase in efficiency, and I can share that the metrics I noticed include latency and the SLOs, error budget, and not burning through the error budgets.
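For readers unfamiliar with the error budget mentioned here, the usual arithmetic can be sketched as follows. This is a generic illustration of the SLO concept and not a description of the reviewer's internal system or of any Gremlin feature; the function name and the example numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget.

    slo_target is the success-rate objective as a fraction (for example 0.999).
    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        # No traffic yet, so none of the budget has been spent.
        return 1.0
    # The budget is the number of failures the SLO tolerates over this traffic.
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures
```

With a 99.9 percent objective over one million requests, roughly one thousand failures are tolerated, so 250 failures would leave about three quarters of the budget unspent.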

What needs improvement?

I have no recommendations for how The Enterprise Reliability Platform can be improved.

For how long have I used the solution?

I have been using The Enterprise Reliability Platform for one year.

What other advice do I have?

I have no answer regarding the best features The Enterprise Reliability Platform offers.

I would provide no advice to others looking into using The Enterprise Reliability Platform.

My company does not have a business relationship with this vendor other than being a customer.

I was not offered a gift card or incentive for this review.

I do not have any additional thoughts about The Enterprise Reliability Platform before we wrap up.

I gave this review a rating of 6.

Which deployment model are you using for this solution?

Hybrid Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Praneel Zaudari

Performance Test Engineer at an educational organization with 51-200 employees

Apr 14, 2026

Continuous testing has uncovered long‑hidden AWS Kubernetes failures and reduced downtime

What is our primary use case?

My main use case for Gremlin Reliability Management Platform is to analyze failures in AWS. I cannot provide details about how I do this, as company policy does not allow it, but I have used Gremlin Reliability Management Platform to test the Kubernetes deployment of the application we run on AWS. I do not have anything else to add about my main use case with Gremlin Reliability Management Platform, including how often I run these tests or what kind of results I typically look for. Gremlin Reliability Management Platform is deployed in our organization on the public cloud.

What is most valuable?

Gremlin Reliability Management Platform offers flexibility and ease of use. It gives me ready-made templates for testing Kubernetes and more, which helps me a lot.

Gremlin Reliability Management Platform has impacted my organization positively, as it helped a lot and reduced our failures. We were able to pinpoint the critical issues in our application, which had existed for three to ten months without anyone finding them. Using Gremlin Reliability Management Platform, we were able to test the application and find those exact points.

I noticed many improvements in our application after using Gremlin Reliability Management Platform. Downtime was reduced, and our application now runs smoothly. It is a great tool that helped us improve our application, which is deployed on AWS.

What needs improvement?

Gremlin Reliability Management Platform is already good, and I do not see any improvements in it. I do not want to add more about the needed improvements for Gremlin Reliability Management Platform, even small things, and I do not wish for anything to be a bit different.

For how long have I used the solution?

I have been using Gremlin Reliability Management Platform for six months.

What do I think about the stability of the solution?

Gremlin Reliability Management Platform is very stable and good.

What do I think about the scalability of the solution?

The scalability of Gremlin Reliability Management Platform is good.

How are customer service and support?

The customer support for Gremlin Reliability Management Platform is very good. I would rate the customer support for Gremlin Reliability Management Platform a ten on a scale of one to ten.

Which solution did I use previously and why did I switch?

I have not used a different solution. Gremlin Reliability Management Platform is the first solution I have used. I have used Amazon FIS, but Gremlin Reliability Management Platform was more effective.

How was the initial setup?

I did purchase Gremlin Reliability Management Platform through the AWS Marketplace.

What about the implementation team?

My company handled the pricing and setup cost for Gremlin Reliability Management Platform, and the cost is good and low for the services provided.

What was our ROI?

I have seen a return on investment, as it saved time. We were able to find the errors in two to three days, whereas they had gone undetected for three to eight months. It saved a lot of time, and fewer employees were needed. Only two or three people were allocated to Gremlin Reliability Management Platform.

What's my experience with pricing, setup cost, and licensing?

My company handled the pricing and setup cost for Gremlin Reliability Management Platform, and the cost is good and low for the services provided.

Which other solutions did I evaluate?

Before choosing Gremlin Reliability Management Platform, I evaluated Amazon FIS, which is the only other option I considered.

What other advice do I have?

My advice to others looking into Gremlin Reliability Management Platform is that it is more effective than Amazon FIS for those who want to dive deep into their application to find the actual reasons for downtime. You can examine the application by running chaos engineering experiments on it. It has many templates you can use to test your application on Kubernetes and more. I give this product a rating of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
