What is our primary use case?
In my current project, I work with a US customer focused entirely on LLMs and security. Their clients give us the data sources, and we provision Greengrass, which runs as an agent on the end client to collect the customer's alarms. The work is primarily on the SOC 2 side: data goes to the SOC dashboard or SOC data source, and an LLM filters how many alarms are false positives versus true positives. As a result, the NOC engineers watching the SOC 2 dashboard do not have to work out which alarms are real; our LLM API or tool filters them and shows the counts of true positives and false positives on the dashboard. This significantly eases the burden on the NOC engineers.
We use Amazon EKS for our application because most components are LLM-based and require GPUs, so we have specific managed node groups, and we also use Karpenter along with HPA. When demand rises, the processing queues are held as lists in Amazon ElastiCache for Redis; the lists are read from Redis and the work is handed to nodes dynamically through HPA. We rely on HPA and Karpenter within Amazon EKS to scale in and out.
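As a rough illustration of the dashboard summary described above, the Python sketch below tallies alarms that have already been labeled. The `Alarm` type and the `is_true_positive` field are hypothetical; the verdict is assumed to come from the LLM filter, which is not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    alarm_id: str
    is_true_positive: bool  # verdict assumed to come from the LLM filter


def summarize(alarms):
    """Count true and false positives for display on a dashboard."""
    counts = Counter(
        "true_positive" if alarm.is_true_positive else "false_positive"
        for alarm in alarms
    )
    # Always report both keys, even when one of the counts is zero.
    return {
        "true_positive": counts["true_positive"],
        "false_positive": counts["false_positive"],
    }


alarms = [Alarm("a-1", True), Alarm("a-2", False), Alarm("a-3", False)]
summary = summarize(alarms)
print(summary)
```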
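To make the queue-driven scaling concrete, here is a minimal Python sketch of the replica calculation the Kubernetes HPA documents: desired replicas = ceil(current replicas × current metric / target metric), with a default tolerance of 0.1. The queue length, per-pod target, and replica bounds are assumed values for illustration, not our production settings.

```python
import math


def desired_replicas(current_replicas, queue_length, target_per_pod,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the replica count an HPA would request for a queue metric."""
    if current_replicas < 1 or target_per_pod <= 0:
        raise ValueError("current_replicas must be >= 1 and target_per_pod > 0")
    # Average queue items per running pod, compared with the per-pod target.
    ratio = (queue_length / current_replicas) / target_per_pod
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count unchanged.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 120, 20))  # backlog grows: scale out
print(desired_replicas(2, 500, 20))  # large backlog: capped at max_replicas
print(desired_replicas(4, 0, 20))    # empty queue: scale in to min_replicas
```

When the new pods cannot be scheduled on existing GPU nodes, the node autoscaler is what adds capacity; the HPA itself only changes the pod count.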
How has it helped my organization?
Amazon EKS is simple to work with because it supports a wide range of AWS tools and integrations, which has significantly shaped how I develop and manage applications. Trust relationships and permissions are a good example. Every service we create is assigned a role, and every authentication path is tied to a policy that decides access. The same mechanism extends across accounts through ARNs (Amazon Resource Names). With these security guardrails in place, services are accessed easily while security stays high, which gives me confidence in its management capabilities.
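For readers less familiar with trust relationships, the Python sketch below builds the kind of IAM trust policy document that lets principals in another account assume a role. The account ID is a placeholder and the helper name is hypothetical; the optional external ID condition is a common addition and is not something this review describes us using.

```python
import json


def cross_account_trust_policy(trusted_account_id, external_id=None):
    """Build a trust policy allowing another AWS account to assume a role."""
    if len(trusted_account_id) != 12 or not trusted_account_id.isdigit():
        raise ValueError("AWS account IDs are 12 digits")
    statement = {
        "Effect": "Allow",
        # The account root ARN delegates the decision to that account's IAM.
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if external_id is not None:
        statement["Condition"] = {
            "StringEquals": {"sts:ExternalId": external_id}
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


policy = cross_account_trust_policy("111122223333")  # placeholder account ID
print(json.dumps(policy, indent=2))
```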
What is most valuable?
The HPA feature within Kubernetes is one aspect I appreciate about Amazon EKS; it is beneficial for scalability. Managed or dedicated node groups with GPU capacity let us scale in or out according to our capacity planning, and Karpenter, which we have provisioned for node scaling, contributes to that. This level of control over the nodes is not available by default with plain EC2 instances or with ECS. Many people try to adopt Lambda functions or other serverless solutions, but in my experience that approach falls short for GDPR and HIPAA compliance, which demands a solid identity for the node and clarity about where it is provisioned. Hence, we cannot depend on Lambda or serverless components, which may be unstable; they also often increase the bill, because scheduled procedures of a dynamic nature need a standalone node rather than a transient one.
On the integration with IAM, I have two thoughts about the Amazon EKS authentication process. When I began using AWS, access was controlled through the aws-auth ConfigMap within Amazon EKS. Initially, only the person who created the cluster had access to it. In enterprise roles, such as the one I held in the UK working for Santander, that presents challenges. Only the creator had the master role for accessing the nodes, and onboarding was rather manual, with companies relying on ServiceNow and other tools to onboard new joiners or team members. Now the EKS API and eksctl offer many settings that are not enabled by default, so you need to enable them. Most clusters are created using Terraform, and you need to create a role that can manage cross-account access, as many customers don't operate with a single account.
My previous employers, such as Fidelity Investments, Nokia, and Ford, worked across multiple accounts, which necessitates single sign-on. This setup allows cross-account access to the cluster through eksctl and the EKS API, which use single sign-on to onboard team members. Once the role has access to the cluster, it can onboard users as a dev user, admin, or tester, simplifying the onboarding process. Previously manual tasks can be automated, which is a significant improvement. Earlier, we had to change the ConfigMap to onboard users, but with eksctl and the EKS API, the integration between EKS, the Kubernetes service, and the cloud side has improved tremendously, alleviating many worries.
Self-healing nodes help minimize the administrative burden in my projects. Coming from a telecom background where I worked for over seven years, I'm familiar with SON, which provides self-organizing and self-healing functionality. We interact with the logical layers, while AWS handles the physical layer through its software components. For instance, if a node is not ready and you have enabled EKS Auto Mode, AWS manages that for you, including AMI upgrades and malfunctioning nodes. I've seen these features in the UI and enabled them, and patches roll out every 10 or 15 days. I can check in Amazon Inspector whether any node-level or AMI-level patches are needed. AWS takes care of these issues automatically. I appreciate that I don't need to check the dashboard manually and apply upgrades one by one, which is a significant improvement.
I measure the impact of Amazon EKS on how the organization manages complex workloads, in terms of effectiveness and efficiency, through my background in development and systems. I spent five years as a Java developer before moving to DevOps. With my understanding of end-to-end application architecture, I assess workloads based on system planning and application planning. For example, when I worked on a data lake product at Fidelity Investments, the cloud onboarding process, including Amazon EKS, had roadmaps extending over five years, from 2019 to 2024. I understand the nuances of enterprise and legacy applications and the system-related complexities that come with them.
It all boils down to those two components: system planning and application planning. First, we identify the type of application, for example whether it is database-related or has high GPU demands. Most applications today involve GPUs, which tend to incur high costs, and customers are often unaware of how to handle dynamic workloads effectively. It is crucial to assess not just one part of the system but several elements, such as CPU, memory, and IOPS, since the underlying hardware interacts with those components regardless of domain. We also need to evaluate the application's requirements, such as its dependency on node storage. With EKS, Kubernetes provides pluggable interfaces such as CSI, CNI, and CRI. By understanding the application's demands, I can apply the right Kubernetes configuration for performance, for example by using Amazon EKS's ability to adjust the container network interface settings to suit the client's workload. This loose coupling lets us optimize resources whether we run on-prem or in the cloud.
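The onboarding by persona described above could be sketched in Python as follows. This only builds request parameters as plain dictionaries; the persona-to-policy mapping is my own assumption for illustration, and the parameter names follow my understanding of the EKS access entry operations, so they should be checked against the current API reference before use. The cluster name and role ARN are placeholders.

```python
POLICY_PREFIX = "arn:aws:eks::aws:cluster-access-policy/"

# Hypothetical mapping from team persona to an EKS access policy.
PERSONA_POLICIES = {
    "admin": "AmazonEKSClusterAdminPolicy",
    "dev": "AmazonEKSEditPolicy",
    "tester": "AmazonEKSViewPolicy",
}


def access_requests(cluster_name, role_arn, persona, namespaces=None):
    """Return (create_entry, associate_policy) parameter dicts for one role."""
    if persona not in PERSONA_POLICIES:
        raise ValueError(f"unknown persona: {persona}")
    # Scope the policy to namespaces when given, otherwise to the whole cluster.
    if namespaces:
        scope = {"type": "namespace", "namespaces": list(namespaces)}
    else:
        scope = {"type": "cluster"}
    create_entry = {"clusterName": cluster_name, "principalArn": role_arn}
    associate_policy = {
        "clusterName": cluster_name,
        "principalArn": role_arn,
        "policyArn": POLICY_PREFIX + PERSONA_POLICIES[persona],
        "accessScope": scope,
    }
    return create_entry, associate_policy


entry, association = access_requests(
    "demo-cluster",                                # placeholder cluster name
    "arn:aws:iam::111122223333:role/sso-dev",      # placeholder SSO role ARN
    "dev",
    namespaces=["team-a"],
)
print(association["policyArn"])
```

With an SDK such as boto3, dictionaries of this shape would typically be passed as keyword arguments to the corresponding EKS client calls; that step is left out so the sketch runs without AWS credentials.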
What needs improvement?
I started using Amazon EKS in 2019. At that time it was completely new and few people were using it; it started from version 1.21, and right now we are on 1.33. Version 1.34 was launched recently, but it is not yet available in the service catalog, where we can see only 1.33. A lot of improvements have been made.
We had numerous add-ons to install manually, because Kubernetes is a completely separate system from the AWS cloud provider, even though everyone has opted to use it. Once you opt in, you have two identities to maintain: one at the IAM level and one within the Amazon EKS cluster. A few things about the add-ons do not make sense, particularly the secret providers that read a secret from Secrets Manager and mount it as a volume. We use the Secrets Store CSI driver, which reads secrets or sensitive data from Secrets Manager and mounts them as a volume in the pod at runtime. However, it has no dynamic feature that picks up a changed secret and repopulates it in the environment.
Consider what happens when your RDS or OpenSearch password rotates. Amazon EKS has no feature to detect that the password changed overnight; the provider offers no functionality to notice the change and then restart the pod or fetch the new value. This often leads to 6 or even 12 hours of downtime, depending on when you notice, so that needs improvement.
Nonetheless, the add-on side has developed a lot. Earlier we installed add-ons manually, but now with EKS Auto Mode many of them, such as the VPC CNI and the Pod Identity service (around four plugins in total), are installed by default, which is a good thing. However, I believe there should be a self-contained solution that covers the generic use cases.
With the 1.33 release, they have addressed most of my earlier concerns, but I am still looking for improvements, particularly in CloudWatch monitoring. In IT, we manage two things: the system and the application. Currently, application logging and monitoring in CloudWatch are not very robust; you can only find things if you already know where to look. Fortunately, we do, since most monitoring involves two types of databases: a time-series database for metrics, and an indexing solution for streamed logs. This means we need to collect the logs from each node, index them, and display them on a screen. That part remains a separate service. If it were managed within the Amazon EKS service, with monitoring consolidated in one place, you wouldn't need to rely on Prometheus, Grafana, or other services. A consolidated platform for EKS would be an advantage: since Kubernetes is already in use, monitoring and logging should be integrated simply by enabling parameters or tags. This would create a self-contained platform where people can onboard and start using it. Currently, I still need to enable logging and monitoring, among other things, myself; that shouldn't be the case after six or seven years in the market.
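The behavior I would like to see can be sketched in a few lines of Python: poll the secret, compare a fingerprint with the last one seen, and trigger a restart when it changes. This is only an illustration of the idea under stated assumptions; `fetch_secret` and `on_rotate` are hypothetical callables standing in for a Secrets Manager read and a pod restart, and the in-memory `store` replaces the real secret backend.

```python
import hashlib


def fingerprint(secret_value):
    """Hash the secret so the watcher never keeps the plain value around."""
    return hashlib.sha256(secret_value.encode("utf-8")).hexdigest()


class RotationWatcher:
    def __init__(self, fetch_secret, on_rotate):
        self._fetch = fetch_secret      # hypothetical: reads the current secret
        self._on_rotate = on_rotate     # hypothetical: restarts or reloads pods
        self._last = None

    def check(self):
        """Return True and call on_rotate if the secret changed since last check."""
        current = fingerprint(self._fetch())
        if self._last is None:
            # First observation: record a baseline without triggering anything.
            self._last = current
            return False
        if current != self._last:
            self._last = current
            self._on_rotate()
            return True
        return False


store = {"password": "old-value"}   # stand-in for the secret backend
restarts = []
watcher = RotationWatcher(lambda: store["password"],
                          lambda: restarts.append("restart"))

results = [watcher.check()]          # baseline
store["password"] = "new-value"      # simulated overnight rotation
results.append(watcher.check())      # change detected
results.append(watcher.check())      # nothing new
print(results, restarts)
```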
On a scale from 1 to 10, I would rate Amazon EKS tech support an eight. Some support engineers have a deep understanding of the services and can identify potential bottlenecks, especially with load balancer endpoints and certificate management. The shift from NGINX to AWS load balancers has reduced many previous issues. However, not every support engineer has the same level of expertise, which is why I rate it a solid eight, which I consider decent.
For how long have I used the solution?
I have been using Amazon EKS for seven years.
How are customer service and support?
Amazon's customer support has its merits; it is good overall. However, when it comes to enterprise licenses, the quality declines significantly. Startups may not recognize this at first, engaging with support during their initial phase, but they soon discover the lack of expert guidance and the costs associated with it—it's quite expensive.
Which solution did I use previously and why did I switch?
I have experience with other Kubernetes engines, such as those from Oracle, Azure, and Google. I also find Red Hat OpenShift to be an excellent competitor. Many have moved to Azure because of the hosting incentives it offers for running OpenAI models. While people explore Amazon EKS or other Kubernetes solutions, the ease of moving between platforms remains an advantage, since your manifests mostly stay the same and need only minor adaptations to services. My experience with Red Hat OpenShift tells me it is a robust solution that competes well against AWS.
What was our ROI?
In terms of ROI, I have clear examples. When we began, we used managed nodes for applications with dynamic processing needs. Some applications only needed pods to be created once data was queued for processing, leaving the node free afterward. For one customer whose setup was not cost-effective (we had provisioned a node that saw little CPU usage but substantial memory consumption), we implemented resource quotas and HPA. This setup lets the system use resources dynamically, based on actual demand as read from metrics through the Prometheus adapter. Notably, the adapter isn't an out-of-the-box feature of Prometheus; you need to set up your own adapter for it. Using tools such as HPA and Karpenter allowed us to scale resources based on requirements. Without them, the solution was unoptimized. With them in place, we cut costs to roughly a quarter to a third of what they were: if it was $100 beforehand, we brought it down to about $25.
What's my experience with pricing, setup cost, and licensing?
Regarding the pricing aspect and the licensing cost of Amazon EKS, sometimes it is not clear. Most discussions revolve around the data transfer costs from one region to another, and there are certain concerns regarding GPU nodes. However, if you optimize your node usage, with tools such as Kubecost, you can analyze how effectively you utilize your nodes. If you manage to optimize usage, you won't face steep costs. Otherwise, the cloud provider will certainly benefit from inefficient usage. Ultimately, it's not out of the box—if you want to monitor costs effectively, applying separate tools and acting accordingly in advance is essential.
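To show the kind of check I mean by analyzing node usage, here is a small Python sketch that estimates how much of a node's hourly price is paying for idle capacity. It is not Kubecost's formula; treating the most-used resource as the binding one is my own simplifying assumption, and the node size, usage figures, and price are made-up numbers.

```python
def utilization(used, capacity):
    """Fraction of a resource in use, bounded to the range 0..1."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return max(0.0, min(1.0, used / capacity))


def idle_cost_per_hour(hourly_price, cpu_used, cpu_capacity,
                       mem_used_gib, mem_capacity_gib):
    """Estimate the hourly spend on idle capacity for one node."""
    cpu = utilization(cpu_used, cpu_capacity)
    mem = utilization(mem_used_gib, mem_capacity_gib)
    # Assumption: the node is "as full as" its most-used resource.
    binding = max(cpu, mem)
    return hourly_price * (1.0 - binding)


# Hypothetical node: little CPU usage but substantial memory consumption.
cost = idle_cost_per_hour(hourly_price=1.00,
                          cpu_used=2, cpu_capacity=8,
                          mem_used_gib=24, mem_capacity_gib=32)
print(f"idle spend per hour: ${cost:.2f}")
```

A node like this one is memory-bound, so a memory-optimized instance type or tighter memory requests would usually be the first thing to look at.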
Which other solutions did I evaluate?
I notice key differences between Amazon EKS and its competitors, with pros and cons on both sides. Seamless integration is sometimes lacking in other offerings. When managing software on Kubernetes platforms such as EKS, AKS, GKE, Rancher Kubernetes, and Oracle's Kubernetes engine, I've faced specific challenges, particularly with user management in Oracle's solution, which isn't as seamless as it is in Amazon EKS. By comparison, OpenShift from Red Hat has notable strengths. Oracle is making improvements, especially around its longstanding database solutions. Among cloud providers, though, OpenShift from Red Hat is a formidable competitor, offering robust out-of-the-box solutions for resource limits and dashboard configurations that do not require command-line intervention.
What other advice do I have?
I have a few recommendations for people considering Amazon EKS. Teams often try to enforce infrastructure as code with tools such as Terraform from HashiCorp to maintain workspaces and the rest of the environment. I advise keeping services within a single environment, especially for LLM applications. It is prudent to have multiple LLM sources across various cloud providers while using the same keys within your AWS environment.
My second piece of advice is to establish a separate CI/CD platform independent of AWS. This keeps things loosely coupled; with minimal tweaks in CI/CD pipelines, you can seamlessly migrate from one platform to another—say from EKS to AKS to GKE or OpenShift—thus keeping the focus on feature development rather than migration headaches. This leads to a modular approach in your code and infrastructure, ensuring that only the cloud provider specifics require adjustments.
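One way to picture this loose coupling is a shared base manifest with a thin layer of provider-specific overrides applied by the pipeline. The Python sketch below is a minimal illustration under that assumption; the manifest, the `render` helper, and the storage class names are examples only and should be verified against each cluster rather than taken as authoritative.

```python
import copy

# Provider-neutral manifest kept in the application repository.
BASE_MANIFEST = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "model-cache"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "50Gi"}},
    },
}

# Illustrative provider-specific differences (check the real class names).
PROVIDER_OVERRIDES = {
    "eks": {"spec": {"storageClassName": "gp2"}},
    "aks": {"spec": {"storageClassName": "managed-csi"}},
    "gke": {"spec": {"storageClassName": "standard-rwo"}},
}


def deep_merge(base, override):
    """Return a new dict with override values layered on top of base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render(provider):
    """Produce the manifest for one provider without touching the base."""
    return deep_merge(BASE_MANIFEST, PROVIDER_OVERRIDES[provider])


for name in ("eks", "aks", "gke"):
    print(name, render(name)["spec"]["storageClassName"])
```

Only the override table changes when moving between providers, which is the point of keeping the CI/CD platform independent of AWS.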
Overall rating: 8 out of 10
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)