What is our primary use case?
I have been using Amazon EMR for data engineering projects, specifically for submitting jobs to the data clusters. I also focus on resource utilization, monitoring job status, and managing memory resources.
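As an illustration of the job submission mentioned above, here is a minimal sketch of one Spark step in the shape the EMR AddJobFlowSteps API accepts. The helper name `build_spark_step`, the job name, the bucket, and the script path are all hypothetical; this is not taken from any specific pipeline.

```python
def build_spark_step(name, script_uri, args=(), action_on_failure="CONTINUE"):
    """Return one step definition for a spark-submit job on an EMR cluster."""
    return {
        "Name": name,
        # What the cluster does if this step fails; CONTINUE keeps later steps running.
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *args],
        },
    }


step = build_spark_step(
    "daily-aggregation",                      # hypothetical job name
    "s3://example-bucket/jobs/aggregate.py",  # hypothetical script location
    args=["--date", "2024-01-01"],
)

# With boto3 installed and credentials configured, the step could be submitted with:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster id>", Steps=[step])
```

Building the step as a plain dictionary keeps it easy to review and test before anything is sent to a cluster.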
What is most valuable?
The cluster properties are the most recognized feature, and integration with other tools in the Big Data, Data Science, and AI areas is crucial.
Amazon EMR helps with scalability, real-time and batch data processing, efficient handling of data sources, and managing data lakes, data stores, and data marts on file systems and in S3 buckets. It supports frameworks for handling structured and unstructured data.
What needs improvement?
There is room for improvement with respect to retries, handling the volume of data in S3 buckets, cluster provisioning, scaling, termination, security, and integration between services like S3, Glue, Lake Formation, and DynamoDB.
For how long have I used the solution?
I have used the solution for almost ten years.
What was my experience with deployment of the solution?
Deployment mostly depends on how the job or pipeline uses the cluster. It can also depend on the data sources and on the configurations in the JSON and XML files. Monitoring through the management console, together with attention to availability zones, IAM, permissions, roles, and data encryption and decryption, is important for a smooth deployment.
What do I think about the stability of the solution?
Stability is supported by the availability zones, failover capabilities, and fault tolerance, which are really helpful. Regular updates, patch installations, monitoring, logging, alerting, and disaster recovery activities are crucial for maintaining stability.
What do I think about the scalability of the solution?
Scalability can be provisioned using the auto-scaling feature, EC2 instances, on-demand instances, and storage locations like block storage, S3, or file storage. Cluster size can be managed by adding nodes automatically or manually. It is important to monitor jobs, set up alerts, and have enough details to configure a proper cluster.
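For the automatic case, one option is EMR managed scaling, where the cluster grows and shrinks between a minimum and a maximum size. The sketch below builds such a policy as a dictionary; the helper name `build_managed_scaling_policy` and the limits of 2 and 10 instances are assumptions chosen for illustration only.

```python
def build_managed_scaling_policy(min_units, max_units, unit_type="Instances"):
    """Return a managed scaling policy with lower and upper cluster size limits."""
    if min_units < 1 or max_units < min_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,              # "Instances", "InstanceFleetUnits", or "VCPU"
            "MinimumCapacityUnits": min_units,  # the cluster never shrinks below this
            "MaximumCapacityUnits": max_units,  # the cluster never grows beyond this
        }
    }


policy = build_managed_scaling_policy(2, 10)  # illustrative limits, not a recommendation

# With boto3 installed and credentials configured, the policy could be attached with:
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="<cluster id>", ManagedScalingPolicy=policy)
```

The upper limit also acts as a cost ceiling, which is why it helps to have enough job details before choosing it.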
How are customer service and support?
The technical support is available 24/7. We have to submit tickets, and they are usually proactive in addressing issues. They help with billing, cost determination, IAM properties, security compliance, and deployment and migration activities.
How would you rate customer service and support?
What's my experience with pricing, setup cost, and licensing?
Compared to others, Amazon seems efficient and is considered good for Big Data workloads. Costs depend on cluster resources, data volumes, EC2 instances, instance sizes, Kubernetes, Docker services, storage, and data transfers. Cost optimization can be achieved through instance usage, cluster sharing, and auto-scaling.
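To make the effect of auto-scaling on cost concrete, here is a rough arithmetic sketch. It assumes each node is billed an EC2 rate plus an EMR rate per hour; all rates, node counts, and hours below are hypothetical placeholders, not actual AWS prices, and data transfer is left out.

```python
def estimate_cluster_cost(node_count, hours, ec2_rate, emr_rate,
                          storage_gb=0.0, storage_rate=0.0):
    """Estimate cost as node-hours times the combined hourly rate, plus storage."""
    compute = node_count * hours * (ec2_rate + emr_rate)
    storage = storage_gb * storage_rate
    return round(compute + storage, 2)


# Hypothetical rates per node-hour, for illustration only.
EC2_RATE = 0.20
EMR_RATE = 0.05

# A fixed 10-node cluster running for a full day.
fixed = estimate_cluster_cost(10, 24, EC2_RATE, EMR_RATE)

# The same day with scaling: 10 nodes for a 6-hour peak, 3 nodes for the other 18 hours.
scaled = (estimate_cluster_cost(10, 6, EC2_RATE, EMR_RATE)
          + estimate_cluster_cost(3, 18, EC2_RATE, EMR_RATE))

print(f"fixed: {fixed}, scaled: {scaled}")
```

Under these assumed numbers the scaled cluster costs less than half of the fixed one, because most of the day runs on fewer nodes.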
What other advice do I have?
Setup and planning are crucial for using the solution. This involves configuring instances, auto-scaling, security groups, performance efficiency, labeling, storage, maintenance, disaster recovery, backups, patch installations, user permissions, roles, and logging. I rate Amazon EMR ten out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?