I use the solution in my company to build a data lake from a variety of data sources, such as Oracle, MongoDB, and SQL Server. We use AWS S3 buckets as the data lake storage, and then we use AWS Glue to process the data and move it to AWS's search engine, which works like a lakehouse solution.
AWS Glue handles huge volumes of data, and it is a serverless tool. We don't need to provision extra servers, since resources are used only for as long as a job runs in AWS Glue. Maintaining other options, like AWS EMR, can be very costly for the company. AWS Glue also creates a data catalog.
The drawbacks of the product stem from the fact that, depending on the data volume, it can become very costly. The cost is huge if the source system is not properly designed. If the changes in the source are frequent and not valid, then, initially, you will process huge amounts of data in the ETL. The biggest challenge with AWS Glue is its cost, which takes up one-third of my entire pipeline cost.
I have been using AWS Glue for two years. I am a customer of the tool.
It is a very scalable solution. The problem with the tool is that it has a huge set of options, so there are a lot of hidden costs involved. For example, if you want to get into data quality, the data quality part is billed as well. From what I have learned, we need to scrap AWS Glue and look for other solutions.
We use Amazon's services for technical support for the product. Oracle and other vendors offer a single support channel, and other tools have a direct support window. With Amazon, we need to pay ten percent of our billing amount for the tool to get support services. Whether to raise a support ticket or not becomes an issue since ten percent is a huge amount, so my company ends up using all the options without help from support. It is very difficult for an ordinary user to understand why there is a need to pay ten percent for support. If I find an issue in the product and need AWS to fix it, I still have to pay ten percent of the tool's bill to Amazon. AWS is very tricky to work with because everything keeps evolving. AWS engineers are getting hired from other places, and if I still don't get technical support after that, things will get very nasty. There are some good engineers who help users outside the normal support cycle, but that doesn't meet users' needs.
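To make the ten percent concrete, here is a minimal sketch of the arithmetic. It assumes the flat ten percent rate I described above; the function name and the monthly bill figure are hypothetical, and actual support plan pricing may be structured differently.

```python
def support_fee(monthly_bill: float, rate: float = 0.10) -> float:
    """Support charge as a flat fraction of the monthly bill.

    The 10 percent default is the figure from the review, not an official price list.
    """
    return monthly_bill * rate


# Hypothetical monthly bill, for illustration only.
bill = 12_000.00
fee = support_fee(bill)
print(f"bill={bill:.2f} support={fee:.2f} total={bill + fee:.2f}")
```

The point is that the support charge grows with the bill, so the more data the pipeline processes, the more it costs just to be able to raise a ticket.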
I rate the technical support a four out of ten.
The product's installation phase is easy, especially since it is serverless.
The solution can be deployed in a day or two since it is serverless. AWS Glue alone doesn't work perfectly, though. It works well only if we have the right data model. Inside AWS Glue, we use PySpark.
My company is in the process of scrapping AWS Glue, as it is not approving the budget for us to use it. I am looking at solutions that are not costly, like the ClickHouse database, which is an open-source tool.
The costs of the tool are huge, especially when moving data from the source to the data lake.
I rate the tool's pricing an eight on a scale of one to ten, where one is cheap and ten is expensive. I cannot predict the costs associated with the tool.
The main piece of AWS Glue is the ETL part. AWS Glue handles the ETL from S3 data sources to Redshift. We also use AWS Glue for CDC (change data capture).
As the product is serverless, the tool runs fine. Maintenance and monitoring, however, are among the biggest challenges of the tool.
In terms of the product's ability to handle data volumes as scaling needs grow, I would say that it can handle the volumes, but the challenges are associated with costs.
There will be latency if the source has a huge amount of data coming in, because the source system is read sequentially due to the way CDC works. I need to capture the changes in the source in the order in which they happened. If you have a better network, you can also scale up by bringing the source data to S3 or AWS Glue, but even when you can scale up, it is not really relevant for the group. The latency is not because of AWS Glue itself. With any ETL or CDC, I would need to process the changes the same way I do with AWS Glue. I cannot do parallel processing, and I need to do it sequentially.
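The ordering constraint can be illustrated with a small sketch in plain Python. This is not how AWS Glue implements CDC; the event layout, the `replay` function, and the sample records are all hypothetical, and the sketch only shows why change events for the same key have to be applied in sequence.

```python
from typing import Dict, Iterable, Optional, Tuple

# A change event: (sequence number, operation, key, value).
# This layout is a hypothetical simplification of a CDC record.
Change = Tuple[int, str, str, Optional[str]]


def replay(changes: Iterable[Change]) -> Dict[str, Optional[str]]:
    """Apply change events one by one, in the order given, and return the final state."""
    table: Dict[str, Optional[str]] = {}
    for _seq, op, key, value in changes:
        if op in ("insert", "update"):
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table


changes = [
    (1, "insert", "order-1", "created"),
    (2, "update", "order-1", "shipped"),
    (3, "delete", "order-1", None),
    (4, "insert", "order-2", "created"),
]

in_order = replay(changes)
out_of_order = replay(reversed(changes))

# The two replays disagree, which is why the changes cannot simply be applied in any order.
print(in_order == out_of_order)
```

Replaying the same events in a different order leaves a deleted record in the target, so the safe approach is to read and apply the changes sequentially, which is where the latency comes from.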
I don't see any AI capabilities in the product, and it is more of an ETL solution.
As the product has many problems, people are moving to bare metal and other cloud services. Our company has spent a lot of time investigating what AWS Glue does, including the time required to use it and to maintain the servers.
I need to spend on the product's maintenance along with the other activities for which I need to make payments to use the solution. Once you are able to predict the data volumes and the other factors involved over the cloud, it is possible to predict what your own servers will cost for the next five years and then get the servers at a very low price instead of depending on AWS.
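As a rough sketch of that comparison, the following assumes a fixed monthly cloud bill, a one-time server purchase, and a fixed monthly running cost for the servers. All of the figures and function names are hypothetical and only illustrate the five-year calculation, not our actual numbers.

```python
from typing import Optional


def cumulative_cost(upfront: float, monthly: float, months: int) -> float:
    """Total spend after a number of months: one-time cost plus recurring cost."""
    return upfront + monthly * months


def break_even_month(
    cloud_monthly: float,
    server_upfront: float,
    server_monthly: float,
    horizon_months: int = 60,  # five years
) -> Optional[int]:
    """First month in which owning servers costs no more than the cloud, or None."""
    for month in range(1, horizon_months + 1):
        own = cumulative_cost(server_upfront, server_monthly, month)
        cloud = cumulative_cost(0.0, cloud_monthly, month)
        if own <= cloud:
            return month
    return None


# Hypothetical figures, for illustration only.
print(break_even_month(cloud_monthly=5_000, server_upfront=60_000, server_monthly=2_000))
```

This only works when the data volumes are predictable. If the monthly cloud bill cannot be forecast, the break-even point cannot be forecast either.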
Though it is a good solution, it is not cost-effective.
I rate the tool a six out of ten.