AWS Glue vs Impala: What are the differences?
What is AWS Glue? A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
What is Impala? Real-time Query for Hadoop. Impala is a modern, open source, MPP SQL query engine for Apache Hadoop. Impala is shipped by Cloudera, MapR, and Amazon. With Impala, you can run SELECT, JOIN, and aggregate queries in real time against data stored in HDFS or Apache HBase.
AWS Glue and Impala both belong to the "Big Data Tools" category of the tech stack.
Some of the features offered by AWS Glue are:
- Easy - AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
- Integrated - AWS Glue is integrated across a wide range of AWS services.
- Serverless - AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
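As a rough illustration of the serverless model described above, the following sketch starts a Glue job run and polls until it finishes. It assumes a boto3 Glue client is passed in as `glue`. The job name and arguments are hypothetical, and the helper `run_glue_job` is not part of AWS Glue itself.

```python
import time

# States after which a Glue job run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue, job_name, arguments=None, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and wait until it reaches a terminal state.

    `glue` is expected to behave like boto3.client("glue"); `job_name` and
    `arguments` are placeholders for a job you have already defined.
    """
    response = glue.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = response["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)  # no cluster to manage; just wait for the run


# Possible usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   glue = boto3.client("glue")
#   run_glue_job(glue, "my-etl-job", {"--target_path": "s3://my-bucket/out/"})
```

Passing the client in as a parameter keeps the helper easy to exercise with a stand-in object before pointing it at a real account.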
On the other hand, Impala provides the following key features:
- Do BI-style Queries on Hadoop
- Unify Your Infrastructure
- Implement Quickly
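To make "BI-style queries" more concrete, here is a minimal sketch of a join-plus-aggregate query issued through a Python DB-API connection. It assumes a connection object such as the one returned by the impyla package's `impala.dbapi.connect`. The `orders` and `customers` tables, their columns, and the helper `revenue_by_region` are hypothetical.

```python
# A typical BI-style query: join two tables, then aggregate per group.
# Table and column names are illustrative only.
REVENUE_BY_REGION = """
SELECT c.region, COUNT(*) AS orders, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region
ORDER BY revenue DESC
"""


def revenue_by_region(conn):
    """Run the query on any DB-API connection and return all rows."""
    cursor = conn.cursor()
    try:
        cursor.execute(REVENUE_BY_REGION)
        return cursor.fetchall()
    finally:
        cursor.close()


# Possible usage against Impala (requires the impyla package; the host is
# hypothetical and 21050 is the port commonly used for Impala clients):
#   from impala.dbapi import connect
#   conn = connect(host="impala.example.com", port=21050)
#   for region, orders, revenue in revenue_by_region(conn):
#       print(region, orders, revenue)
```

Because the helper only relies on the standard DB-API cursor methods, the same function can be tried against a local database before being run on a cluster.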
Impala is an open source tool with 2.17K GitHub stars and 825 GitHub forks.
37 Signals, Stripe, and Expedia.com are some of the popular companies that use Impala, whereas AWS Glue is used by Auto Trader, Postmates, and SparkPost. Impala has slightly broader adoption, appearing in 15 company stacks and 5 developer stacks, compared with AWS Glue, which is listed in 12 company stacks and 7 developer stacks.