StackShare

Discover and share technology stacks from companies around the world.

© 2025 StackShare. All rights reserved.

AWS Glue vs Apache Parquet

Overview

Apache Parquet: 97 stacks, 190 followers, 0 votes
AWS Glue: 462 stacks, 819 followers, 9 votes

AWS Glue vs Apache Parquet: What are the differences?

Introduction

AWS Glue and Apache Parquet are both used in big data processing, but they are different kinds of technology. AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services, while Apache Parquet is an open-source columnar storage file format. The two are therefore complementary rather than interchangeable (a Glue job can read and write Parquet files), and the key differences listed here reflect that.

  1. Data Processing: AWS Glue is primarily used for ETL operations, allowing users to extract data from various sources, transform it according to their needs, and load it into a target destination. On the other hand, Apache Parquet is a file format optimized for columnar storage, making it suitable for analytical processing and fast query performance.

  2. Data Storage: AWS Glue does not provide storage of its own. Instead, it works with data held in stores such as Amazon S3, Amazon Redshift, and more. Apache Parquet, on the other hand, is a file format that stores data in a columnar layout, which typically compresses well and lets a query read only the columns it needs.

  3. Schema Evolution: AWS Glue helps users handle changes in data structures over time. For example, a Glue crawler can detect that a data source's schema has changed and update the table definition in the Data Catalog, although the transformation logic of a job may still need to be adjusted. Apache Parquet stores the schema inside each file. Additive changes such as new columns are commonly handled by readers that merge schemas across files, while renaming a column or changing its type generally requires rewriting data or other manual intervention.

  4. Compression: AWS Glue lets users choose a compression option for the output a job writes, which helps reduce storage costs. Apache Parquet, on the other hand, supports compression inside the file format itself, with codecs such as Snappy, Gzip, LZO, and Zstandard; because values from the same column are stored together, they tend to compress efficiently.

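  5. Data Partitioning: AWS Glue supports data partitioning, allowing users to write data in a partitioned manner based on specific columns and to register those partitions in the Data Catalog. This helps improve query performance by reducing the amount of data that needs to be scanned. Partitioning is not part of the Apache Parquet format itself: it is usually a directory convention (for example Hive-style paths such as year=2024/month=01) applied by the engine that writes the files. Within a single file, Parquet splits data into row groups, which readers can skip using column statistics.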

  6. Metadata Management: AWS Glue provides the Glue Data Catalog, a central metadata store. Crawlers can populate it automatically with table definitions, which makes data easier to discover and explore. Apache Parquet embeds metadata in each file (the schema and per-column statistics in the file footer), but it has no catalog of its own and relies on external tools, such as a Hive metastore or the Glue Data Catalog, to track tables across files.
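The partitioning idea from item 5 can be sketched in a few lines of plain Python. This is an illustration only, not AWS Glue or Parquet code: the bucket name, the column names, and the helper functions `partition_path`, `group_by_partition`, and `prune` are all hypothetical, and real engines write actual files rather than in-memory lists.

```python
from collections import defaultdict


def partition_path(root, row, partition_cols):
    """Build a Hive-style directory path such as root/year=2024/month=1."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([root, *parts])


def group_by_partition(root, rows, partition_cols):
    """Group rows by the directory they would be written to."""
    groups = defaultdict(list)
    for row in rows:
        # Partition values are carried by the path, so drop them from the payload.
        payload = {k: v for k, v in row.items() if k not in partition_cols}
        groups[partition_path(root, row, partition_cols)].append(payload)
    return dict(groups)


def prune(paths, column, value):
    """Keep only the directories whose path contains column=value."""
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.split("/")]


# Hypothetical sample data and bucket name, for illustration only.
rows = [
    {"year": 2024, "month": 1, "amount": 10.0},
    {"year": 2024, "month": 2, "amount": 12.5},
    {"year": 2025, "month": 1, "amount": 7.0},
]
layout = group_by_partition("s3://example-bucket/sales", rows, ["year", "month"])
paths_2024 = prune(sorted(layout), "year", 2024)
```

A query that filters on `year` only needs to open the directories returned by `prune`, which is why partitioning reduces the amount of data scanned.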

In summary, AWS Glue is a fully managed ETL service focused on data extraction, transformation, and loading, while Apache Parquet is an open-source columnar storage file format optimized for analytical processing. AWS Glue provides crawlers that track schema changes, compression options for job output, partition-aware writes, and a managed data catalog, whereas Apache Parquet offers efficient columnar storage, native compression, metadata embedded in each file, and schema evolution mainly for additive changes, with partitioning supplied by the directory layout that writers produce.
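To make the columnar-storage point concrete, here is a minimal sketch in plain Python that compares how many values a reader touches in a row layout versus a column layout. It does not use Parquet itself; the sample rows and the helper names `to_columns`, `values_touched_row_layout`, and `values_touched_column_layout` are assumptions made for illustration, and every row is assumed to have the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_touched_row_layout(rows):
    """A row-oriented scan passes over every field of every row."""
    return sum(len(row) for row in rows)


def values_touched_column_layout(columns, wanted):
    """A column-oriented scan reads only the requested columns."""
    return sum(len(columns[name]) for name in wanted)


# Hypothetical sample data, for illustration only.
rows = [
    {"id": 1, "country": "US", "amount": 10.0, "note": "a"},
    {"id": 2, "country": "DE", "amount": 12.5, "note": "b"},
    {"id": 3, "country": "US", "amount": 7.0, "note": "c"},
]
columns = to_columns(rows)
row_cost = values_touched_row_layout(rows)
column_cost = values_touched_column_layout(columns, ["amount"])
total_amount = sum(columns["amount"])
```

Summing `amount` touches 3 values in the column layout against 12 in the row layout, and the gap widens as tables gain more columns.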

Advice on Apache Parquet, AWS Glue

Vamshi

Data Engineer at Tata Consultancy Services

May 29, 2020

Needs advice on PySpark, Azure Data Factory, and Databricks

I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?

269k views

datocrats-org

Jul 29, 2020

Needs advice on Amazon EC2, Tableau, and PowerBI

We need to perform ETL from several databases into a data warehouse or data lake. We want to

  • keep raw and transformed data available to users to draft their own queries efficiently
  • give users the ability to give custom permissions and SSO
  • move between open-source on-premises development and cloud-based production environments

We want to use only inexpensive Amazon EC2 instances on a medium-sized data set (16 GB to 32 GB), feeding into Tableau Server or PowerBI for reporting and data analysis purposes.

319k views

Pavithra

Mar 12, 2020

Needs advice on Amazon S3, Amazon Athena, and Amazon Redshift

Hi all,

Currently, we need to ingest the data from Amazon S3 to DB either Amazon Athena or Amazon Redshift. But the problem with the data is, it is in .PSV (pipe separated values) format and the size is also above 200 GB. The query performance of the timeout in Athena/Redshift is not up to the mark, too slow while compared to Google BigQuery. How would I optimize the performance and query result time? Can anyone please help me out?

522k views

Detailed Comparison

Apache Parquet

It is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Features:

  • Columnar storage format
  • Type-specific encoding
  • Pig integration
  • Cascading integration
  • Crunch integration
  • Apache Arrow integration
  • Apache Scrooge integration
  • Adaptive dictionary encoding
  • Predicate pushdown
  • Column stats

AWS Glue

A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

Features:

  • Easy - AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
  • Integrated - AWS Glue is integrated across a wide range of AWS services.
  • Serverless - AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
  • Developer Friendly - AWS Glue generates ETL code that is customizable, reusable, and portable, using familiar technology - Scala, Python, and Apache Spark. You can also import custom readers, writers and transformations into your Glue ETL code. Since the code AWS Glue generates is based on open frameworks, there is no lock-in. You can use it anywhere.
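Three of the Apache Parquet features named in this comparison (dictionary encoding, column stats, and predicate pushdown) can be illustrated with a small plain-Python sketch. This is a simplified model under stated assumptions, not Parquet's actual on-disk encoding: the function names `dictionary_encode`, `chunk_stats`, and `scan_greater_than` are hypothetical, and the fixed-size chunks stand in loosely for Parquet row groups.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup table."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def chunk_stats(values, chunk_size):
    """Split a column into chunks and record min/max statistics for each."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        chunks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return chunks


def scan_greater_than(chunks, threshold):
    """Return values above threshold, skipping chunks whose max rules them out."""
    matches, skipped = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            skipped += 1  # statistics alone prove no value here can match
            continue
        matches.extend(v for v in chunk["values"] if v > threshold)
    return matches, skipped


# Hypothetical sample columns, for illustration only.
dictionary, codes = dictionary_encode(["US", "DE", "US", "US", "DE", "FR"])
chunks = chunk_stats([1, 3, 2, 8, 9, 7, 4, 5, 6], chunk_size=3)
matches, skipped = scan_greater_than(chunks, 6)
```

In this model the filter `value > 6` reads only one of the three chunks, because the recorded maximum of the other two already rules them out.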
Statistics

Apache Parquet: 97 stacks, 190 followers, 0 votes
AWS Glue: 462 stacks, 819 followers, 9 votes

Pros & Cons

Apache Parquet: No community feedback yet

AWS Glue pros:

  • Managed Hive Metastore (9 votes)

Integrations

Apache Parquet:

  • Hadoop
  • Java
  • Apache Impala
  • Apache Thrift
  • Apache Hive
  • Pig

AWS Glue:

  • Amazon Redshift
  • Amazon S3
  • Amazon RDS
  • Amazon Athena
  • MySQL
  • Microsoft SQL Server
  • Amazon EMR
  • Amazon Aurora
  • Oracle
  • Amazon RDS for PostgreSQL

What are some alternatives to Apache Parquet, AWS Glue?

MongoDB

MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.

MySQL

The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software.

PostgreSQL

PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions.

Microsoft SQL Server

Microsoft® SQL Server is a database management and analysis system for e-commerce, line-of-business, and data warehousing solutions.

SQLite

SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file.

Cassandra

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.

Memcached

Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.

MariaDB

Started by core members of the original MySQL team, MariaDB actively works with outside developers to deliver the most featureful, stable, and sanely licensed open SQL server in the industry. MariaDB is designed as a drop-in replacement of MySQL(R) with more features, new storage engines, fewer bugs, and better performance.

RethinkDB

RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to set up and learn.

ArangoDB

A distributed free and open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.
