AWS Glue vs Apache Parquet

Overview

Apache Parquet

Stacks: 98

Followers: 190

Votes: 0

AWS Glue

Stacks: 463

Followers: 819

Votes: 9

AWS Glue vs Apache Parquet: What are the differences?

Introduction

AWS Glue and Apache Parquet both appear in big data pipelines, but they are different kinds of technology. AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services, whereas Apache Parquet is an open-source columnar storage file format. They are often used together, for example when a Glue job writes its output as Parquet files. The key differences are:

Data Processing: AWS Glue is primarily used for ETL operations, allowing users to extract data from various sources, transform it according to their needs, and load it into a target destination. On the other hand, Apache Parquet is a file format optimized for columnar storage, making it suitable for analytical processing and fast query performance.
Data Storage: AWS Glue is not a storage system. Its jobs read from and write to external data stores such as Amazon S3, Amazon Redshift, and JDBC-accessible databases. Apache Parquet, on the other hand, is a file format that stores data in a columnar layout, which typically yields high compression ratios and lets query engines read only the columns a query needs.
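Schema Evolution: AWS Glue crawlers can detect schema changes in a data source and update the table definitions in the Glue Data Catalog. ETL scripts are not rewritten automatically, so transformation logic that depends on a changed column may still need manual updates. Apache Parquet stores the schema in each file and tolerates additive changes such as new columns, but reconciling files written with different schemas is left to the reading engine (for example, schema merging in Apache Spark), and changes such as renaming a column or changing its type generally require rewriting the data.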
Compression: AWS Glue has no compression scheme of its own. Its jobs can read and write compressed data and select the codec used by the output format, which helps reduce storage costs. Apache Parquet, on the other hand, compresses each column's data separately and supports codecs such as Snappy, Gzip, LZO, and Zstandard, alongside encodings such as dictionary and run-length encoding.
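Data Partitioning: AWS Glue jobs can write output into partition directories keyed by specific columns, and the Glue Data Catalog tracks those partitions. This improves query performance by reducing the amount of data that must be scanned. The Parquet format itself does not define partitioning across files. Partitioned Parquet datasets are produced by the writing engine, usually as Hive-style directories (for example, year=2020/month=05), while inside each file the data is split into row groups that readers can skip using column statistics.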
Metadata Management: AWS Glue provides the Glue Data Catalog, a central metadata store that crawlers and jobs populate with table definitions, which makes data easy to discover and explore. Apache Parquet files are self-describing: each file's footer holds its schema and per-column statistics. Parquet has no catalog spanning multiple files or datasets, however, and relies on external tools such as the Hive Metastore or the Glue Data Catalog for that.
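
The partitioning point above can be sketched in a few lines of plain Python. This is a toy illustration only, not how AWS Glue or any Parquet writer is implemented: the bucket name, the column names, and the helper functions `partition_path`, `write_partitioned`, and `prune` are all hypothetical, and an in-memory dictionary stands in for files on S3.

```python
from collections import defaultdict


def partition_path(base, row, partition_cols):
    """Build a Hive-style directory path such as base/year=2020/month=5."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base.rstrip("/")] + parts)


def write_partitioned(rows, base, partition_cols):
    """Group rows by partition directory (a stand-in for writing one file per directory)."""
    layout = defaultdict(list)
    for row in rows:
        # Partition values are encoded in the path, so they are dropped from the stored row.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        layout[partition_path(base, row, partition_cols)].append(data)
    return dict(layout)


def prune(layout, **wanted):
    """Keep only the directories whose path matches every requested partition value."""
    keep = {}
    for path, data in layout.items():
        segments = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(segments.get(col) == str(val) for col, val in wanted.items()):
            keep[path] = data
    return keep


# Hypothetical sample data and bucket name.
rows = [
    {"year": 2020, "month": 3, "user": "a", "amount": 10},
    {"year": 2020, "month": 5, "user": "b", "amount": 20},
    {"year": 2021, "month": 5, "user": "c", "amount": 30},
]
layout = write_partitioned(rows, "s3://example-bucket/sales", ["year", "month"])
selected = prune(layout, year=2020)  # only the two year=2020 directories remain
```

A query that filters on year only needs to open the directories returned by `prune`, which is the effect partition pruning has in engines that understand this directory layout.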

In summary, AWS Glue is a fully managed ETL service focused on data extraction, transformation, and loading, while Apache Parquet is an open-source columnar storage file format optimized for analytical processing. AWS Glue contributes job execution, crawlers, and a data catalog, whereas Apache Parquet contributes efficient columnar storage, per-column compression, and an embedded schema with column statistics. The two are complementary rather than competing: a common pattern is a Glue job that writes partitioned Parquet files to Amazon S3.
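
To make the columnar storage idea concrete, here is a minimal sketch in plain Python. It is not the Parquet file format: it only shows why keeping each column contiguous lets a reader fetch one column without touching the others. The sample rows and the helpers `to_columns` and `read_columns` are hypothetical, and the sketch assumes every row has the same keys.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def read_columns(columns, wanted):
    """Return only the requested columns; the other columns are never accessed."""
    return {name: columns[name] for name in wanted}


# Hypothetical sample data.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.5},
    {"id": 3, "country": "US", "amount": 7.25},
]
columns = to_columns(rows)

# An aggregate over one column reads just that column's values.
total = sum(read_columns(columns, ["amount"])["amount"])
```

With a row-oriented layout, computing `total` would require reading every field of every row. With the column layout, only the `amount` list is read.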


Advice on Apache Parquet, AWS Glue

Vamshi

Data Engineer at Tata Consultancy Services

May 29, 2020

Needs advice on

PySpark

Azure Data Factory

Databricks

I have to collect different data from multiple sources and store them in a single cloud location. Then perform cleaning and transforming using PySpark, and push the end results to other applications like reporting tools, etc. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?

269k views

datocrats-org

Jul 29, 2020

Needs advice on

Amazon EC2

Tableau

PowerBI

We need to perform ETL from several databases into a data warehouse or data lake. We want to

keep raw and transformed data available to users to draft their own queries efficiently
give users the ability to give custom permissions and SSO
move between open-source on-premises development and cloud-based production environments

We want to use only inexpensive Amazon EC2 instances, on a medium-sized data set of 16 GB to 32 GB, feeding into Tableau Server or PowerBI for reporting and data analysis purposes.

319k views

Pavithra

Mar 12, 2020

Needs advice on

Amazon S3

Amazon Athena

Amazon Redshift

Hi all,

Currently, we need to ingest data from Amazon S3 into a database, either Amazon Athena or Amazon Redshift. The problem is that the data is in .PSV (pipe-separated values) format and is more than 200 GB in size. Query performance in Athena/Redshift is not up to the mark: queries time out or run too slowly compared to Google BigQuery. How can I optimize the performance and query result time? Can anyone please help me out?

522k views

Detailed Comparison

Apache Parquet	AWS Glue
It is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.	A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
Columnar storage format; Type-specific encoding; Pig integration; Cascading integration; Crunch integration; Apache Arrow integration; Scrooge integration; Adaptive dictionary encoding; Predicate pushdown; Column stats	Easy - AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.; Integrated - AWS Glue is integrated across a wide range of AWS services.; Serverless - AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.; Developer Friendly - AWS Glue generates ETL code that is customizable, reusable, and portable, using familiar technology - Scala, Python, and Apache Spark. You can also import custom readers, writers and transformations into your Glue ETL code. Since the code AWS Glue generates is based on open frameworks, there is no lock-in. You can use it anywhere.
Statistics
Stacks 98	Stacks 463
Followers 190	Followers 819
Votes 0	Votes 9
Pros & Cons
No community feedback yet	Pros 10 Managed Hive Metastore
Integrations
Hadoop, Java, Apache Impala, Apache Thrift, Apache Hive, Pig	Amazon Redshift, Amazon S3, Amazon RDS, Amazon Athena, MySQL, Microsoft SQL Server, Amazon EMR, Amazon Aurora, Oracle, Amazon RDS for PostgreSQL
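
Three of the Parquet features listed in the table (dictionary encoding, column stats, and predicate pushdown) can be illustrated with a small plain-Python sketch. This is a simplified model under stated assumptions, not Parquet's actual encoding or reader logic: the function names, the chunk size, and the sample values are hypothetical.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup table."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original values from the codes."""
    return [dictionary[c] for c in codes]


def build_row_groups(values, size):
    """Split a column into chunks and record min/max statistics for each chunk."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return the matching values and the number of chunks that had to be read."""
    matches, chunks_read = [], 0
    for g in groups:
        if g["max"] <= threshold:
            continue  # the statistics prove nothing here can match, so skip the chunk
        chunks_read += 1
        matches.extend(v for v in g["values"] if v > threshold)
    return matches, chunks_read


# Hypothetical sample columns.
countries = ["US", "US", "DE", "US", "DE", "IN"]
dictionary, codes = dictionary_encode(countries)

amounts = [5, 7, 9, 12, 15, 18, 40, 42, 47]
groups = build_row_groups(amounts, size=3)
matches, chunks_read = scan_greater_than(groups, 30)
```

In this sketch the filter `amount > 30` reads one chunk out of three, because the recorded maximums of the first two chunks rule them out. Skipping depends on how the data is ordered: if large values were spread across every chunk, no chunk could be skipped.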

What are some alternatives to Apache Parquet, AWS Glue?

MongoDB

MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.

MySQL

The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software.

PostgreSQL

PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions.

Microsoft SQL Server

Microsoft® SQL Server is a database management and analysis system for e-commerce, line-of-business, and data warehousing solutions.

SQLite

SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file.

Cassandra

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.

Memcached

Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.

MariaDB

Started by core members of the original MySQL team, MariaDB actively works with outside developers to deliver the most featureful, stable, and sanely licensed open SQL server in the industry. MariaDB is designed as a drop-in replacement of MySQL(R) with more features, new storage engines, fewer bugs, and better performance.

RethinkDB

RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to setup and learn.

ArangoDB

A distributed free and open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.
