Apache Parquet vs Oracle

Overview

Oracle: Stacks 2.6K, Followers 1.8K, Votes 113

Apache Parquet: Stacks 99, Followers 190, Votes 0

Apache Parquet vs Oracle: What are the differences?

Introduction

This comparison outlines the key differences between Apache Parquet and Oracle. Apache Parquet is a columnar storage file format designed to work with big data processing frameworks such as Apache Hadoop and Apache Spark. Oracle, on the other hand, is a widely used relational database management system.

Data Organization: Apache Parquet organizes data in a columnar format, storing the values of each column together. This lets compression and encoding work on runs of similar values, and lets a query read only the columns it needs, which reduces I/O. In contrast, Oracle organizes table data in a row-based format by default, storing all the values of a row together. This suits transactional processing and OLTP workloads, where whole rows are read and written at once.
Data Compression: Apache Parquet supports several compression codecs, including Snappy, Gzip, and LZO, and the codec can be chosen to match the data and the workload. This reduces storage space and can improve query performance. Oracle also supports table compression, including basic table compression, Advanced Row Compression, and Hybrid Columnar Compression, though some of these are separately licensed or tied to specific Oracle storage.
Schema Evolution: Apache Parquet allows for schema evolution: each file carries its own schema, so newer files can add columns without rewriting older files. This provides flexibility in handling evolving data structures. Oracle has a more rigid schema management approach, where a table has a single schema and any change requires altering the table structure with DDL such as ALTER TABLE, potentially impacting the existing data.
Query Performance: Parquet is a file format rather than a query engine, so performance depends on the engine that reads it. Thanks to its columnar layout and compression, engines such as Apache Spark can typically scan large Parquet datasets faster when a query touches only a few columns. Oracle, being a traditional RDBMS, may be slower than Parquet-based processing on big data workloads, especially in analytical processing scenarios.
Data Types: Apache Parquet supports primitive types as well as complex and nested types such as lists, maps, and structs. This allows for storing and processing diverse, semi-structured data. Oracle also supports a wide range of data types, but they are oriented toward the relational model.
Ecosystem Integration: Apache Parquet is well integrated with big data processing frameworks like Apache Hadoop and Apache Spark. It is a widely adopted format in the Hadoop ecosystem, which makes it easy to fit into existing data processing workflows. Oracle, being a standalone database system, may require additional configuration or connectors to integrate with big data frameworks.
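
The data organization and encoding points above can be illustrated with a small, pure-Python sketch. It is a simplified model, not how Parquet or Oracle are actually implemented: the sample records, field names, and the "values read" counter are all assumptions made for illustration. It pivots rows into columns, compares how many values a single-column aggregate has to read in each layout, and dictionary-encodes a repetitive column.

```python
from typing import Any

# Hypothetical sample records; field names and values are illustrative only.
rows = [
    {"order_id": 1, "country": "US", "amount": 20.0},
    {"order_id": 2, "country": "DE", "amount": 35.5},
    {"order_id": 3, "country": "US", "amount": 12.25},
    {"order_id": 4, "country": "US", "amount": 8.0},
]


def to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into one list of values per column."""
    columns: dict[str, list[Any]] = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


def sum_amount_from_rows(records: list[dict[str, Any]]) -> tuple[float, int]:
    """Row layout: model a scan that reads every field of every row."""
    values_read = sum(len(record) for record in records)
    total = sum(record["amount"] for record in records)
    return total, values_read


def sum_amount_from_columns(columns: dict[str, list[Any]]) -> tuple[float, int]:
    """Column layout: only the `amount` column has to be read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


def dictionary_encode(values: list[Any]) -> tuple[list[Any], list[int]]:
    """Replace repeated values with small integer codes plus a lookup table."""
    dictionary: list[Any] = []
    code_of: dict[Any, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


columns = to_columns(rows)
row_total, row_values_read = sum_amount_from_rows(rows)
col_total, col_values_read = sum_amount_from_columns(columns)
dictionary, codes = dictionary_encode(columns["country"])

print(row_total, row_values_read)  # 75.75 12
print(col_total, col_values_read)  # 75.75 4
print(dictionary, codes)           # ['US', 'DE'] [0, 1, 0, 0]
```

Both layouts give the same total, but the columnar scan reads a third as many values in this toy case. The gap grows with the number of columns a query does not need.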

In summary, Apache Parquet provides efficient columnar storage, flexible schema evolution, better query performance for analytical big data workloads, and seamless integration with big data processing frameworks. Oracle, on the other hand, offers traditional row-based storage suited to transactional workloads, its own compression options, and stricter schema management, and it may require additional setup for integration with big data ecosystems.
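
The schema evolution point can be sketched as follows. This is a minimal pure-Python model under stated assumptions: the two dictionaries stand in for files that each carry their own schema, and the helper names `merge_schemas` and `read_with_merged_schema` are hypothetical. Whether and how a real reader merges schemas depends on the engine being used.

```python
from typing import Any

# Two hypothetical "files". Each one stores its own schema next to its data.
# The second file was written after an `email` column was added.
old_file = {
    "schema": ["id", "name"],
    "columns": {"id": [1, 2], "name": ["Ada", "Grace"]},
}
new_file = {
    "schema": ["id", "name", "email"],
    "columns": {"id": [3], "name": ["Edsger"], "email": ["edsger@example.com"]},
}


def merge_schemas(files: list[dict[str, Any]]) -> list[str]:
    """Return the union of column names, in first-seen order."""
    merged: list[str] = []
    for file in files:
        for name in file["schema"]:
            if name not in merged:
                merged.append(name)
    return merged


def read_with_merged_schema(files: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Read all files, filling columns a file does not have with None."""
    schema = merge_schemas(files)
    result: dict[str, list[Any]] = {name: [] for name in schema}
    for file in files:
        # All columns in a file have the same length; use the first one.
        row_count = len(file["columns"][file["schema"][0]])
        for name in schema:
            values = file["columns"].get(name, [None] * row_count)
            result[name].extend(values)
    return result


table = read_with_merged_schema([old_file, new_file])
print(table["id"])     # [1, 2, 3]
print(table["email"])  # [None, None, 'edsger@example.com']
```

The old file is never rewritten; rows from it simply show no value for the newer column. In a relational table, by contrast, adding the column is a single DDL change that applies to every row at once.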


Advice on Oracle, Apache Parquet

Daniel

Data Engineer at Dimensigon

Jul 18, 2020

Decided

We have chosen Tibero over Oracle because we want to offer a PL/SQL-as-a-Service that users can deploy in any cloud from our website, at a standard cost and without concerns. With Oracle Database, developers would have to worry about what they implement and the related cost of each feature, but Tibero's licensing model is a single price with all features included, so neither we nor the developers using our SQLaaS have to worry. PostgreSQL would be the open source option, but we need to offer an SQLaaS with encryption and more enterprise features in the background, and the best-value option we found was Tibero Database for PL/SQL-based applications.


Abigail

Dec 6, 2019

Decided

In the field of bioinformatics, we regularly work with hierarchical and unstructured document data. Unstructured text data from PDFs, image data from radiographs, phylogenetic trees and cladograms, network graphs, streaming ECG data... none of it fits into a traditional SQL database particularly well. As such, we prefer to use document oriented databases.

MongoDB is probably the oldest component in our stack besides JavaScript, having been in it for over 5 years. At the time, we were looking for a technology that could simply cache our data visualization state (stored in JSON) in a database as-is, without any destructive normalization. MongoDB was the perfect tool, and it has been exceeding expectations ever since.

Trivia fact: some of the earliest electronic medical records (EMRs) used a document oriented database called MUMPS as early as the 1960s, prior to the invention of SQL. MUMPS is still in use today in systems like Epic and VistA, and stores upwards of 40% of all medical records at hospitals. So, we saw MongoDB as something of a 21st-century version of the MUMPS database.


Abigail

Dec 10, 2019

Decided

We wanted a JSON datastore that could save the state of our bioinformatics visualizations without destructive normalization. As a leading NoSQL data storage technology, MongoDB has been a perfect fit for our needs. Plus, it's open source and has an enterprise SLA scale-out path, with support for hosted solutions like Atlas. Mongo has been an absolute champ. So much so that SQL and Oracle have begun shipping JSON column types as a new feature for their databases. And when Fast Healthcare Interoperability Resources (FHIR) announced support for JSON, we basically had our FHIR datalake technology.


Detailed Comparison

	Oracle	Apache Parquet
Description	Oracle Database is an RDBMS. An RDBMS that implements object-oriented features such as user-defined types, inheritance, and polymorphism is called an object-relational database management system (ORDBMS). Oracle Database has extended the relational model to an object-relational model, making it possible to store complex business models in a relational database.	It is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
Features	-	Columnar storage format; Type-specific encoding; Pig integration; Cascading integration; Crunch integration; Apache Arrow integration; Apache Scrooge integration; Adaptive dictionary encoding; Predicate pushdown; Column stats
Stacks	2.6K	99
Followers	1.8K	190
Votes	113	0
Pros	Reliable (44); Enterprise (33); High Availability (15); Hard to maintain (5); Expensive (5)	No community feedback yet
Cons	Expensive (14)	No community feedback yet
Integrations	No integrations available	Hadoop; Java; Apache Impala; Apache Thrift; Apache Hive; Pig
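
Two of the Parquet features named in the comparison, predicate pushdown and column stats, work together: a reader can use per-chunk minimum and maximum values to skip data that cannot match a filter. The sketch below models that idea in pure Python. The `RowGroup` class, the helper names, and the sample values are hypothetical, and the real format tracks more than min and max.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    """A chunk of one integer column plus statistics recorded at write time."""
    values: tuple[int, ...]
    min_value: int
    max_value: int


def make_row_group(values: list[int]) -> RowGroup:
    """Compute the min/max statistics once, when the chunk is written."""
    return RowGroup(tuple(values), min(values), max(values))


def scan_greater_than(groups: list[RowGroup], threshold: int) -> tuple[list[int], int]:
    """Return values > threshold and the number of row groups actually read."""
    matches: list[int] = []
    groups_read = 0
    for group in groups:
        if group.max_value <= threshold:
            # The statistics prove no value here can match, so skip the chunk.
            continue
        groups_read += 1
        matches.extend(value for value in group.values if value > threshold)
    return matches, groups_read


groups = [
    make_row_group([1, 5, 9]),
    make_row_group([10, 14, 18]),
    make_row_group([20, 25, 30]),
]
matches, groups_read = scan_greater_than(groups, 15)
print(matches)      # [18, 20, 25, 30]
print(groups_read)  # 2
```

The first row group is skipped without looking at its values, because its recorded maximum of 9 cannot satisfy the filter. Skipping works best when the data is sorted or clustered on the filtered column, so that the value ranges of different chunks overlap as little as possible.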

What are some alternatives to Oracle, Apache Parquet?

MongoDB

MongoDB stores data in JSON-like documents that can vary in structure, offering a dynamic, flexible schema. MongoDB was also designed for high availability and scalability, with built-in replication and auto-sharding.

MySQL

The MySQL software delivers a very fast, multi-threaded, multi-user, and robust SQL (Structured Query Language) database server. MySQL Server is intended for mission-critical, heavy-load production systems as well as for embedding into mass-deployed software.

PostgreSQL

PostgreSQL is an advanced object-relational database management system that supports an extended subset of the SQL standard, including transactions, foreign keys, subqueries, triggers, user-defined types and functions.

Microsoft SQL Server

Microsoft® SQL Server is a database management and analysis system for e-commerce, line-of-business, and data warehousing solutions.

SQLite

SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file.

Cassandra

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.

Memcached

Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.

MariaDB

Started by core members of the original MySQL team, MariaDB actively works with outside developers to deliver the most featureful, stable, and sanely licensed open SQL server in the industry. MariaDB is designed as a drop-in replacement of MySQL(R) with more features, new storage engines, fewer bugs, and better performance.

RethinkDB

RethinkDB is built to store JSON documents and to scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and it is easy to set up and learn.

ArangoDB

A distributed free and open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.
