Apache Parquet vs JSON: What are the differences?

Introduction

Apache Parquet and JSON are both file formats used for storing and exchanging data. However, there are several key differences between the two that make them suitable for different use cases. In the following paragraphs, we will explore these differences in detail.

Schema-based vs. Schema-less: One of the major differences between Apache Parquet and JSON is their approach to data schema. Parquet is a schema-based file format: the schema is defined when the file is written and is stored in the file's metadata, so every row conforms to the same structure. JSON, on the other hand, is schema-less: each document describes itself, which allows more flexibility because data can be stored without a predefined schema (a schema can still be enforced separately, for example with JSON Schema).
Compression: Another key difference between Parquet and JSON is the way they handle data compression. Parquet stores data column by column and encodes and compresses each column chunk independently, for example with dictionary or run-length encoding plus a codec such as Snappy, gzip, or Zstandard. Because values within a column tend to be similar, this gives high compression ratios, and queries that need only a subset of columns read less data. JSON does not provide built-in compression, and its text representation repeats every key in every record, leading to larger file sizes unless the file is compressed externally (for example with gzip).
Data Types: Both formats can represent nested data: JSON has objects and arrays, and Parquet supports lists, maps, and nested structures. The difference lies in the scalar types. Parquet has typed primitives (booleans, 32-bit and 64-bit integers, floats, doubles, byte arrays) plus logical types such as decimals, dates, and timestamps. JSON has only strings, numbers, booleans, and null, so values such as dates or binary data have to be encoded as strings by convention.
Query Performance: Due to its columnar storage and compression techniques, Parquet generally offers better query performance than JSON for analytical workloads. Parquet allows efficient column pruning, where only the required columns are read during query execution, and it stores min/max statistics per column chunk so that readers can skip data that cannot match a filter. A JSON reader, on the other hand, generally has to parse the whole document to retrieve specific fields, which results in slower queries on large files.
Serialization: Apache Parquet uses a binary format for serialization, which provides a compact representation of data and makes it suitable for use in distributed systems. JSON, being a text-based format, has a larger footprint and must be parsed from and formatted to text on every read and write.
Tooling Support: Parquet has extensive tooling support in the Apache Hadoop ecosystem, making it easy to integrate with big data processing frameworks like Apache Spark and Apache Hive. JSON, being a simple and widely adopted format, has good tooling support across virtually all programming languages and platforms.
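
The compression and query-performance points above can be illustrated with a small sketch. This is not the Parquet format itself: it uses only Python's standard library (json and zlib) to mimic a row-oriented layout versus a column-oriented one, and the sample records and the read_column helper are made up for illustration.

```python
import json
import zlib

# Hypothetical sample data: 1,000 records with three fields.
records = [
    {"id": i, "country": "US" if i % 2 else "CA", "score": i % 10}
    for i in range(1000)
]

# Row-oriented layout (like a JSON file): every record repeats every key.
row_bytes = json.dumps(records).encode("utf-8")

# Column-oriented layout (the idea behind Parquet): one array per field,
# each serialized and compressed on its own.
columns = {key: [r[key] for r in records] for key in records[0]}
column_blobs = {
    key: zlib.compress(json.dumps(values).encode("utf-8"))
    for key, values in columns.items()
}


def read_column(name):
    """Column pruning: decompress and parse only the requested column."""
    return json.loads(zlib.decompress(column_blobs[name]))


scores = read_column("score")

print("row-oriented, uncompressed:", len(row_bytes), "bytes")
print("row-oriented, compressed:", len(zlib.compress(row_bytes)), "bytes")
print("column-oriented, compressed:", sum(len(b) for b in column_blobs.values()), "bytes")
print("bytes touched to read 'score':", len(column_blobs["score"]))
```

Reading the score column touches only that column's compressed bytes, whereas the row-oriented version has to be parsed in full to extract the same values. Real Parquet files add typed encodings, row groups, and footer metadata on top of this basic idea.
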

In summary, Apache Parquet and JSON differ in their approach to data schema, compression techniques, supported data types, query performance, serialization format, and tooling support. Choosing between the two formats depends on the specific requirements of the use case, with Parquet providing better performance and efficiency for structured data, while JSON offers flexibility and simplicity for schema-less data storage.
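
As a small illustration of the data-type difference, the sketch below shows what happens when Python's built-in json module meets values that JSON has no native type for. The record contents and the encode_extra helper are hypothetical; the string encodings chosen (ISO dates, hex for bytes) are just one possible convention.

```python
import json
from datetime import date

# Hypothetical record mixing JSON-friendly and JSON-unfriendly values.
record = {
    "count": 3,
    "ratio": 0.5,
    "day": date(2020, 6, 16),
    "payload": b"\x00\x01",
}

# json.dumps rejects values that JSON cannot represent natively.
try:
    json.dumps(record)
    rejected = False
except TypeError:
    rejected = True


def encode_extra(value):
    """Fallback encoder: store dates and bytes as strings by convention."""
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, bytes):
        return value.hex()
    raise TypeError(f"cannot encode {type(value).__name__}")


text = json.dumps(record, default=encode_extra)
decoded = json.loads(text)

print("rejected without a fallback:", rejected)
print(decoded)
```

After the round trip, the date and the binary payload come back as plain strings, so the reader has to know the convention to recover the original types. A typed format such as Parquet records that information in the schema instead.
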

Advice on Apache Parquet and JSON

Dhinesh Ram

architect · Jun 16, 2020 | 7 upvotes · 311.1K views

Needs advice on JSON and Python

Hi. Currently, I have a requirement where I have to create a new JSON file based on an input CSV file, validate the generated JSON file, and upload it into the application (which runs in AWS) using an API. Kindly suggest the best language to meet this requirement. I feel Python would be better, but I am not sure how to justify choosing Python. Can you provide your views on this?

Replies (3)

Nick Butlin

Jul 10, 2020 | 3 upvotes · 293.4K views

Recommends Python

Python is very flexible and definitely up to the job (although, in reality, any language will be able to cope with this task!). Python has some good libraries built in, and also some third-party libraries that will help here. The task has three steps:

1. Convert CSV -> JSON
2. Validate against a schema
3. Deploy to AWS

The builtins include json and csv libraries, and, depending on the complexity of the csv file, it is fairly simple to convert:

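import csv
import json

# newline="" lets the csv module handle line endings itself
with open("your_input.csv", "r", newline="") as f:
    # DictReader uses the header row as keys; each data row becomes a dict
    rows = list(csv.DictReader(f))

# Write every row out as a JSON array of objects
with open("your_output.json", "w") as f:
    json.dump(rows, f, indent=2)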

The validation part is handled nicely by the jsonschema library (https://pypi.org/project/jsonschema/). It allows you to create a schema and check whether what you have generated conforms to it. It is based on the JSON Schema standard, which allows annotation and validation of any JSON document.
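
To give a feel for the validation step without installing anything, here is a dependency-free stand-in written with the standard library only. It is not jsonschema and is much less capable; the field names, the REQUIRED_FIELDS rules, and the validate_record helper are all hypothetical.

```python
import json

# Hypothetical rules: required field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "email": str, "age": int}


def validate_record(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field} should be {expected_type.__name__}, got {actual}")
    return problems


good = json.loads('{"name": "Ada", "email": "ada@example.com", "age": 36}')
bad = json.loads('{"name": "Ada", "age": "36"}')

print(validate_record(good))
print(validate_record(bad))
```

Note that csv.DictReader yields every value as a string, so numeric columns need converting (for example with int()) before a type check like this one will pass. For anything beyond a few fields, a real JSON Schema validator is the better choice.
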
Python also has an AWS library (Boto3, the AWS SDK for Python) to automate the upload, or in fact do pretty much anything with AWS, from within your codebase: https://aws.amazon.com/sdk-for-python/ This will handle authentication to AWS and uploading or deploying the file to wherever it needs to go.
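
The question mentions uploading through the application's own API rather than naming an AWS service, so as an alternative to the AWS SDK, the sketch below prepares a plain HTTP POST with the standard library's urllib. The endpoint URL, the token, and the bearer-token header are assumptions for illustration; the real API may expect something different.

```python
import json
import urllib.request

# Hypothetical endpoint and token; replace with the application's real values.
API_URL = "https://example.com/api/upload"
API_TOKEN = "replace-me"


def build_upload_request(payload, url=API_URL, token=API_TOKEN):
    """Build (but do not send) an HTTP POST carrying the payload as JSON."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API uses bearer-token authentication.
            "Authorization": f"Bearer {token}",
        },
    )


request = build_upload_request([{"name": "Ada", "age": 36}])
print(request.get_method(), request.full_url)

# Sending is left commented out because the endpoint above is a placeholder:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping request construction separate from sending makes the payload and headers easy to inspect before anything touches the network.
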

A lot depends on the last two pieces, but the converting itself is really pretty neat.

Max Musing

Founder & CEO at BaseDash · Jul 9, 2020 | 1 upvotes · 291.2K views

Recommends Node.js

This should be pretty doable in any language. Go with whatever you're most familiar with.

That being said, there's a case to be made for using Node.js since it's trivial to convert an object to JSON and vice versa.

Doug Schwartz

Jul 10, 2020 | 1 upvotes · 291.2K views

Recommends Golang

I would use Go. Since CSV files are flat (no hierarchy), you could use the encoding/csv package to read each row, and write out the values as JSON. See https://medium.com/@ankurraina/reading-a-simple-csv-in-go-36d7a269cecd. You just have to figure out in advance what the key is for each row.
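
The row-by-row idea described above is not specific to Go. To stay consistent with the Python snippet earlier in the thread, here is the same approach sketched in Python: the keys are fixed in advance and each CSV row is written out as one JSON object. The FIELD_NAMES list, the rows_to_json_lines helper, and the sample data are hypothetical.

```python
import csv
import io
import json

# Hypothetical keys chosen in advance, one per CSV column.
FIELD_NAMES = ["name", "email", "age"]


def rows_to_json_lines(csv_file, field_names=FIELD_NAMES):
    """Yield one JSON object (as text) per CSV row, one row at a time."""
    for row in csv.reader(csv_file):
        yield json.dumps(dict(zip(field_names, row)))


# In-memory stand-in for a headerless CSV file.
sample = io.StringIO("Ada,ada@example.com,36\nAlan,alan@example.com,41\n")
for line in rows_to_json_lines(sample):
    print(line)
```

Because rows are processed one at a time, this shape also works for files too large to load into memory at once.
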

Pros of Apache Parquet

None listed yet.

Pros of JSON

Simple (5 upvotes)
Widely supported (4 upvotes)

What is Apache Parquet?

It is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

What is JSON?

JavaScript Object Notation is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language.

What are some alternatives to Apache Parquet and JSON?

Avro

It is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.

Apache Kudu

A new addition to the open source Apache Hadoop ecosystem, Kudu completes Hadoop's storage layer to enable fast analytics on fast data.

Cassandra

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added to and removed from the cluster. Row store means that, like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.

HBase

Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop.

JavaScript

JavaScript is best known as the scripting language for Web pages, but it is also used in many non-browser environments such as Node.js or Apache CouchDB. It is a prototype-based, multi-paradigm scripting language that is dynamic, and supports object-oriented, imperative, and functional programming styles.
