Decision at Stitch Fix about Amazon EC2 Container Service, Elasticsearch, Amazon S3

psunnn, Software Engineer at Stitch Fix
Tools: Amazon EC2 Container Service · Amazon S3

To load data from our Amazon S3 data warehouse into the Elasticsearch cluster, I developed a Spark application that uses PySpark to extract the data from S3, repartition it, and batch-send each partition to Elasticsearch, increasing parallelism. The job sets fielddata: true on low-cardinality text columns, which allows sub-aggregations on those columns, and prevents duplicate documents by adding a unique _id field to each row of the dataframe.
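The duplicate-prevention step can be sketched as deriving a deterministic _id from a row's key columns, so that re-running the load overwrites existing documents instead of inserting copies. This is only an illustration; the column names and hashing scheme are assumptions, not the actual Stitch Fix implementation.

```python
import hashlib

def row_id(row, key_columns):
    """Build a deterministic Elasticsearch _id from a row's key columns.

    Re-running the load produces the same _id for the same row, so the
    document is overwritten rather than duplicated. (Illustrative sketch;
    the real job's key columns are not shown in the post.)
    """
    raw = "|".join(str(row[col]) for col in key_columns)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Hypothetical row from a warehouse table:
row = {"client_id": 42, "event_date": "2019-05-01", "score": 0.87}
doc_id = row_id(row, ["client_id", "event_date"])
```

In a PySpark job this function could be applied per row (e.g. via a UDF) before writing, with the connector told to use that column as the document id.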

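For context, fielddata: true is an Elasticsearch index-mapping option rather than a Spark setting: text fields are not aggregatable by default, and enabling fielddata loads their values into heap memory at query time, which is why it is only appropriate for low-cardinality columns. A hedged mapping fragment (field name hypothetical; exact syntax varies by Elasticsearch version):

```json
{
  "mappings": {
    "properties": {
      "client_segment": {
        "type": "text",
        "fielddata": true
      }
    }
  }
}
```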
The job can then be run by data scientists in Flotilla, an internal data platform tool for running jobs on Amazon EC2 Container Service, with environment variables specifying which schema and table to load.
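The environment-variable parameterization might look like the following sketch; the variable names (SCHEMA, TABLE), bucket, and index-naming convention are assumptions for illustration, not Flotilla's actual interface.

```python
import os

def job_target(env=None):
    """Resolve which warehouse table to load from environment variables,
    as passed in by the job runner (names and layout are hypothetical)."""
    env = os.environ if env is None else env
    schema = env["SCHEMA"]
    table = env["TABLE"]
    return {
        "s3_path": f"s3://warehouse/{schema}/{table}/",  # source data in S3
        "es_index": f"{schema}_{table}",                 # target ES index
    }
```

Keeping the schema and table out of the code means data scientists can launch the same job against any table without redeploying it.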

Patrick Sun, Software Engineer at Stitch Fix