Decision at Stitch Fix about Amazon EC2 Container Service, Elasticsearch, Amazon S3

Software Engineer at Stitch Fix

To load data from our Amazon S3 data warehouse into the Elasticsearch cluster, I developed a Spark application that uses PySpark to extract data from S3, partition it, and then send each partition to Elasticsearch in batches, so that partitions are indexed in parallel. The Spark job enables fielddata: true on low-cardinality text columns so that they can be used in sub-aggregations, and it prevents duplicate documents by adding a unique _id field to each row in the DataFrame.
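The per-partition logic described above might look roughly like the sketch below. It is an illustration only, not the actual Stitch Fix job: the helper names (`build_mapping`, `row_id`, `to_actions`, `send_partition`), the batch size, and the idea of hashing key columns to obtain the `_id` are all assumptions. The sketch uses only the standard library so it runs without a cluster; `send_batch` stands in for whatever call actually ships a batch to Elasticsearch.

```python
import hashlib
import json
from itertools import islice


def build_mapping(text_columns):
    """Mapping body that enables fielddata on the given text columns."""
    return {
        "properties": {
            col: {"type": "text", "fielddata": True} for col in text_columns
        }
    }


def row_id(row, key_columns):
    """Deterministic _id hashed from the row's key columns (an assumed scheme)."""
    raw = json.dumps([row[c] for c in key_columns], default=str)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def to_actions(rows, index, key_columns):
    """Turn rows (dicts) into bulk-index actions that carry an explicit _id."""
    for row in rows:
        yield {"_index": index, "_id": row_id(row, key_columns), "_source": row}


def batched(iterable, size):
    """Yield lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def send_partition(rows, index, key_columns, send_batch, batch_size=500):
    """Send one partition in batches; return the number of documents sent."""
    sent = 0
    for chunk in batched(to_actions(rows, index, key_columns), batch_size):
        send_batch(chunk)  # hypothetical callback that performs the bulk request
        sent += len(chunk)
    return sent
```

In a PySpark job, a function like `send_partition` would typically be called from `df.rdd.foreachPartition`, with each row converted via `row.asDict()` and `send_batch` wrapping a bulk helper from an Elasticsearch client. Because the `_id` here is derived from the row's key columns, indexing the same row twice overwrites the existing document instead of creating a second one.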

The job can then be run by data scientists in Flotilla, an internal data platform tool for running jobs on Amazon EC2 Container Service, with environment variables specifying which schema and table to load.
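As a minimal sketch of how such a job could pick up its target from the environment: the variable names `LOAD_SCHEMA` and `LOAD_TABLE` and the function `load_job_config` are hypothetical, since the post does not say which names the real job reads.

```python
import os


def load_job_config(env=os.environ):
    """Read the target schema and table from environment variables.

    LOAD_SCHEMA and LOAD_TABLE are assumed names used for illustration.
    """
    missing = [k for k in ("LOAD_SCHEMA", "LOAD_TABLE") if not env.get(k)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {"schema": env["LOAD_SCHEMA"], "table": env["LOAD_TABLE"]}
```

Failing fast on a missing variable keeps a misconfigured run from starting a Spark job against the wrong table or no table at all.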


Patrick Sun

Software Engineer at Stitch Fix