Stitch Fix

Stitch Fix

33 Followers
Stitch Fix is a personal styling platform that delivers curated and personalized apparel and accessory items of perfect fit ... more

Decisions 4

Eric Colson

Chief Algorithms Officer at Stitch Fix

The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.

Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.

For more info:

#DataScience #DataStack #Data

19 342.7K

Patrick Sun

Software Engineer at Stitch Fix

Elasticsearch's built-in visualization tool, Kibana, is robust and the appropriate tool in many cases. However, it is geared specifically towards log exploration and time-series data, and we felt that its steep learning curve would impede adoption rate among data scientists accustomed to writing SQL. The solution was to create something that would replicate some of Kibana's essential functionality while hiding Elasticsearch's complexity behind SQL-esque labels and terminology ("table" instead of "index", "group by" instead of "sub-aggregation") in the UI.

Elasticsearch's API is really well-suited for aggregating time-series data, indexing arbitrary data without defining a schema, and creating dashboards. For the purpose of a data exploration backend, Elasticsearch fits the bill really well. Users can send an HTTP request with aggregations and sub-aggregations to an index with millions of documents and get a response within seconds, thus allowing them to rapidly iterate through their data.

11 30.7K

Patrick Sun

Software Engineer at Stitch Fix

As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.

The final product, code-named Dora, is built with React, Redux.js and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.

10 28.3K

Patrick Sun

Software Engineer at Stitch Fix

To load data from our Amazon S3 data warehouse into the Elasticsearch cluster, I developed a Spark application that uses PySpark to extract data from S3, partition, then batch-send each partition to Elasticsearch to increase parallelism. The Spark job enables fielddata: true for text columns with low cardinality to allow sub-aggregations by text columns and prevents data duplication by adding a unique _id field to each row in the dataframe.

The job can then be run by data scientists in Flotilla, an internal data platform tool for running jobs on Amazon EC2 Container Service, with environment variables specifying which schema and table to load.

7 5.5K

Followers 33

ํƒœํ™” ์ •
Thuy Tran
pchj
Yonas Beshawred
Tomasz Bak
Zheng Wang
Peter Zonneveld
Miguel Castilla
Jack Lam
Jason Chen
Roger Noble
Gil Perez
Wesley Jefferson
Artur Daveyan
vic bravo
Elyor Latipov
Lito Santos
VANNINA MI
jmstone617
Mandy Scott
Jeremy Bauer
Jignesh Suvariya
Yosef Benny Widyokarsono
Sourav Basu
DougEverly
Patrick Sun
Neelesh Salian