Feed powered byStream Blue Logo Copy 5Created with Sketch.
Stitch Fix

Decision at Stitch Fix about Elasticsearch, Kibana

Avatar of psunnn
Software Engineer at Stitch Fix
ElasticsearchElasticsearch
KibanaKibana

Elasticsearch's built-in visualization tool, Kibana, is robust and the appropriate tool in many cases. However, it is geared specifically towards log exploration and time-series data, and we felt that its steep learning curve would impede adoption rate among data scientists accustomed to writing SQL. The solution was to create something that would replicate some of Kibana's essential functionality while hiding Elasticsearch's complexity behind SQL-esque labels and terminology ("table" instead of "index", "group by" instead of "sub-aggregation") in the UI.

Elasticsearch's API is really well-suited for aggregating time-series data, indexing arbitrary data without defining a schema, and creating dashboards. For the purpose of a data exploration backend, Elasticsearch fits the bill really well. Users can send an HTTP request with aggregations and sub-aggregations to an index with millions of documents and get a response within seconds, thus allowing them to rapidly iterate through their data.

10 upvotes128 views

Decision at Stitch Fix about Apache Spark, Victory, Amazon S3, Elasticsearch, Redux.js, React

Avatar of psunnn
Software Engineer at Stitch Fix
Apache SparkApache Spark
VictoryVictory
Amazon S3Amazon S3
ElasticsearchElasticsearch
Redux.jsRedux.js
ReactReact

As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.

The final product, code-named Dora, is built with React, Redux.js and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.

9 upvotes1.8K views

Decision at Stitch Fix about Amazon EC2 Container Service, Elasticsearch, Amazon S3

Avatar of psunnn
Software Engineer at Stitch Fix
Amazon EC2 Container ServiceAmazon EC2 Container Service
ElasticsearchElasticsearch
Amazon S3Amazon S3

To load data from our Amazon S3 data warehouse into the Elasticsearch cluster, I developed a Spark application that uses PySpark to extract data from S3, partition, then batch-send each partition to Elasticsearch to increase parallelism. The Spark job enables fielddata: true for text columns with low cardinality to allow sub-aggregations by text columns and prevents data duplication by adding a unique _id field to each row in the dataframe.

The job can then be run by data scientists in Flotilla, an internal data platform tool for running jobs on Amazon EC2 Container Service, with environment variables specifying which schema and table to load.

7 upvotes163 views