The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.
Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).
At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
For more info:
#DataScience #DataStack #Data
We used to Mocha for as our primary Node.js test framework. We've now switched to Jest and haven't looked back.
Jest is faster and requires less setup and configuration. The Mocha API and eco-system is vast and verified, but that also brings complexity.
It you want to get in, write tests, execute them and get out, try Jest 😀
When creating the web infrastructure for our start-up, I wanted to host our app on a PaaS to get started quickly.
A very popular one for Rails is Heroku, which I love for free hobby side projects, but never used professionally. On the other hand, I was very familiar with the AWS ecosystem, and since I was going to use some of its services anyways, I thought: why not go all in on it?
It turns out that Amazon offers a PaaS called AWS Elastic Beanstalk, which is basically like an “AWS Heroku”. It even comes with a similar command-line utility, called "eb”. While edge-case Rails problems are not as well documented as with Heroku, it was very satisfying to manage all our cloud services under the same AWS account. There are auto-scaling options for web and worker instances, which is a nice touch. Overall, it was reliable, and I would recommend it to anyone planning on heavily using AWS.
I use Visual Studio because it provides me best default configuration for development. Less choice helps me concentrate on the product. In a sense it is iPhone of software development for me. When my laptop broke, I just download latest version of VS and start coding without any configuration. For sure it has best editor in terms of perceived responsiveness. Could not say the same for IntelliJ IDEA unfortunately.