What is Python?
Who uses Python?
Why developers like Python?
Here are some stack decisions, common use cases and reviews by companies and developers who chose Python in their tech stack.
Winds 2.0 is an open source Podcast/RSS reader developed by Stream with a core goal to enable a wide range of developers to contribute.
Recently we were looking at a few robust and cost-effective ways of replicating the data that resides in our production MongoDB to a PostgreSQL database for data warehousing and business intelligence.
We set ourselves the following criteria for the optimal tool that would do this job: - The data replication must be near real-time, yet it should NOT impact the production database - The data replication must be horizontally scalable (based on the load), asynchronous & crash-resilient
Based on the above criteria, we selected the following tools to perform the end to end data replication:
We chose MongoDB Stitch for picking up the changes in the source database. It is the serverless platform from MongoDB. One of the services offered by MongoDB Stitch is Stitch Triggers. Using stitch triggers, you can execute a serverless function (in Node.js) in real time in response to changes in the database. When there are a lot of database changes, Stitch automatically "feeds forward" these changes through an asynchronous queue.
We chose Amazon SQS as the pipe / message backbone for communicating the changes from MongoDB to our own replication service. Interestingly enough, MongoDB stitch offers integration with AWS services.
In the Node.js function, we wrote minimal functionality to communicate the database changes (insert / update / delete / replace) to Amazon SQS.
Next we wrote a minimal micro-service in Python to listen to the message events on SQS, pickup the data payload & mirror the DB changes on to the target Data warehouse. We implemented source data to target data translation by modelling target table structures through SQLAlchemy . We deployed this micro-service as AWS Lambda with Zappa. With Zappa, deploying your services as event-driven & horizontally scalable Lambda service is dumb-easy.
In the end, we got to implement a highly scalable near realtime Change Data Replication service that "works" and deployed to production in a matter of few days!
Simple controls over complex technologies, as we put it, wouldn't be possible without neat UIs for our user areas including start page, dashboard, settings, and docs.
Initially, there was Django. Back in 2011, considering our Python-centric approach, that was the best choice. Later, we realized we needed to iterate on our website more quickly. And this led us to detaching Django from our front end. That was when we decided to build an SPA.
For building user interfaces, we're currently using React as it provided the fastest rendering back when we were building our toolkit. It’s worth mentioning Uploadcare is not a front-end-focused SPA: we aren’t running at high levels of complexity. If it were, we’d go with Ember.js.
However, there's a chance we will shift to the faster Preact, with its motto of using as little code as possible, and because it makes more use of browser APIs. One of our future tasks for our front end is to configure our Webpack bundler to split up the code for different site sections. For styles, we use PostCSS along with its plugins such as cssnano which minifies all the code.
All that allows us to provide a great user experience and quickly implement changes where they are needed with as little code as possible.
The 350M API requests we handle daily include many processing tasks such as image enhancements, resizing, filtering, face recognition, and GIF to video conversions.
Tornado is the one we currently use and aiohttp is the one we intend to implement in production in the near future. Both tools support handling huge amounts of requests but aiohttp is preferable as it uses asyncio which is Python-native. Since Python is in the heart of our service, we initially used PIL followed by Pillow. We kind of still do. When we figured resizing was the most taxing processing operation, Alex, our engineer, created the fork named Pillow-SIMD and implemented a good number of optimizations into it to make it 15 times faster than ImageMagick
Thanks to the optimizations, Uploadcare now needs six times fewer servers to process images. Here, by servers I also mean separate Amazon EC2 instances handling processing and the first layer of caching. The processing instances are also paired with AWS Elastic Load Balancing (ELB) which helps ingest files to the CDN.
The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.
Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).
At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
For more info:
- Our Algorithms Tour: https://algorithms-tour.stitchfix.com/
- Our blog: https://multithreaded.stitchfix.com/blog/
- Careers: https://multithreaded.stitchfix.com/careers/
#DataScience #DataStack #Data
Sentry's event processing pipeline, which is responsible for handling all of the ingested event data that makes it through to our offline task processing, is written primarily in Python.
For particularly intense code paths, like our source map processing pipeline, we have begun re-writing those bits in Rust. Rust’s lack of garbage collection makes it a particularly convenient language for embedding in Python. It allows us to easily build a Python extension where all memory is managed from the Python side (if the Python wrapper gets collected by the Python GC we clean up the Rust object as well).