What is Apache Spark?
Who uses Apache Spark?
Apache Spark Integrations
Why developers like Apache Spark?
Here are some stack decisions, common use cases and reviews by companies and developers who chose Apache Spark in their tech stack.
The algorithms and data infrastructure at Stitch Fix is housed in #AWS. Data acquisition is split between events flowing through Kafka, and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3 based data warehouse. Apache Spark on Yarn is our tool of choice for data movement and #ETL. Because our storage layer (s3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling Yarn clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for adhoc queries and dashboards.
Beyond data movement and ETL, most #ML centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in house and open sourced (see https://github.com/stitchfix/flotilla-os).
At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated systems. That requires serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying to Amazon ECS. This provides our data scientist a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
For more info:
- Our Algorithms Tour: https://algorithms-tour.stitchfix.com/
- Our blog: https://multithreaded.stitchfix.com/blog/
- Careers: https://multithreaded.stitchfix.com/careers/
#DataScience #DataStack #Data
How Uber developed the open source, end-to-end distributed tracing Jaeger , now a CNCF project:
Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.
Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve:
As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.
The final product, code-named Dora, is built with React, Redux.js and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.
Artigo que introduz como estamos aplicando Apache Spark em projetos de Machine Learning.
Nesse artigo podemos ter uma visão de alto nível de como o Spark funciona, quais são as APIs disponíveis, o que cada uma delas se propõe a fazer e como configurá-lo e programar usando PySpark no Google Colab.
Sugiro que você dê uma lida na documentação do PySpark.sql e tente criar algumas análises diferentes da base de dados. Tente entender melhor como a API funciona e como o processamento é feito, só não vale usar Pandas, ok? Salvei o notebook com todos os comandos em . Basta fazer o download, subir para o Google Colab e brincar com os comandos.
I use Apache Spark because it is THE framework for big data processing from big tech to startup. It can be run on pretty much any platform. It's open source, and lots of community support and code samples to draw from.
The Python API is good for low-med level transformations, but most recommend starting with Scala/Java to use full spark capabilities.
It comes with quite learning curve to make sense of how data is shuffling through different nodes, but it's worth it for running large-scale ETL.
Also, keep in mind the streaming and batch frameworks are not unified, so you'll have learn them both separately.
In mid-2015, Uber began exploring ways to scale ML across the organization, avoiding ML anti-patterns while standardizing workflows and tools. This effort led to Michelangelo.
Michelangelo consists of a mix of open source systems and components built in-house. The primary open sourced components used are HDFS, Spark, Samza, Cassandra, MLLib, XGBoost, and TensorFlow.
Apache Spark's Features
- Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Write applications quickly in Java, Scala or Python
- Combine SQL, streaming, and complex analytics
- Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3