Apache Spark

Decision at Stitch Fix about Amazon EC2 Container Service, Docker, PyTorch, R, Python, Presto, Apache Spark, Amazon S3, PostgreSQL, Kafka, Data, DataStack, DataScience, ML, Etl, AWS

ecolson, Chief Algorithms Officer at Stitch Fix

The algorithms and data infrastructure at Stitch Fix are housed in #AWS. Data acquisition is split between events flowing through Kafka and periodic snapshots of PostgreSQL DBs. We store data in an Amazon S3-based data warehouse. Apache Spark on YARN is our tool of choice for data movement and #ETL. Because our storage layer (S3) is decoupled from our processing layer, we are able to scale our compute environment very elastically. We have several semi-permanent, autoscaling YARN clusters running to serve our data processing needs. While the bulk of our compute infrastructure is dedicated to algorithmic processing, we also implemented Presto for ad hoc queries and dashboards.
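A minimal sketch of the Spark-on-S3 ETL pattern described above, not Stitch Fix's actual code; the bucket names, paths, and columns are hypothetical, and the S3 connector (hadoop-aws) is assumed to be on the Spark classpath:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Read raw event data that has landed in S3 (hypothetical bucket and layout).
events = spark.read.json("s3a://example-raw-bucket/events/2019-01-01/")

# A simple transformation: keep valid events and count them per user and type.
daily_counts = (events
                .filter(F.col("event_type").isNotNull())
                .groupBy("user_id", "event_type")
                .count())

# Write the result back to the S3 warehouse, partitioned for downstream readers.
(daily_counts.write
 .mode("overwrite")
 .partitionBy("event_type")
 .parquet("s3a://example-warehouse-bucket/daily_event_counts/"))
```

Because both input and output live on S3 rather than on cluster-local storage, the YARN cluster that ran a job like this can be resized or torn down without affecting the data.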

Beyond data movement and ETL, most #ML-centric jobs (e.g. model training and execution) run in a similarly elastic environment as containers running Python and R code on Amazon EC2 Container Service clusters. The execution of batch jobs on top of ECS is managed by Flotilla, a service we built in-house and open-sourced (see https://github.com/stitchfix/flotilla-os).

At Stitch Fix, algorithmic integrations are pervasive across the business. We have dozens of data products actively integrated into systems. That requires a serving layer that is robust, agile, flexible, and allows for self-service. Models produced on Flotilla are packaged for deployment in production using Khan, another framework we've developed internally. Khan provides our data scientists the ability to quickly productionize those models they've developed with open source frameworks in Python 3 (e.g. PyTorch, sklearn), by automatically packaging them as Docker containers and deploying them to Amazon ECS. This gives our data scientists a one-click method of getting from their algorithms to production. We then integrate those deployments into a service mesh, which allows us to A/B test various implementations in our product.
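As a rough illustration of the packaging step, here is a minimal sketch of the kind of scoring service such a framework might wrap in a Docker container; this is not Khan's actual interface, and the model file, route, and payload shape are hypothetical:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a model trained with an open source framework (e.g. scikit-learn).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[1.0, 2.0, 3.0]]}.
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    # Inside the container this would typically run behind a proper WSGI server.
    app.run(host="0.0.0.0", port=8080)
```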

#DataScience #DataStack #Data


Decision at Uber Technologies about Apache Spark, C#, OpenShift, JavaScript, Kubernetes, C++, Go, Node.js, Java, Python, Jaeger

conor, Tech Brand Mgr, Office of CTO at Uber

How Uber developed Jaeger, the open source, end-to-end distributed tracing system that is now a CNCF project:

Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second.

Here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from a pull to a push architecture, and how distributed tracing will continue to evolve:

https://eng.uber.com/distributed-tracing/

(Project site: https://www.jaegertracing.io/, GitHub: https://github.com/jaegertracing/jaeger)

Bindings/Operator: Python, Java, Node.js, Go, C++, Kubernetes, JavaScript, OpenShift, C#, Apache Spark
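For a sense of what instrumentation looks like from the Python binding listed above, here is a minimal sketch using the jaeger-client package; the service name, operation name, and tags are hypothetical:

```python
from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},  # sample every trace
        "logging": True,
    },
    service_name="example-service",
    validate=True,
)
tracer = config.initialize_tracer()

# Wrap a unit of work in a span; finished spans are pushed to the Jaeger agent.
with tracer.start_span("example-operation") as span:
    span.set_tag("example.tag", "value")
    span.log_kv({"event": "work-done"})

tracer.close()  # flush any buffered spans before exiting
```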


Decision at Stitch Fix about Apache Spark, Victory, Amazon S3, Elasticsearch, Redux.js, React

psunnn, Software Engineer at Stitch Fix

As a frontend engineer on the Algorithms & Analytics team at Stitch Fix, I work with data scientists to develop applications and visualizations to help our internal business partners make data-driven decisions. I envisioned a platform that would assist data scientists in the data exploration process, allowing them to visually explore and rapidly iterate through their assumptions, then share their insights with others. This would align with our team's philosophy of having engineers "deploy platforms, services, abstractions, and frameworks that allow the data scientists to conceive of, develop, and deploy their ideas with autonomy", and solve the pain of data exploration.

The final product, code-named Dora, is built with React, Redux.js and Victory, backed by Elasticsearch to enable fast and iterative data exploration, and uses Apache Spark to move data from our Amazon S3 data warehouse into the Elasticsearch cluster.
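A minimal sketch of that last step, not the actual Dora pipeline: Spark reads a dataset from S3 and indexes it into Elasticsearch via the elasticsearch-hadoop connector. The bucket, index name, Elasticsearch host, and connector version are hypothetical, and the connector jar must be available to Spark:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-to-elasticsearch-sketch")
         # Pull in the elasticsearch-hadoop Spark connector (version assumed).
         .config("spark.jars.packages",
                 "org.elasticsearch:elasticsearch-spark-20_2.11:6.4.2")
         .getOrCreate())

# Read a dataset from the S3 warehouse (hypothetical bucket and layout).
df = spark.read.parquet("s3a://example-warehouse/shipments/")

# Index the rows into Elasticsearch so the frontend can query them quickly.
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "example-es-host")
   .option("es.port", "9200")
   .mode("append")
   .save("shipments/docs"))
```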


Decision at Grupo Movile about Apache Spark

movilebr

An article introducing how we are applying Apache Spark in Machine Learning projects.

In the article you get a high-level view of how Spark works, which APIs are available, what each of them is meant to do, and how to configure it and program with PySpark on Google Colab.

I suggest you read through the PySpark.sql documentation and try to build a few different analyses of the dataset. Try to get a better understanding of how the API works and how the processing is done, just don't use Pandas, okay? I saved the notebook with all the commands in [8]. Just download it, upload it to Google Colab, and play with the commands.
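A minimal sketch of the Colab setup the article walks through, assuming PySpark has been installed in the notebook with !pip install pyspark; the CSV file and column names are hypothetical stand-ins for the article's dataset:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .master("local[*]")        # Colab gives you a single machine
         .appName("colab-sketch")
         .getOrCreate())

df = spark.read.csv("example_data.csv", header=True, inferSchema=True)

# A simple pyspark.sql-style analysis, without falling back to Pandas.
(df.groupBy("category")
   .agg(F.count("*").alias("rows"), F.avg("value").alias("avg_value"))
   .orderBy(F.desc("rows"))
   .show())
```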


Decision at Uber Technologies about Kafka Manager, Kafka, GitHub, Apache Spark, Hadoop

conor, Tech Brand Mgr, Office of CTO at Uber

Why we built Marmaray, an open source generic data ingestion and dispersal framework and library for Apache Hadoop:

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink, leveraging Apache Spark. The name, Marmaray, comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink depending on customer preference:

https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

(Direct GitHub repo: https://github.com/uber/marmaray)
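To make the any-source-to-any-sink idea concrete, here is a conceptual sketch in PySpark; it is not Marmaray's actual API (Marmaray itself is a JVM framework), and the source, sink, paths, and table name are hypothetical:

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

def json_file_source(path: str) -> DataFrame:
    """Hypothetical source plug-in: read previously landed JSON events."""
    return spark.read.json(path)

def table_sink(df: DataFrame, table: str) -> None:
    """Hypothetical sink plug-in: persist the DataFrame as a managed table."""
    df.write.mode("overwrite").saveAsTable(table)

def run_pipeline(source, sink, path: str, table: str) -> None:
    """Connect any source to any sink, mirroring the 'tunnel' idea above."""
    sink(source(path), table)

run_pipeline(json_file_source, table_sink,
             path="/tmp/example_events/", table="example_ingested_events")
```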



Decision at Onedot about npm, Blueprint, Amazon S3, Apache Spark, Cassandra, TypeScript, Scala, Redux.js, React

onedotadmin, CTO at Onedot

Onedot is building an automated data preparation service using probabilistic and statistical methods including artificial intelligence (AI). From the beginning, having a stable foundation while at the same time being able to iterate quickly was very important to us. Due to the nature of the compute workloads we face, the decision for a functional programming paradigm and a scalable cluster model was a no-brainer. We started playing with Apache Spark very early on, when the platform was still in its infancy. As a storage backend, we first used Cassandra, but found out that it was not the optimal choice for our workloads (lots of rather smallish datasets, data pipelines with considerable complexity, etc.). In the end, we migrated dataset storage to Amazon S3, which proved to be much better suited to our case. In the frontend, we bet on more traditional frameworks like React/Redux.js, Blueprint and a number of common npm packages from our ecosystem. Because of the very positive experience with Scala (in particular the ability to write things very expressively, use immutability across the board, etc.), we settled on TypeScript in the frontend. In our opinion, a very good decision. Nowadays, transpiling is a common thing, so we thought: why not introduce the same type-safety and mathematical rigour to the user interface?


Decision about Apache Spark

Wei-1

Spark is good at managing parallel data processing. We wrote a neat program to handle the terabytes of data we get every day.
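A minimal sketch, not the author's actual program, of the kind of parallel batch processing described here; the paths, columns, and partition count are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-batch-sketch").getOrCreate()

# Hypothetical daily drop of raw records.
raw = spark.read.parquet("/data/daily/2019-01-01/")

# Repartition so the terabyte-scale input fans out across many executors
# instead of a handful of oversized tasks.
result = (raw.repartition(2000)
             .filter(F.col("status") == "ok")
             .groupBy("account_id")
             .agg(F.sum("bytes").alias("total_bytes")))

result.write.mode("overwrite").parquet("/data/aggregates/2019-01-01/")
```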
