Empowering Pinterest Data Scientists and Machine Learning Engineers with PySpark


Data scientists and machine learning engineers at Pinterest found themselves hitting major challenges with existing tools. Hive and Presto were readily accessible tools for large-scale data transformations, but complex logic is difficult to write in SQL. Some engineers wrote that complex logic as Cascading or Scala Spark jobs, but these have a steep learning curve and take significantly more time to learn and to build. Furthermore, data scientists and machine learning engineers often trained models in a small-scale notebook environment, but they lacked the tools to perform large-scale inference.

To address these challenges, we (machine learning and data processing platform engineers) built and productionized a PySpark infrastructure. The PySpark infrastructure gives our users the following capabilities:

  • Writing logic using the familiar Python language and libraries, in isolated environments that allow experimenting with new packages.
  • Rapid prototyping from our JupyterHub deployment, enabling users to interactively try out feature transformations, model ideas, and data processing jobs.
  • Integration with our internal workflow system, so that users can easily productionize their PySpark applications as scheduled workflows.

PySpark on Kubernetes as a minimum viable product (MVP)

We first built an MVP PySpark infrastructure on Pinterest's Kubernetes infrastructure using Spark Standalone mode and tested it with users for feedback.

Figure 1: An overview of the MVP architecture

The infrastructure consists of Kubernetes pods carrying out different tasks:

  • Spark Master managing cluster resources
  • Workers — where Spark executors are spawned
  • Jupyter servers assigned to each user

When users launch PySpark applications from those Jupyter servers, the Spark driver is created in the same pod as Jupyter, and the requested executors are spawned in the worker pods.
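As a rough illustration (not our exact code), a notebook in this setup would create its SparkSession along the following lines; the master URL, driver address, and resource values are hypothetical:

```python
from pyspark.sql import SparkSession

# The driver runs inside the Jupyter pod and connects to the Spark Standalone master.
# "spark-master" is a hypothetical Kubernetes service name; ports and sizes are examples.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("mvp-notebook-session")
    .config("spark.driver.host", "jupyter-user-pod")  # hard-coded driver address (a limitation called out below)
    .config("spark.driver.port", "40000")             # hard-coded driver port
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

print(spark.range(1000).count())
```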

This architecture enabled our users to experience the power of PySpark for the first time. Data scientists were able to quickly grasp Python UDFs, transform features, and perform batch inference of TensorFlow models with terabytes of data.

This architecture, however, had some limitations:

  • Jupyter notebook and PySpark driver share resources since they are in the same pod.
  • The driver's port and address are hard-coded in the config.
  • Users can launch only one PySpark application per assigned Jupyter server.
  • Managing Python dependencies per user or team is difficult.
  • Resource management is limited to a FIFO approach across all users (no queues defined).

As the demand for PySpark grew, we built a production-grade PySpark infrastructure based on YARN, Livy, and Sparkmagic.

Production-grade PySpark infrastructure

Figure 2: An overview of the production architecture

In this architecture, each Spark application runs on the YARN cluster. We use Apache Livy to proxy between our internal JupyterHub, the Spark application and the YARN cluster. On Jupyter, Sparkmagic provides a PySpark kernel that forwards the PySpark code to a running Spark application. Conda provides isolated Python environments for each application.

With this architecture, we offer two development approaches.

Interactive development:

  1. A user creates a conda environment zip containing Python packages they need, if any.
  2. From JupyterHub, they create a notebook with PySpark kernel from Sparkmagic.
  3. In the notebook, they declare the required resources, the conda environment, and other configuration; Livy then launches a Spark application on the YARN cluster.
  4. Sparkmagic ships the user’s Jupyter cells (via Livy) to the PySpark application. Livy proxies results back to the Jupyter notebook.
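For example, once the session is up, an ordinary PySpark cell is executed remotely on the cluster. The snippet below is a minimal sketch of such a cell; the table name, columns, and UDF are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F, types as T

# In a Sparkmagic/Livy notebook, `spark` is already provided by the remote session;
# getOrCreate() simply reuses it (or creates one when run outside the notebook).
spark = SparkSession.builder.getOrCreate()

df = spark.table("example_db.user_events")  # hypothetical table

@F.udf(returnType=T.ArrayType(T.StringType()))
def tokenize(text):
    # Plain-Python logic that would be painful to express in SQL.
    return text.lower().split() if text else []

features = df.withColumn("tokens", tokenize(F.col("raw_text")))
features.select("user_id", "tokens").show(5)
```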

See the Appendix for a fully annotated example of such a Jupyter notebook.

Non-interactive development (ad-hoc and production workflow runs):

  1. A Pinterest-internal Job Submission Service acts as the gateway to the YARN cluster.
  2. In development, the user’s local Python code base is packaged into an archive and submitted to launch a PySpark application in YARN.
  3. In scheduled production runs, the production build’s archive is submitted instead.

Benefits

This infrastructure offers us the following benefits:

  1. No resource sharing between Jupyter notebooks and PySpark drivers
  2. No hard-coded driver ports or addresses
  3. Users can launch many PySpark applications
  4. Efficient resource allocation and isolation, with aggressive dynamic allocation for high resource utilization
  5. Per-user Python dependencies are supported
  6. Resource accountability
  7. Dr. Elephant for PySpark job analyses

Technical details

Pinterest JupyterHub Integration: (benefits #1,2,3)

We made the Sparkmagic kernel available in Jupyter. When the kernel is selected, a config managed by ZooKeeper is loaded with all necessary dependencies.

We set up Apache Livy, which provides a REST API proxy from Jupyter to the YARN cluster and PySpark applications.
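Under the hood, Sparkmagic talks to Livy's REST API. The sketch below shows roughly what the equivalent raw calls look like; the Livy endpoint and configuration values are placeholders:

```python
import json
import time
import requests

LIVY = "http://livy.example.internal:8998"  # placeholder Livy endpoint
headers = {"Content-Type": "application/json"}

# Create a PySpark session on the YARN cluster.
session = requests.post(
    f"{LIVY}/sessions",
    headers=headers,
    data=json.dumps({
        "kind": "pyspark",
        "conf": {"spark.dynamicAllocation.maxExecutors": "50"},
    }),
).json()
session_id = session["id"]

# Wait until the session is ready, then submit a statement (a Jupyter cell, in effect).
while requests.get(f"{LIVY}/sessions/{session_id}", headers=headers).json()["state"] != "idle":
    time.sleep(5)

statement = requests.post(
    f"{LIVY}/sessions/{session_id}/statements",
    headers=headers,
    data=json.dumps({"code": "spark.range(100).count()"}),
).json()

# The result is later fetched by polling /sessions/{session_id}/statements/{statement_id}.
print(statement["id"])
```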

A YARN cluster: (benefit #4)

  • Efficient resource allocation and isolation. We define a queue structure with the Fair Scheduler to ensure dedicated resources. Queues are preemptable under certain conditions (e.g., after waiting for at least 10 minutes), but a portion of non-preemptable resources is reserved for queues with minResources set. Scheduler and resource manager logs are used to manage cluster resources.
  • Aggressive dynamic allocation policy for high resource utilization. We set a policy in which a PySpark application holds at most a certain number of executors and automatically releases them once they are no longer needed. This ensures resources are recycled faster, leading to better resource utilization.
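Concretely, this behavior comes from standard Spark dynamic-allocation settings along the following lines; the queue name and numbers below are illustrative, not our production values:

```python
from pyspark.sql import SparkSession

# Illustrative dynamic-allocation settings; values are examples, not our actual policy.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .config("spark.yarn.queue", "example_team_queue")               # hypothetical Fair Scheduler queue
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")                # required for dynamic allocation on YARN
    .config("spark.dynamicAllocation.minExecutors", "0")
    .config("spark.dynamicAllocation.maxExecutors", "200")          # cap per application
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")   # release idle executors quickly
    .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "300s")
    .getOrCreate()
)
```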

Python Dependency Management: (benefit #5)

Users can try various Python libraries (e.g., different ML frameworks) without asking platform engineers to install them. To that end, we created a Jenkins job that packages a conda environment based on a requirements file and archives it as a zip file on S3. PySpark applications are launched with "--archives" to distribute the zip file to the driver along with all executors, and they set both "PYSPARK_PYTHON" (for the driver) and "spark.yarn.appMasterEnv.PYSPARK_PYTHON" (for the executors). That way, each application runs in an isolated Python environment with all the libraries it needs.
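A minimal sketch of what such a launch configuration might carry is shown below; the S3 path and interpreter locations are hypothetical, and the exact combination of settings depends on the deploy mode:

```python
from pyspark import SparkConf

# Hypothetical S3 location of the conda environment zip built by the Jenkins job.
# The "#ml_env" suffix is the directory name the archive is unpacked into on YARN.
CONDA_ENV_ZIP = "s3://example-bucket/pyspark-envs/ml_env.zip#ml_env"

conf = (
    SparkConf()
    .setAppName("conda-env-example")
    # Equivalent of passing --archives to spark-submit: ship the zip to the driver and all executors.
    .set("spark.yarn.dist.archives", CONDA_ENV_ZIP)
    # Python interpreter inside the unpacked archive, for the YARN application master.
    .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "./ml_env/bin/python")
    # Python interpreter inside the unpacked archive, for the executors.
    .set("spark.executorEnv.PYSPARK_PYTHON", "./ml_env/bin/python")
)
# `conf` is then handed to spark-submit or SparkSession.builder.config(conf=conf).
```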

Integrating with Pinterest-internal Job Submission Service (JSS): (benefit #6)

To productionize PySpark applications, users schedule them with the internal workflow system. We provide a workflow template that integrates with the job submission interface, letting users specify the code location, parameters, and the Python environment artifact to use.

Self-service job performance analysis: (benefit #7)

We forked the open-source Dr. Elephant and added new heuristics that analyze an application's configuration together with various runtime metrics (executor, job, stage, ...). The service provides tuning suggestions and guidelines on how to write a Spark job properly, which alleviates users' debugging and troubleshooting pain and boosts their velocity. It also avoids wasted resources and improves cluster stability. Below is an example of the performance analysis.

Figure 3: An overview of Dr. Elephant

Impacts

PySpark is now used across our Product Analytics and Data Science organization and our Ads teams for a wide range of use cases.

  • Training: users can train models with MLlib or any Python machine learning framework (e.g. TensorFlow) iteratively, on data of any size.
  • Inference: users can test and productionize their Python inference code without depending on platform engineers.
  • Ad-hoc analyses: users can perform various ad-hoc analyses as needed.

Moreover, our users now have the freedom to explore various Python dependencies and to use Python UDFs on large-scale data.

Acknowledgement

We thank David Liu (EM, Machine Learning Platform team), Ang Zhang (EM, Data Processing Platform team), Tais (our TPM), Pinterest Product Analytics and Data Science organization (Sarthak Shah, Grace Huang, Minli Zhang, Dan Lee, Ladi Ositelu), Compute-Platform team (Harry Zhang, June Liu), Data Processing Platform team (Zaheen Aziz), Jupyter team (Prasun Ghosh — Tech Lead) for their support and the collaborations.

Appendix: An example of our use case

Below is an example of how our users train a model, and run inference logic at scale from their Jupyter notebook with PySpark. We leave explanations in each cell.
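The annotated notebook screenshot is not reproduced here; the sketch below conveys the same pattern, training a small model on the driver and applying it at scale with a pandas UDF. It assumes Spark 3's pandas UDF type hints and scikit-learn, and the tables, columns, and output path are hypothetical (the real notebooks may use MLlib or TensorFlow, as mentioned above):

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F, types as T
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.appName("appendix-train-and-infer").getOrCreate()

# 1. Train a small model on the driver (hypothetical features and labels).
train_pdf = spark.table("example_db.training_set").toPandas()
model = LogisticRegression().fit(train_pdf[["f1", "f2"]], train_pdf["label"])

# 2. Broadcast the fitted model to the executors.
model_bc = spark.sparkContext.broadcast(model)

# 3. Run distributed inference with a pandas UDF.
@F.pandas_udf(T.DoubleType())
def predict(f1: pd.Series, f2: pd.Series) -> pd.Series:
    features = pd.DataFrame({"f1": f1, "f2": f2})
    return pd.Series(model_bc.value.predict_proba(features)[:, 1])

scored = (
    spark.table("example_db.scoring_set")
    .withColumn("score", predict(F.col("f1"), F.col("f2")))
)
scored.write.mode("overwrite").parquet("s3://example-bucket/scored/")  # hypothetical output path
```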
