How Ads Indexing Works at Pinterest

1,253
Pinterest
Pinterest's profile on StackShare is not actively maintained, so the information here may be out of date.

Introduction

Ads indexing is a data processing pipeline that builds servable ad documents for the ads delivering stack at Pinterest, including ads targeting, retrieval, ranking and auction. The pipeline converts ad campaigns in the raw format to the servable format via a heavy computational process with a lot of data joining. This is a key component in the ads delivery stack because all data used in ads delivery must go through this pipeline. The freshness and quality of the ads documents has a direct impact on Pinner and partner experiences. As the Pinterest ads business has been growing rapidly in the past years, there was an urgent need to scale up the size of index and lower the E2E indexing latency at the same time, which traditional batch index builders can not meet. Thus, we built a scalable incremental indexing system inspired by Google’s web indexing system “Caffeine”, which achieves seconds-level indexing E2E latency with the scale up to 100M+ documents.

System requirements

The ads indexing system is tailored to ads indexing business requirements. Let’s walk through the basic ads indexing knowledge to understand the requirements for this system.

All the data processed by ads indexing pipeline falls into two categories: ads control and ads content data. Ads control data is the ads metadata set up by advertisers used in ads targeting, auction bidding, etc. Ads content data consists of various signals we use to understand the ads Pin for optimizing ads delivery performance. Here we list a few representatives in each category:

Ads control data

  • Targeting specs
  • Budget
  • Bid
  • Creative type

Ads content data

  • Image signature
  • Pin interests(Coteries)
  • Text annotation
  • Pinvisual embedding
  • Historical performance data

Different data sources may have different SLAs on the data freshness and integrity. Here we list the general guidelines for two catalogs:

For ads control data, we need low latency and absolute data correctness, because the data freshness directly impacts advertisers’ experience and data correctness is the key to have advertisers’ trust. On the other hand, Ads content data has a relatively loose SLA on data freshness, as many upstream data sources have a low data update frequency. However, they still require strong data correctness as low quality data can lead to severe ads delivery performance issues. To meet these requirements, we designed the ads indexing system as a combination of real-time incremental and batch pipelines. Here are the major responsibilities for each pipeline.

Real-time incremental pipeline

  • Support distributed transactions on data processing to ensure the data consistency during large-scale concurrent processing
  • Support push-based notifications on ads control data changes to achieve seconds-level update-to-serve E2E latency
  • Have two logical pipelines as High and Medium priorities

Batch pipeline requirements

  • Keep the ads index update to date on Ads Content data via the periodical refresh on the pull-based data.
  • Achieve eventual consistency for both Ads control and content data as it covers any possible message loss or process failure in realtime pipeline
  • Have one logical pipeline as Low priority

In addition, this incremental and base combined indexing system provides better pipeline availability. If the real-time pipeline is clogged, the serving index will fall back to the base index generated by the batch pipeline. If the batch pipeline fails one time, the real-time serving index can still cover all the delta updates since the previous batch index. As the trade off, there is additional merging logic between two indices at serving.

High Level System Introduction

Fig.1: An overview on ads indexing system with data flow

The above graph describes the high level ads indexing system, where the highlighted components are the four core components as Gateway, Updater, Storage Repo and Argus.

  1. Gateway is a light-weighted stream process framework. It is in charge of converting the different update events from upstream sources to the ads indexing updater format, then sending them to Updater via either Kafka or direct RPC.
  2. Updater is the data ingestion component, which receives updates from Gateway and writes them into Storage Repo.
  3. Storage Repo keeps all data through the ad document process life cycle, including raw updates, intermediate data, and final servable documents. Storage Repo needs to support a few key features: 1) cross rows and tables transactions, 2) column level change notification to Argus.
  4. Argus is notification triggered data process services. After being notified by a data change event from Repo, an Argus worker reads all dependent data from data repo and conducts a computational heavy process to generate the final servable documents. The servable documents are written back to Repo with the transaction check as well as published to the realtime serving.

These four components are connected but under sufficient isolation. There are event queues among Gateway, Updater and Argus, which allow each component to process at their own speeds. Gateway, Updater and Argus are all stateless services, which are made to be easily scaled up and down. Updater and Argus both have three cluster instances as High, Medium and Low priorities. All four components can have their own release schedules as their interactive APIs are designed to be backward compatible.

In addition, there are a few batch workflows in the system as a supplement to the real-time incremental processing. They take inputs from database dumps to either trigger actions to Updater and Argus, or build base index outputs for serving. We will describe more details about each workflow below.

Implementation

Gateway

We built this lightweight stateless streamer service based on Kafka Stream, as it provides good support on “at least once process”. The major upstream is the mysql binlog from the ads database. As the mysql binlog contains only the delta change per transaction per table, we need to materialize it by querying ads database again. It also provides a short-cut path from ads database to serving by skipping the following heavy indexing process for some light-weighted serving documents, such as budget updates. For example, if a campaign budget is updated, Gateway will directly push this change to real-time budget control service via Kafka.

Updater

Updater is a lightweight data ingestion service built on Kafka stream. It basically tails Kafka topics, extracts structured data from messages and writes them into Storage Repo with transaction protection. For versioned data, it also does the version check before updating to keep data update sequence while dealing with unordered upstream events. Besides updates from the Kafka, Updater also takes updates from RPC from batch processes for the backfill purpose.

Storage Repo

Storage Repo is a columnar Key-Value storage that provides cross-table transactional operations for Updater and Argus. It also pushes the column level change notification to Argus. Because ads indexing pipeline needs to support full batch index building, Storage Repo also needs to integrate well with common big data workflows, such as snapshot dump, MapReduce jobs, and etc. Thus, we chose Apache Omid & HBase as the transactional nosql database for the Storage Repo. Omid is the transaction management service on top of HBase, which is inspired by Google’s Percolator. In Percolator, the change notifications are stored in a bigtable column and workers constantly scan partitioned range to pull change notifications. Because HBase can’t afford heavily scanning operations, we built the change notification service as HBase proxy, which converts HBase WAL to change notifications and publishes them to Kafka for data processing.

Argus

Argus is the service that monitors the changes in Storage Repo to further process changed data. It is built on Kafka consumer with data APIs to Storage Repo. It tails the change notification from Storage Repo, and runs various handlers according to event type. Each handler reads multiple columns across different rows and tables in Storage, generates derived data objects via data joinings or enrichments, writes them back to Storage Repo within a single transaction. We built ads handlers inside Argus to do the following work:

  • Ads handlers contain all the ads business logic of generating final servable Ads documents.
  • Ads handlers pull some of Ads content data from external data sources for data enrichments. There is an in-memory cache layer inside handlers to improve the efficiency
  • Ads handlers have options to publish the final servable documents to serving services via Kafka

Batch process workflows

There are a few batch process workflows running in this system to ensure the data integrity. Because we already have the real-time incremental process pipelines, the batch workflows are used mainly to compensate the incremental pipeline and they share the same execution binary on the data process. Here are the list of workflows running in system

  • Base index builder workflows: they run every few hours to generate the base index
  • Data sync workflows: 1) the data sync between Ads Database and Storage Repo ensures the data consistency of ads control data. 2) the data sync among different tables in Storage Repo to check the data consistency within Storage Repo.
  • Refresh workflows: they run periodically to mark all ads documents as newly updated to trigger the reprocessing for refreshing all pull type signals.
  • GC workflows: they run infrequently to remove very old inactive ad documents so that Storage Repo won’t grow infinitely to hurt its performance.

Downstream Services

The major downstream consumers are Ads Manas cluster (inverted index) and key-value store for Ads Scorpion (forward index). Both of them use delta architecture serving, taking base index and real-time per document update. The base index is published through S3 while the real-time document update is via Kafka broadcasting.

System Visibility

The final ad servable document contains more than 100 ads control and content fields, and many engineers actively contribute to ads indexing by adding new fields or modifying existing fields. It is necessary for the platform to provide a good E2E visibility for engineers to improve their developing velocity. Therefore, we provide the following features in the ads indexing system:

  • Integration test and Dev run: Gateway, Updater and Argus all provide the same RPC interfaces corresponding to their Kafka message ingestion interfaces. This setting simplifies the dev run for integration test on these components. Each component has its dev run with before/after comparison report for developers to understand the impact of their code changes to ads index
  • Code Release: Each service has its own release cycle. But we set an E2E staging pipeline for continuous deployment.
  • Presubmit integration test: Argus handlers have most complicated application logic and many more engineers work in its code base, thus we set the presubmit integration test on Argus handlers
  • Debugging UI: Storage Repo keeps all intermediate data and additional debugging information. We built a web-based UI for developers to easily access this information
  • Historical records: In addition to HBase versioning, we periodically snapshot HBase tables to the data warehouse for the historical records

In practice, the ads indexing team is often involved in various ads delivery debugging tickets, such as "why aren't my ads campaigns delivering?". Having the above visibility features can allow engineers to quickly locate the root cause.

System monitoring

The Ads indexing system is one of the most critical components in the ads delivery service, so we built a comprehensive monitoring for its system health in two major areas: 1) data freshness, 2) data coverage. The followings are the key metrics in our dashboard.

  • Incremental pipeline health: E2E latency per document, latency breakdown by stages, process throughputs, daily report on the number of dropped messages in ads incremental pipeline and etc.
  • Base pipeline health: Base index staleness, index data volume change, the coverage of critical fields in base index and etc.
  • Data consistency: Number of unsynced documents reported by various sync workflows.

Production Results

The Ads indexing system has been running smoothly in production for more than one year. The hybrid system does not only provide seconds level E2E latency most of the time, but also ensures the system high availability and data integrity. Here we select a few highlights from the production results

  • Ads control data update-to-serve p90 latency < 60 seconds in 99.9% time
  • Ads control data update-to-serve max latency < 24 hours in 100% time
  • Single-digit daily number of dropped messages in incremental pipeline

The Ads indexing system is highly resilient to handling various anomalies in production. With simple manual operations, it can recover very quickly from incidents to ensure data freshness and integrity. Here are some common production issues.

  • Ads real-time pipeline clogging: This type of problem is usually caused by a huge spike in upstream updates. In this case, we can choose to shut down the data publishing from incremental pipeline to realtime serving so that it won’t be overwhelmed. The pipeline relies on the sync workflow to keep base index update to date. The Gateway and Update are both set to drop stale messages to quickly clear up the message buffed in Kafka, so that the ads real-time pipeline can catch up very fast after the spike passes.
  • Incorrect data introduced by release: This is usually caught in staging vs prod monitoring, data coverage check in base index workflow, or performance degradation in prod. We can temporarily shut down the realtime serving and roll back base index to a good version. To clean up the polluted data in Storage Repo, we can trigger the refresh workflow to reprocess all documents. If we know only partial data is polluted, we can refresh the selective documents to shorten the recovery time.
  • Data quality/coverage drop: This often starts in the external pull-type data sources and get caught in the coverage check in the base index workflow. We can rollback external data sources and backfill the data through refresh workflow. Because the backfill goes through the low priority pipeline, it won’t impact the high and medium priority pipelines to slow down the processing of updates from critical data sources.

Brief technical discussions

Throughout the process of building the ads indexing system from scratch to fully meet all business requirements with production maturity, we made a few design decisions to balance between system flexibility, scalability, stability as well as implementation difficulty.

Data joining via remote Storage Vs multi-stream joining with local storage: Ads indexing involves many data joining operations upon update notifications as the ads document enrichment needs data joined on several keys such as adID, pinID, imageSignatureID, advertiserID, etc. We chose to do the data joining via fetching them from remote Storage with transactions instead of doing multi-streaming joining from local state. By doing so, the system provides more flexibility for distributed data processing applications as they don’t need to choose data partition key to avoid racing conditions. As the tradeoff, it pays an extra roundtrip for data fetching. But since the ads indexing processor is computational heavy so that this roundtrip cost is a tiny proportion of the total infra cost.

Decoupled services Vs merged services: During the design phase, we did consider merging services to have fewer components in ads indexing system, which is a tradeoff between system isolation and simplicity. Gateway and Updater could be merged to save one message hop. However, we chose to separate them as Gateway is a streaming process service while Updater is more on the data ingestion service. Now Gateway and Updater have quite different paths for evolving. Gateway became a popular lightweight streamer framework for a few other use cases, while Updater starts to ingest Ads content updates as they don’t need signal materialization as ads control data. Updater and Argus could be merged as they are both built on Kafka consumers with interactions to Storage Repo. We chose to have them as separate services, because they have quite different directions for performance optimizations; Updater is a very light weighted data extractor and sinker, while Argus is a heavy lifting computational component.

System scalability: there is one scalability bottleneck in ads indexing system as Omid relies on a centralized transaction manager for timestamp allocation and conflict resolution. To address this scalability limitation, we customized Omid at Pinterest to improve its throughout for better scalability, but its implementation details are beyond the scope of this tech blog.

What’s next

The ads indexing system has evolved since its initial launch. We’ve been focusing on improving system stability and visibility. In addition, some of the core components are adapted by other application as they are designed to be running independently from the beginning. For example, Gateway is widely used as the lightweight streaming framework in ads retargeting and advertiser experiences. Moving forward, we plan to focus on improving system usability to further improve developer velocity in the following areas:

  • Build a centralized config-based component for developers to manage data flows through ads indexing system in one single place. Right now the developers have to config in multiple places to cherry pick signals/fields among different data schemas. We also plan to build an assistant visualization tool based on configs for developers to easily browse the existing signals and their dependencies for data explorations.
  • Enhance the pipeline health monitoring by integrating with the probing type test framework, as the current data health are mostly relying on the aggregation data view. Vertical feature developers have hard times to leverage aggregated metrics for monitoring, and they have to run ad hoc offline integration tests or data analysis reports to monitor the health of narrowed data segments. The probing framework can allow developers to launch their real-time validation tests on particular type of documents in production.

Acknowledgments: Many engineers at Pinterest worked together to build ads indexing system, including Ji Hong, Grace Chin, Sreshta Vijayaraghavan, Lianghong Xu, Qi Li, Kapil Bajaj, Kevin Lin, Mingsi Liu, Mingjian Liu, Susan Liu and Ang Xu. Huge thanks to Caijie Zhang, Sam Meder, Chiyoung Seo, Zheng Liu, Zack Drach, Liquan Pei and the entire Ads Infra team for design discussions. We would also like to thank our peer teams for supporting — Storage & Caching, Logging, Search Infra, Serving Systems, Ads SRE, Dev Tool, etc.

We’re building the world’s first visual discovery engine. More than 320 million people around the world use Pinterest to dream about, plan and prepare for things they want to do in life. Come join us!

Pinterest
Pinterest's profile on StackShare is not actively maintained, so the information here may be out of date.
Tools mentioned in article
Open jobs at Pinterest
Sr. Staff Software Engineer, Ads ML I...
San Francisco, CA, US; , CA, US
<div class="content-intro"><p><strong>About Pinterest</strong><span style="font-weight: 400;">:&nbsp;&nbsp;</span></p> <p>Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love.&nbsp;In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping&nbsp;Pinners&nbsp;make their lives better in the positive corner of the internet.</p> <p>Creating a life you love also means finding a career that celebrates the unique perspectives and experiences that you bring. As you read through the expectations of the position, consider how your skills and experiences may complement the responsibilities of the role. We encourage you to think through your relevant and transferable skills from prior experiences.</p> <p><em>Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our </em><a href="https://www.pinterestcareers.com/pinflex/" target="_blank"><em><u>PinFlex</u></em></a><em> landing page to learn more.&nbsp;</em></p></div><p>Pinterest is one of the fastest growing online advertising platforms. Continued success depends on the machine-learning systems, which crunch thousands of signals in a few hundred milliseconds, to identify the most relevant ads to show to pinners. You’ll join a talented team with high impact, which designs high-performance and efficient ML systems, in order to power the most critical and revenue-generating models at Pinterest.</p> <p><strong>What you’ll do</strong></p> <ul> <li>Being the technical leader of the Ads ML foundation evolution movement to 2x Pinterest revenue and 5x ad performance in next 3 years.</li> <li>Opportunities to use cutting edge ML technologies including GPU and LLMs to empower 100x bigger models in next 3 years.&nbsp;</li> <li>Tons of ambiguous problems and you will be tasked with building 0 to 1 solutions for all of them.</li> </ul> <p><strong>What we’re looking for:</strong></p> <ul> <li>BS (or higher) degree in Computer Science, or a related field.</li> <li>10+ years of relevant industry experience in leading the design of large scale &amp; production ML infra systems.</li> <li>Deep knowledge with at least one state-of-art programming language (Java, C++, Python).&nbsp;</li> <li>Deep knowledge with building distributed systems or recommendation infrastructure</li> <li>Hands-on experience with at least one modeling framework (Pytorch or Tensorflow).&nbsp;</li> <li>Hands-on experience with model / hardware accelerator libraries (Cuda, Quantization)</li> <li>Strong communicator and collaborative team player.</li> </ul><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.</p> <p><em><span style="font-weight: 400;">Information regarding the culture at Pinterest and benefits available for this position can be found <a href="https://www.pinterestcareers.com/pinterest-life/" target="_blank">here</a>.</span></em></p></div><div class="title">US based applicants only</div><div class="pay-range"><span>$135,150</span><span class="divider">&mdash;</span><span>$278,000 USD</span></div></div></div><div class="content-conclusion"><p><strong>Our Commitment to Diversity:</strong></p> <p>Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require an accommodation during the job application process, please notify&nbsp;<a href="mailto:accessibility@pinterest.com">accessibility@pinterest.com</a>&nbsp;for support.</p></div>
Senior Staff Machine Learning Enginee...
San Francisco, CA, US; , CA, US
<div class="content-intro"><p><strong>About Pinterest</strong><span style="font-weight: 400;">:&nbsp;&nbsp;</span></p> <p>Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love.&nbsp;In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping&nbsp;Pinners&nbsp;make their lives better in the positive corner of the internet.</p> <p>Creating a life you love also means finding a career that celebrates the unique perspectives and experiences that you bring. As you read through the expectations of the position, consider how your skills and experiences may complement the responsibilities of the role. We encourage you to think through your relevant and transferable skills from prior experiences.</p> <p><em>Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our </em><a href="https://www.pinterestcareers.com/pinflex/" target="_blank"><em><u>PinFlex</u></em></a><em> landing page to learn more.&nbsp;</em></p></div><p>We are looking for a highly motivated and experienced Machine Learning Engineer to join our team and help us shape the future of machine learning at Pinterest. In this role, you will tackle new challenges in machine learning that will have a real impact on the way people discover and interact with the world around them.&nbsp; You will collaborate with a world-class team of research scientists and engineers to develop new machine learning algorithms, systems, and applications that will bring step-function impact to the business metrics (recent publications <a href="https://arxiv.org/abs/2205.04507">1</a>, <a href="https://dl.acm.org/doi/abs/10.1145/3523227.3547394">2</a>, <a href="https://arxiv.org/abs/2306.00248">3</a>).&nbsp; You will also have the opportunity to work on a variety of exciting projects in the following areas:&nbsp;</p> <ul> <li>representation learning</li> <li>recommender systems</li> <li>graph neural network</li> <li>natural language processing (NLP)</li> <li>inclusive AI</li> <li>reinforcement learning</li> <li>user modeling</li> </ul> <p>You will also have the opportunity to mentor junior researchers and collaborate with external researchers on cutting-edge projects.&nbsp;&nbsp;</p> <p><strong>What you'll do:&nbsp;</strong></p> <ul> <li>Lead cutting-edge research in machine learning and collaborate with other engineering teams to adopt the innovations into Pinterest problems</li> <li>Collect, analyze, and synthesize findings from data and build intelligent data-driven model</li> <li>Scope and independently solve moderately complex problems; write clean, efficient, and sustainable code</li> <li>Use machine learning, natural language processing, and graph analysis to solve modeling and ranking problems across growth, discovery, ads and search</li> </ul> <p><strong>What we're looking for:</strong></p> <ul> <li>Mastery of at least one systems languages (Java, C++, Python) or one ML framework (Pytorch, Tensorflow, MLFlow)</li> <li>Experience in research and in solving analytical problems</li> <li>Strong communicator and team player. Being able to find solutions for open-ended problems</li> <li>8+ years working experience in the r&amp;d or engineering teams that build large-scale ML-driven projects</li> <li>3+ years experience leading cross-team engineering efforts that improves user experience in products</li> <li>MS/PhD in Computer Science, ML, NLP, Statistics, Information Sciences or related field</li> </ul> <p><strong>Desired skills:</strong></p> <ul> <li>Strong publication track record and industry experience in shipping machine learning solutions for large-scale challenges&nbsp;</li> <li>Cross-functional collaborator and strong communicator</li> <li>Comfortable solving ambiguous problems and adapting to a dynamic environment</li> </ul> <p>This position is not eligible for relocation assistance.</p> <p>#LI-SA1</p> <p>#LI-REMOTE</p><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.</p> <p><em><span style="font-weight: 400;">Information regarding the culture at Pinterest and benefits available for this position can be found <a href="https://www.pinterestcareers.com/pinterest-life/" target="_blank">here</a>.</span></em></p></div><div class="title">US based applicants only</div><div class="pay-range"><span>$158,950</span><span class="divider">&mdash;</span><span>$327,000 USD</span></div></div></div><div class="content-conclusion"><p><strong>Our Commitment to Diversity:</strong></p> <p>Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require an accommodation during the job application process, please notify&nbsp;<a href="mailto:accessibility@pinterest.com">accessibility@pinterest.com</a>&nbsp;for support.</p></div>
Staff Software Engineer, ML Training
San Francisco, CA, US; , CA, US
<div class="content-intro"><p><strong>About Pinterest</strong><span style="font-weight: 400;">:&nbsp;&nbsp;</span></p> <p>Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love.&nbsp;In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping&nbsp;Pinners&nbsp;make their lives better in the positive corner of the internet.</p> <p>Creating a life you love also means finding a career that celebrates the unique perspectives and experiences that you bring. As you read through the expectations of the position, consider how your skills and experiences may complement the responsibilities of the role. We encourage you to think through your relevant and transferable skills from prior experiences.</p> <p><em>Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our </em><a href="https://www.pinterestcareers.com/pinflex/" target="_blank"><em><u>PinFlex</u></em></a><em> landing page to learn more.&nbsp;</em></p></div><p>The ML Platform team provides foundational tools and infrastructure used by hundreds of ML engineers across Pinterest, including recommendations, ads, visual search, growth/notifications, trust and safety. We aim to ensure that ML systems are healthy (production-grade quality) and fast (for modelers to iterate upon).</p> <p>We are seeking a highly skilled and experienced Staff Software Engineer to join our ML Training Infrastructure team and lead the technical strategy. The ML Training Infrastructure team builds platforms and tools for large-scale training and inference, model lifecycle management, and deployment of models across Pinterest. ML workloads are increasingly large, complex, interdependent and the efficient use of ML accelerators is critical to our success. We work on various efforts related to adoption, efficiency, performance, algorithms, UX and core infrastructure to enable the scheduling of ML workloads.</p> <p>You’ll be part of the ML Platform team in Data Engineering, which aims to ensure healthy and fast ML in all of the 40+ ML use cases across Pinterest.</p> <p><strong>What you’ll do:</strong></p> <ul> <li>Implement cost effective and scalable solutions to allow ML engineers to scale their ML training and inference workloads on compute platforms like Kubernetes.</li> <li>Lead and contribute to key projects; rolling out GPU sharing via MIGs and MPS , intelligent resource management, capacity planning, fault tolerant training.</li> <li>Lead the technical strategy and set the multi-year roadmap for ML Training Infrastructure that includes ML Compute and ML Developer frameworks like PyTorch, Ray and Jupyter.</li> <li>Collaborate with internal clients, ML engineers, and data scientists to address their concerns regarding ML development velocity and enable the successful implementation of customer use cases.</li> <li>Forge strong partnerships with tech leaders in the Data and Infra organizations to develop a comprehensive technical roadmap that spans across multiple teams.</li> <li>Mentor engineers within the team and demonstrate technical leadership.</li> </ul> <p><strong>What we’re looking for:</strong></p> <ul> <li>7+ years of experience in software engineering and machine learning, with a focus on building and maintaining ML infrastructure or Batch Compute infrastructure like YARN/Kubernetes/Mesos.</li> <li>Technical leadership experience, devising multi-quarter technical strategies and driving them to success.</li> <li>Strong understanding of High Performance Computing and/or and parallel computing.</li> <li>Ability to drive cross-team projects; Ability to understand our internal customers (ML practitioners and Data Scientists), their common usage patterns and pain points.</li> <li>Strong experience in Python and/or experience with other programming languages such as C++ and Java.</li> <li>Experience with GPU programming, containerization, orchestration technologies is a plus.</li> <li>Bonus point for experience working with cloud data processing technologies (Apache Spark, Ray, Dask, Flink, etc.) and ML frameworks such as PyTorch.</li> </ul> <p>This position is not eligible for relocation assistance.</p> <p>#LI-REMOTE</p> <p><span data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;#LI-AH2&quot;}" data-sheets-userformat="{&quot;2&quot;:14464,&quot;10&quot;:2,&quot;14&quot;:{&quot;1&quot;:2,&quot;2&quot;:0},&quot;15&quot;:&quot;Helvetica Neue&quot;,&quot;16&quot;:12}">#LI-AH2</span></p><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.</p> <p><em><span style="font-weight: 400;">Information regarding the culture at Pinterest and benefits available for this position can be found <a href="https://www.pinterestcareers.com/pinterest-life/" target="_blank">here</a>.</span></em></p></div><div class="title">US based applicants only</div><div class="pay-range"><span>$135,150</span><span class="divider">&mdash;</span><span>$278,000 USD</span></div></div></div><div class="content-conclusion"><p><strong>Our Commitment to Diversity:</strong></p> <p>Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require an accommodation during the job application process, please notify&nbsp;<a href="mailto:accessibility@pinterest.com">accessibility@pinterest.com</a>&nbsp;for support.</p></div>
Distinguished Engineer, Frontend
San Francisco, CA, US; , US
<div class="content-intro"><p><strong>About Pinterest</strong><span style="font-weight: 400;">:&nbsp;&nbsp;</span></p> <p>Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love.&nbsp;In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping&nbsp;Pinners&nbsp;make their lives better in the positive corner of the internet.</p> <p>Creating a life you love also means finding a career that celebrates the unique perspectives and experiences that you bring. As you read through the expectations of the position, consider how your skills and experiences may complement the responsibilities of the role. We encourage you to think through your relevant and transferable skills from prior experiences.</p> <p><em>Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our </em><a href="https://www.pinterestcareers.com/pinflex/" target="_blank"><em><u>PinFlex</u></em></a><em> landing page to learn more.&nbsp;</em></p></div><p>As a Distinguished Engineer at Pinterest, you will play a pivotal role in shaping the technical direction of our platform, driving innovation, and providing leadership to our engineering teams. You'll be at the forefront of developing cutting-edge solutions that impact millions of users.</p> <p><strong>What you’ll do:</strong></p> <ul> <li>Advise executive leadership on highly complex, multi-faceted aspects of the business, with technological and cross-organizational impact.</li> <li>Serve as a technical mentor and role model for engineering teams, fostering a culture of excellence.</li> <li>Develop cutting-edge innovations with global impact on the business and anticipate future technological opportunities.</li> <li>Serve as strategist to translate ideas and innovations into outcomes, influencing and driving objectives across Pinterest.</li> <li>Embed systems and processes that develop and connect teams across Pinterest to harness the diversity of thought, experience, and backgrounds of Pinployees.</li> <li>Integrate velocity within Pinterest; mobilizing the organization by removing obstacles and enabling teams to focus on achieving results for the most important initiatives.</li> </ul> <p>&nbsp;<strong>What we’re looking for:</strong>:</p> <ul> <li>Proven experience as a distinguished engineer, fellow, or similar role in a technology company.</li> <li>Recognized as a pioneer and renowned technical authority within the industry, often globally, requiring comprehensive expertise in leading-edge theories and technologies.</li> <li>Deep technical expertise and thought leadership that helps accelerate adoption of the very best engineering practices, while maintaining knowledge on industry innovations, trends and practices.</li> <li>Ability to effectively communicate with and influence key stakeholders across the company, at all levels of the organization.</li> <li>Experience partnering with cross-functional project teams on initiatives with significant global impact.</li> <li>Outstanding problem-solving and analytical skills.</li> </ul> <p>&nbsp;</p> <p>This position is not eligible for relocation assistance.</p> <p>&nbsp;</p> <p>#LI-REMOTE</p> <p>#LI-NB1</p><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.</p> <p><em><span style="font-weight: 400;">Information regarding the culture at Pinterest and benefits available for this position can be found <a href="https://www.pinterestcareers.com/pinterest-life/" target="_blank">here</a>.</span></em></p></div><div class="title">US based applicants only</div><div class="pay-range"><span>$242,029</span><span class="divider">&mdash;</span><span>$498,321 USD</span></div></div></div><div class="content-conclusion"><p><strong>Our Commitment to Diversity:</strong></p> <p>Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require an accommodation during the job application process, please notify&nbsp;<a href="mailto:accessibility@pinterest.com">accessibility@pinterest.com</a>&nbsp;for support.</p></div>
You may also like