Improving Distributed Caching Performance and Efficiency at Pinterest

940
Pinterest
Pinterest's profile on StackShare is not actively maintained, so the information here may be out of date.

Kevin Lin | Software Engineer, Storage and Caching


Introduction

Pinterest’s distributed caching system, built on top of open source technologies memcached and mcrouter, is a critical component of the production infrastructure stack. Pinterest’s cache-as-a-service platform is responsible for driving down application latency across the board, reducing the overall cloud cost footprint, and ensuring adherence to strict sitewide availability targets.

Today, Pinterest’s memcached fleet spans over 5000 EC2 instances across a variety of instance types optimized along compute, memory, and storage dimensions. Collectively, the fleet serves up to ~180 million requests per second and ~220 GB/s of network throughput over a ~460 TB active in-memory and on-disk dataset, partitioned among ~70 distinct clusters.

As a core driver of reduced sitewide latency, the distributed caching tier is subject to stringent performance and latency requirements. Additionally, a key consequence of the sheer size of the fleet is that even small efficiency optimizations have an outsized impact on the total service cost footprint. Several years of operational experience running memcached at scale in production have provided unique insight into practical optimizations for driving improved performance and efficiency across the entire caching stack.

In this article, we will share some context on the observability and performance testing tools that enable optimization exploration work, followed by a deep dive into practical optimizations currently running in our production environment along dimensions of hardware selection strategy, compute efficiency, and networking performance.

High-level description of the available surface area for performance optimization for memcached running on virtual machines in public cloud environments

Monitoring, observability, and evaluation

All performance optimization efforts start with precise quantitative measurement and a structured, reproducible mechanism for generating workloads for isolated evaluation.

Critical monitoring prerequisites for all the performance evaluation conducted over the years include:

  • Server-side metrics for request throughput, network throughput, resource utilization, and hardware-level parameters (NIC statistics like per-queue packet throughput and EC2 allowance exhaustion, disk response times and in-flight I/O requests, etc.)
  • Client-side metrics for cache request percentile latency, timeout and error rates, and per-server availability (SLIs), as well as top-level application performance indicators like service RPC P99 response time

Pinterest leverages both synthetic load generation and production shadow traffic to evaluate the impact of tuning and optimizations. Historically, synthetic benchmarking has been useful for detecting performance regressions or improvements under maximum load, while shadow traffic evaluation has been more reflective of server resource utilization and overall performance under a real workload at scale.

  • Synthetic load generation: memtier_benchmark is an open source tool capable of generating a synthetic load against a memcached cluster with configurable parameters for the number of concurrent clients and threads, read/write ratio, data sizes, and transport mechanism (plaintext or TLS).
  • Production shadow traffic: mcrouter is an open source memcache-protocol routing proxy deployed as a client-side sidecar in the Pinterest fleet. It provides building blocks to design transparent shadow traffic routing policies with configurable traffic percentages and source/target cluster(s), allowing for flexible dark traffic experimentation across a variety of workload classes.

Together, these tools permit high-signal performance evaluation with zero or minimal impact to critical-path production traffic.

Performance and efficiency

Cloud hardware

Distributed caching at Pinterest serves a diverse array of workloads. In general, each class of workload can be categorized along the following high-level dimensions:

  • Throughput (compute)
  • Data volume (memory and/or disk capacity)
  • Data bandwidth (network and compute)
  • Latency requirement (compute)

While memcached can be arbitrarily horizontally scaled in and out to address a particular cluster’s bottleneck, vertically scaling individual hardware dimensions allows for greater cost efficiency for specific workloads. In practice, this entails standardization on a fixed pool of EC2 instance types optimized for each workload class.

Workload profile: Moderate throughput, moderate data volume

EC2 instance family: r5

Rationale: r5 family instances offer a vCPU-DRAM ratio that works well for most vanilla cache use cases at Pinterest. This instance type is considered the “baseline” against which others are evaluated.

Workload profile: High throughput, low data volume

EC2 instance family: c5

Rationale: c5 family instances are more cost efficient for use cases that would otherwise slot into the r5 type but hold significantly less memory. Maintaining the same vCPU count as its r5 counterpart allows it to serve the same throughput volume at a lower overall cost.

Workload profile: High data volume, relaxed latency requirement

EC2 instance family: r5d

Rationale: r5d family instances are functionally equivalent to r5 family instances but with an instance-colocated NVMe SSD used by extstore for secondary storage. r5d is cost efficient for clusters with high data volume such that there are tangible improvements to hit rate as data is written to disk. Due to the slower disk (relative to i3 family instances), higher tail latency is expected.

Workload profile: Massive data volume, relaxed latency requirement

EC2 instance family: i3 and i3en

Rationale: i3 and i3en family instances ship with a fast and sizable instance-colocated disk, which tangibly increases extstore performance for workloads with a very high ratio of working data on disk relative to DRAM. Additionally, they offer comparable memory capacity to r5 series instances, which reduces extstore thrashing by maintaining a reasonable DRAM to disk usage ratio.

EC2 instance type distribution for the Pinterest memcached fleet

In particular, using extstore to expand storage capacity beyond DRAM into a local NVMe flash disk tier increases per-instance storage efficiency by up to several orders of magnitude, and it reduces the associated cluster cost footprint proportionally. EC2’s storage-optimized instance types provide locally attached solid state drives capable of high random IOPS and R/W throughput, allowing onboarding of extstore use cases with massive data volumes and high request throughput without compromising tail latency.

The introduction of different shapes of storage-optimized EC2 instance types to the fleet (in particular, the lower-tier variants of the i3en instance family containing multiple independent disks per instance) further drives down costs while offering improvements in I/O performance and cost efficiency. Pinterest configures these instances with Linux software RAID at level RAID0 to combine multiple hardware block devices into a single, logical disk for userspace consumption. By striping reads and writes fairly across two disks, RAID0 doubles maximum theoretical I/O throughput with a best-case two-fold reduction in effective disk response time at the cost of a doubled MTTF. This increased hardware performance for extstore at the expense of an increased theoretical failure rate is a highly worthwhile tradeoff. Operating workloads on a public cloud necessitates designing infrastructure to be ephemeral cattle, capable of self-healing in the event of instance failures. A topology control plane for mcrouter automatically and gracefully responds to unexpected changes in server capacity; instance loss is a non-issue.

Compute

Approximately half of all caching workloads at Pinterest are compute-bound (i.e. purely request throughput-bound). Successful optimizations in compute efficiency translate into the ability to downsize clusters without compromising serving capacity.

More precisely, compute efficiency for memcached is defined as the additional rate of requests that can be serviced by a single instance for each percentage point increase in instance CPU usage, without increasing request latency. In simpler terms, an optimization that improves compute efficiency is one that allows memcached to serve a higher request rate at lower CPU usage, without changing request latency characteristics.

At Pinterest, most workloads (including the distributed cache fleet) run on dedicated EC2 virtual machines. Many historical efficiency improvements stem from optimizations in the hardware layer itself, like migrating to different instance families or upgrading to newer generations of existing instance types. However, operating workloads on dedicated (virtualized) machines offers unique opportunities for optimizations at the hardware-software boundary.

Memcached is somewhat unique among stateful data systems at Pinterest in that it is the exclusive primary workload, with a static set of long-lived worker threads, on every EC2 instance on which it is deployed. This is in contrast to database workloads which might have, for example, multiple colocated processes for decoupled storage and serving layers. To this end, one simple but highly effective optimization is tuning process scheduling in order to request the kernel prioritize CPU time for memcached at the expense of deliberately withholding CPU time from other processes on the host, like monitoring daemons. This involves running memcached under a real-time scheduling policy, SCHED_FIFO, with a high priority — instructing the kernel to, effectively, allow memcached to monopolize the CPU by preempting (and thus deliberately starving) all non-realtime processes whenever a memcached thread becomes runnable.

$ sudo chrt — — fifo <priority> memcached …

Example invocation of memcached under a SCHED_FIFO real-time scheduling policy

This one-line change, after rollout to all compute-bound clusters, drove client-side P99 latency down by anywhere between 10% and 40%, in addition to eliminating spurious spikes in P99 and P999 latency across the board. Additionally, it afforded the ability to raise the steady-state operation CPU usage ceiling by 20% without introducing latency regressions. Ultimately, this shaved close to 10% off memcached’s total fleet-wide cost footprint.

Week-over-week comparison of client-side P99 cache latency for one service while real-time scheduling was rolled out to its corresponding dedicated memcached cluster

Ratio of time spent by memcached waiting for execution by the kernel versus wall clock time, before and after real-time scheduling was enabled (data is collected from schedstat in the /proc filesystem)

Stabilization of spurious latency spikes after real-time scheduling was enabled on a canary host (red-colored series)

Networking

There are a few key dimensions when considering networking performance:

  • Data bandwidth, packet throughput, and TCP connection limits. EC2 imposes hard limits on per-instance PPS, aggregate bandwidth, and TCP connections (though only when deployed in a security group with TCP ingress rules). Excess usage beyond these limits is reported by the Elastic Network Adapter (ENA) driver and accessible via ethtool. Confusingly, EC2 also expresses total NIC bandwidth capabilities in terms of burst loads rather than steady-state loads, thus requiring some degree of trial-and-error to determine the practical bandwidth ceiling for workloads like memcached with predictable network characteristics.
  • Connection latency and reliability. Is there a way to make initial TCP connections to memcached faster and more reliable, especially under burst scenarios where thousands of clients are simultaneously establishing connections?
  • Overhead associated with transport-layer features like TLS. Is there a way to reduce the encryption/decryption compute overhead of TLS? Additionally, is there a way to reduce the cost of the initial setup cost (i.e. TLS handshake)?

From a cloud consumer’s perspective, EC2-enforced network limits can and should effectively be considered inherent hardware limitations. Unfortunately, there is no mechanism to work around these limits other than horizontally scaling the fleet to reduce per-instance usage.

In Pinterest’s caching architecture, mcrouter is a universal routing proxy and the single application-facing entry point into the distributed caching tier. Each mcrouter instance (effectively, every individual host in a service cluster) creates a statically sized, long-lived TCP connection pool to every individual memcached server in a cluster. Connection pool sizes are deterministically derived from the number of logical cores available on the host system, typically ranging from 8 to 72 for canonical instance types. This results in upwards of tens of thousands of active established TCP connections per server host, and easily over a million total connections per server cluster — necessitating a strategy for maintaining minimal connection latency and connection reliability at scale.

TCP Fast Open (TFO) is a mechanism for reducing the latency overhead of establishing a TCP connection by optimizing away one RTT in an otherwise costly TCP 3WHS (3-way handshake), while also allowing eager transmission of early data during the handshake itself. While originally intended for end users on unreliable home and mobile networks connecting to remote edge servers, TFO has demonstrated tangible improvements in connection reliability in a closed cloud environment as well. Implementing TFO support in memcached reduced average TCP connection durations of successive sessions by ~10%, most prominently in connections established across an Availability Zone boundary.

Packets exchanged between client and server during TFO cookie setup and a subsequent TFO-initiated session with early data

Separately, raising the sysctl parameter value for net.core.somaxconn and associated listen backlog size in the userspace listen(2) callsite in memcached improved burst connection availability for high-throughput clusters. Previously, deploying a new memcached binary would cause spikes in ECONNREFUSED server errors caused by exhausted server-side TCP accept queues driven by thundering herds of simultaneous inbound connections from thousands of client mcrouter instances. A more generous listen backlog threshold reduced per-server downtime and fixed the brief but frequent SLO violations whenever a shared tenancy cluster was deployed.

Lastly, TLS plays an important role for in-transit data encryption between memcached and mcrouter, and it is enabled for 100% of cache traffic within Pinterest in order to comply with sitewide authentication, authorization, and auditing policies. Even with hardware-accelerated cryptography, TLS adds non-trivial initial and steady-state overhead, due to a post-connect TLS handshake and application-layer encryption/decryption during network I/O, respectively. TLS session resumption, after implementation in memcached, reduced fleet-wide client-side connection timeout rates by allowing reuse of previously cached TLS sessions. One avenue for tackling steady-state overhead is kernel TLS (kTLS) — a mechanism to offload the TLS record layer from userspace to the kernel, implemented either in software or offloaded to supported dedicated NIC hardware for completely transparent inline data encryption/decryption. TLS session resumption was upstreamed by Pinterest to memcached and is available in version 1.6.3 onward; kTLS is an ongoing and relatively experimental optimization area.

Future work

Infrastructure optimization is a critical objective for Pinterest that ultimately drives a more delightful experience for Pinners while reducing our own cloud cost footprint. We look forward to continuing to explore avenues for improving cache performance and efficiency at all layers of the stack, from application clients and routing proxies to the servers themselves. In the near term, we intend to continue evaluation of software kernel TLS, explore compatibility of memcached with newer generations of EC2 instance types for improved price-to-performance characteristics, and application/proxy-side software optimizations like in-flight compression for improved storage efficiency. We hope to additionally build an end-to-end automated performance regression testing framework to track the impact of these optimizations over time.

Thanks to the entire Storage and Caching team at Pinterest for supporting this work, especially Ankita Girish Wagh and Lianghong Xu.

Pinterest
Pinterest's profile on StackShare is not actively maintained, so the information here may be out of date.
Tools mentioned in article
Open jobs at Pinterest
Sr. Staff Software Engineer, Ads ML I...
San Francisco, CA, US; , CA, US
<div class="content-intro"><p><strong>About Pinterest</strong><span style="font-weight: 400;">:&nbsp;&nbsp;</span></p> <p>Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love.&nbsp;In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping&nbsp;Pinners&nbsp;make their lives better in the positive corner of the internet.</p> <p>Creating a life you love also means finding a career that celebrates the unique perspectives and experiences that you bring. As you read through the expectations of the position, consider how your skills and experiences may complement the responsibilities of the role. We encourage you to think through your relevant and transferable skills from prior experiences.</p> <p><em>Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our </em><a href="https://www.pinterestcareers.com/pinflex/" target="_blank"><em><u>PinFlex</u></em></a><em> landing page to learn more.&nbsp;</em></p></div><p>Pinterest is one of the fastest growing online advertising platforms. Continued success depends on the machine-learning systems, which crunch thousands of signals in a few hundred milliseconds, to identify the most relevant ads to show to pinners. You’ll join a talented team with high impact, which designs high-performance and efficient ML systems, in order to power the most critical and revenue-generating models at Pinterest.</p> <p><strong>What you’ll do</strong></p> <ul> <li>Being the technical leader of the Ads ML foundation evolution movement to 2x Pinterest revenue and 5x ad performance in next 3 years.</li> <li>Opportunities to use cutting edge ML technologies including GPU and LLMs to empower 100x bigger models in next 3 years.&nbsp;</li> <li>Tons of ambiguous problems and you will be tasked with building 0 to 1 solutions for all of them.</li> </ul> <p><strong>What we’re looking for:</strong></p> <ul> <li>BS (or higher) degree in Computer Science, or a related field.</li> <li>10+ years of relevant industry experience in leading the design of large scale &amp; production ML infra systems.</li> <li>Deep knowledge with at least one state-of-art programming language (Java, C++, Python).&nbsp;</li> <li>Deep knowledge with building distributed systems or recommendation infrastructure</li> <li>Hands-on experience with at least one modeling framework (Pytorch or Tensorflow).&nbsp;</li> <li>Hands-on experience with model / hardware accelerator libraries (Cuda, Quantization)</li> <li>Strong communicator and collaborative team player.</li> </ul><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.</p> <p><em><span style="font-weight: 400;">Information regarding the culture at Pinterest and benefits available for this position can be found <a href="https://www.pinterestcareers.com/pinterest-life/" target="_blank">here</a>.</span></em></p></div><div class="title">US based applicants only</div><div class="pay-range"><span>$135,150</span><span class="divider">&mdash;</span><span>$278,000 USD</span></div></div></div><div class="content-conclusion"><p><strong>Our Commitment to Diversity:</strong></p> <p>Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require an accommodation during the job application process, please notify&nbsp;<a href="mailto:accessibility@pinterest.com">accessibility@pinterest.com</a>&nbsp;for support.</p></div>
Senior Staff Machine Learning Enginee...
San Francisco, CA, US; , CA, US
<div class="content-intro"><p><strong>About Pinterest</strong><span style="font-weight: 400;">:&nbsp;&nbsp;</span></p> <p>Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love.&nbsp;In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping&nbsp;Pinners&nbsp;make their lives better in the positive corner of the internet.</p> <p>Creating a life you love also means finding a career that celebrates the unique perspectives and experiences that you bring. As you read through the expectations of the position, consider how your skills and experiences may complement the responsibilities of the role. We encourage you to think through your relevant and transferable skills from prior experiences.</p> <p><em>Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our </em><a href="https://www.pinterestcareers.com/pinflex/" target="_blank"><em><u>PinFlex</u></em></a><em> landing page to learn more.&nbsp;</em></p></div><p>We are looking for a highly motivated and experienced Machine Learning Engineer to join our team and help us shape the future of machine learning at Pinterest. In this role, you will tackle new challenges in machine learning that will have a real impact on the way people discover and interact with the world around them.&nbsp; You will collaborate with a world-class team of research scientists and engineers to develop new machine learning algorithms, systems, and applications that will bring step-function impact to the business metrics (recent publications <a href="https://arxiv.org/abs/2205.04507">1</a>, <a href="https://dl.acm.org/doi/abs/10.1145/3523227.3547394">2</a>, <a href="https://arxiv.org/abs/2306.00248">3</a>).&nbsp; You will also have the opportunity to work on a variety of exciting projects in the following areas:&nbsp;</p> <ul> <li>representation learning</li> <li>recommender systems</li> <li>graph neural network</li> <li>natural language processing (NLP)</li> <li>inclusive AI</li> <li>reinforcement learning</li> <li>user modeling</li> </ul> <p>You will also have the opportunity to mentor junior researchers and collaborate with external researchers on cutting-edge projects.&nbsp;&nbsp;</p> <p><strong>What you'll do:&nbsp;</strong></p> <ul> <li>Lead cutting-edge research in machine learning and collaborate with other engineering teams to adopt the innovations into Pinterest problems</li> <li>Collect, analyze, and synthesize findings from data and build intelligent data-driven model</li> <li>Scope and independently solve moderately complex problems; write clean, efficient, and sustainable code</li> <li>Use machine learning, natural language processing, and graph analysis to solve modeling and ranking problems across growth, discovery, ads and search</li> </ul> <p><strong>What we're looking for:</strong></p> <ul> <li>Mastery of at least one systems languages (Java, C++, Python) or one ML framework (Pytorch, Tensorflow, MLFlow)</li> <li>Experience in research and in solving analytical problems</li> <li>Strong communicator and team player. Being able to find solutions for open-ended problems</li> <li>8+ years working experience in the r&amp;d or engineering teams that build large-scale ML-driven projects</li> <li>3+ years experience leading cross-team engineering efforts that improves user experience in products</li> <li>MS/PhD in Computer Science, ML, NLP, Statistics, Information Sciences or related field</li> </ul> <p><strong>Desired skills:</strong></p> <ul> <li>Strong publication track record and industry experience in shipping machine learning solutions for large-scale challenges&nbsp;</li> <li>Cross-functional collaborator and strong communicator</li> <li>Comfortable solving ambiguous problems and adapting to a dynamic environment</li> </ul> <p>This position is not eligible for relocation assistance.</p> <p>#LI-SA1</p> <p>#LI-REMOTE</p><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.</p> <p><em><span style="font-weight: 400;">Information regarding the culture at Pinterest and benefits available for this position can be found <a href="https://www.pinterestcareers.com/pinterest-life/" target="_blank">here</a>.</span></em></p></div><div class="title">US based applicants only</div><div class="pay-range"><span>$158,950</span><span class="divider">&mdash;</span><span>$327,000 USD</span></div></div></div><div class="content-conclusion"><p><strong>Our Commitment to Diversity:</strong></p> <p>Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require an accommodation during the job application process, please notify&nbsp;<a href="mailto:accessibility@pinterest.com">accessibility@pinterest.com</a>&nbsp;for support.</p></div>
Staff Software Engineer, ML Training
San Francisco, CA, US; , CA, US
<div class="content-intro"><p><strong>About Pinterest</strong><span style="font-weight: 400;">:&nbsp;&nbsp;</span></p> <p>Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love.&nbsp;In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping&nbsp;Pinners&nbsp;make their lives better in the positive corner of the internet.</p> <p>Creating a life you love also means finding a career that celebrates the unique perspectives and experiences that you bring. As you read through the expectations of the position, consider how your skills and experiences may complement the responsibilities of the role. We encourage you to think through your relevant and transferable skills from prior experiences.</p> <p><em>Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our </em><a href="https://www.pinterestcareers.com/pinflex/" target="_blank"><em><u>PinFlex</u></em></a><em> landing page to learn more.&nbsp;</em></p></div><p>The ML Platform team provides foundational tools and infrastructure used by hundreds of ML engineers across Pinterest, including recommendations, ads, visual search, growth/notifications, trust and safety. We aim to ensure that ML systems are healthy (production-grade quality) and fast (for modelers to iterate upon).</p> <p>We are seeking a highly skilled and experienced Staff Software Engineer to join our ML Training Infrastructure team and lead the technical strategy. The ML Training Infrastructure team builds platforms and tools for large-scale training and inference, model lifecycle management, and deployment of models across Pinterest. ML workloads are increasingly large, complex, interdependent and the efficient use of ML accelerators is critical to our success. We work on various efforts related to adoption, efficiency, performance, algorithms, UX and core infrastructure to enable the scheduling of ML workloads.</p> <p>You’ll be part of the ML Platform team in Data Engineering, which aims to ensure healthy and fast ML in all of the 40+ ML use cases across Pinterest.</p> <p><strong>What you’ll do:</strong></p> <ul> <li>Implement cost effective and scalable solutions to allow ML engineers to scale their ML training and inference workloads on compute platforms like Kubernetes.</li> <li>Lead and contribute to key projects; rolling out GPU sharing via MIGs and MPS , intelligent resource management, capacity planning, fault tolerant training.</li> <li>Lead the technical strategy and set the multi-year roadmap for ML Training Infrastructure that includes ML Compute and ML Developer frameworks like PyTorch, Ray and Jupyter.</li> <li>Collaborate with internal clients, ML engineers, and data scientists to address their concerns regarding ML development velocity and enable the successful implementation of customer use cases.</li> <li>Forge strong partnerships with tech leaders in the Data and Infra organizations to develop a comprehensive technical roadmap that spans across multiple teams.</li> <li>Mentor engineers within the team and demonstrate technical leadership.</li> </ul> <p><strong>What we’re looking for:</strong></p> <ul> <li>7+ years of experience in software engineering and machine learning, with a focus on building and maintaining ML infrastructure or Batch Compute infrastructure like YARN/Kubernetes/Mesos.</li> <li>Technical leadership experience, devising multi-quarter technical strategies and driving them to success.</li> <li>Strong understanding of High Performance Computing and/or and parallel computing.</li> <li>Ability to drive cross-team projects; Ability to understand our internal customers (ML practitioners and Data Scientists), their common usage patterns and pain points.</li> <li>Strong experience in Python and/or experience with other programming languages such as C++ and Java.</li> <li>Experience with GPU programming, containerization, orchestration technologies is a plus.</li> <li>Bonus point for experience working with cloud data processing technologies (Apache Spark, Ray, Dask, Flink, etc.) and ML frameworks such as PyTorch.</li> </ul> <p>This position is not eligible for relocation assistance.</p> <p>#LI-REMOTE</p> <p><span data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;#LI-AH2&quot;}" data-sheets-userformat="{&quot;2&quot;:14464,&quot;10&quot;:2,&quot;14&quot;:{&quot;1&quot;:2,&quot;2&quot;:0},&quot;15&quot;:&quot;Helvetica Neue&quot;,&quot;16&quot;:12}">#LI-AH2</span></p><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.</p> <p><em><span style="font-weight: 400;">Information regarding the culture at Pinterest and benefits available for this position can be found <a href="https://www.pinterestcareers.com/pinterest-life/" target="_blank">here</a>.</span></em></p></div><div class="title">US based applicants only</div><div class="pay-range"><span>$135,150</span><span class="divider">&mdash;</span><span>$278,000 USD</span></div></div></div><div class="content-conclusion"><p><strong>Our Commitment to Diversity:</strong></p> <p>Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require an accommodation during the job application process, please notify&nbsp;<a href="mailto:accessibility@pinterest.com">accessibility@pinterest.com</a>&nbsp;for support.</p></div>
Distinguished Engineer, Frontend
San Francisco, CA, US; , US
<div class="content-intro"><p><strong>About Pinterest</strong><span style="font-weight: 400;">:&nbsp;&nbsp;</span></p> <p>Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love.&nbsp;In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping&nbsp;Pinners&nbsp;make their lives better in the positive corner of the internet.</p> <p>Creating a life you love also means finding a career that celebrates the unique perspectives and experiences that you bring. As you read through the expectations of the position, consider how your skills and experiences may complement the responsibilities of the role. We encourage you to think through your relevant and transferable skills from prior experiences.</p> <p><em>Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our </em><a href="https://www.pinterestcareers.com/pinflex/" target="_blank"><em><u>PinFlex</u></em></a><em> landing page to learn more.&nbsp;</em></p></div><p>As a Distinguished Engineer at Pinterest, you will play a pivotal role in shaping the technical direction of our platform, driving innovation, and providing leadership to our engineering teams. You'll be at the forefront of developing cutting-edge solutions that impact millions of users.</p> <p><strong>What you’ll do:</strong></p> <ul> <li>Advise executive leadership on highly complex, multi-faceted aspects of the business, with technological and cross-organizational impact.</li> <li>Serve as a technical mentor and role model for engineering teams, fostering a culture of excellence.</li> <li>Develop cutting-edge innovations with global impact on the business and anticipate future technological opportunities.</li> <li>Serve as strategist to translate ideas and innovations into outcomes, influencing and driving objectives across Pinterest.</li> <li>Embed systems and processes that develop and connect teams across Pinterest to harness the diversity of thought, experience, and backgrounds of Pinployees.</li> <li>Integrate velocity within Pinterest; mobilizing the organization by removing obstacles and enabling teams to focus on achieving results for the most important initiatives.</li> </ul> <p>&nbsp;<strong>What we’re looking for:</strong>:</p> <ul> <li>Proven experience as a distinguished engineer, fellow, or similar role in a technology company.</li> <li>Recognized as a pioneer and renowned technical authority within the industry, often globally, requiring comprehensive expertise in leading-edge theories and technologies.</li> <li>Deep technical expertise and thought leadership that helps accelerate adoption of the very best engineering practices, while maintaining knowledge on industry innovations, trends and practices.</li> <li>Ability to effectively communicate with and influence key stakeholders across the company, at all levels of the organization.</li> <li>Experience partnering with cross-functional project teams on initiatives with significant global impact.</li> <li>Outstanding problem-solving and analytical skills.</li> </ul> <p>&nbsp;</p> <p>This position is not eligible for relocation assistance.</p> <p>&nbsp;</p> <p>#LI-REMOTE</p> <p>#LI-NB1</p><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>At Pinterest we believe the workplace should be equitable, inclusive, and inspiring for every employee. In an effort to provide greater transparency, we are sharing the base salary range for this position. The position is also eligible for equity. Final salary is based on a number of factors including location, travel, relevant prior experience, or particular skills and expertise.</p> <p><em><span style="font-weight: 400;">Information regarding the culture at Pinterest and benefits available for this position can be found <a href="https://www.pinterestcareers.com/pinterest-life/" target="_blank">here</a>.</span></em></p></div><div class="title">US based applicants only</div><div class="pay-range"><span>$242,029</span><span class="divider">&mdash;</span><span>$498,321 USD</span></div></div></div><div class="content-conclusion"><p><strong>Our Commitment to Diversity:</strong></p> <p>Pinterest is an equal opportunity employer and makes employment decisions on the basis of merit. We want to have the best qualified people in every job. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic under federal, state, or local law. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require an accommodation during the job application process, please notify&nbsp;<a href="mailto:accessibility@pinterest.com">accessibility@pinterest.com</a>&nbsp;for support.</p></div>
You may also like