Improving Distributed Caching Performance and Efficiency at Pinterest

348
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

Kevin Lin | Software Engineer, Storage and Caching


Introduction

Pinterest’s distributed caching system, built on top of open source technologies memcached and mcrouter, is a critical component of the production infrastructure stack. Pinterest’s cache-as-a-service platform is responsible for driving down application latency across the board, reducing the overall cloud cost footprint, and ensuring adherence to strict sitewide availability targets.

Today, Pinterest’s memcached fleet spans over 5000 EC2 instances across a variety of instance types optimized along compute, memory, and storage dimensions. Collectively, the fleet serves up to ~180 million requests per second and ~220 GB/s of network throughput over a ~460 TB active in-memory and on-disk dataset, partitioned among ~70 distinct clusters.

As a core driver of reduced sitewide latency, the distributed caching tier is subject to stringent performance and latency requirements. Additionally, a key consequence of the sheer size of the fleet is that even small efficiency optimizations have an outsized impact on the total service cost footprint. Several years of operational experience running memcached at scale in production have provided unique insight into practical optimizations for driving improved performance and efficiency across the entire caching stack.

In this article, we will share some context on the observability and performance testing tools that enable optimization exploration work, followed by a deep dive into practical optimizations currently running in our production environment along dimensions of hardware selection strategy, compute efficiency, and networking performance.

High-level description of the available surface area for performance optimization for memcached running on virtual machines in public cloud environments

Monitoring, observability, and evaluation

All performance optimization efforts start with precise quantitative measurement and a structured, reproducible mechanism for generating workloads for isolated evaluation.

Critical monitoring prerequisites for all the performance evaluation conducted over the years include:

  • Server-side metrics for request throughput, network throughput, resource utilization, and hardware-level parameters (NIC statistics like per-queue packet throughput and EC2 allowance exhaustion, disk response times and in-flight I/O requests, etc.)
  • Client-side metrics for cache request percentile latency, timeout and error rates, and per-server availability (SLIs), as well as top-level application performance indicators like service RPC P99 response time

Pinterest leverages both synthetic load generation and production shadow traffic to evaluate the impact of tuning and optimizations. Historically, synthetic benchmarking has been useful for detecting performance regressions or improvements under maximum load, while shadow traffic evaluation has been more reflective of server resource utilization and overall performance under a real workload at scale.

  • Synthetic load generation: memtier_benchmark is an open source tool capable of generating a synthetic load against a memcached cluster with configurable parameters for the number of concurrent clients and threads, read/write ratio, data sizes, and transport mechanism (plaintext or TLS).
  • Production shadow traffic: mcrouter is an open source memcache-protocol routing proxy deployed as a client-side sidecar in the Pinterest fleet. It provides building blocks to design transparent shadow traffic routing policies with configurable traffic percentages and source/target cluster(s), allowing for flexible dark traffic experimentation across a variety of workload classes.

Together, these tools permit high-signal performance evaluation with zero or minimal impact to critical-path production traffic.

Performance and efficiency

Cloud hardware

Distributed caching at Pinterest serves a diverse array of workloads. In general, each class of workload can be categorized along the following high-level dimensions:

  • Throughput (compute)
  • Data volume (memory and/or disk capacity)
  • Data bandwidth (network and compute)
  • Latency requirement (compute)

While memcached can be arbitrarily horizontally scaled in and out to address a particular cluster’s bottleneck, vertically scaling individual hardware dimensions allows for greater cost efficiency for specific workloads. In practice, this entails standardization on a fixed pool of EC2 instance types optimized for each workload class.

Workload profile: Moderate throughput, moderate data volume

EC2 instance family: r5

Rationale: r5 family instances offer a vCPU-DRAM ratio that works well for most vanilla cache use cases at Pinterest. This instance type is considered the “baseline” against which others are evaluated.

Workload profile: High throughput, low data volume

EC2 instance family: c5

Rationale: c5 family instances are more cost efficient for use cases that would otherwise slot into the r5 type but hold significantly less memory. Maintaining the same vCPU count as its r5 counterpart allows it to serve the same throughput volume at a lower overall cost.

Workload profile: High data volume, relaxed latency requirement

EC2 instance family: r5d

Rationale: r5d family instances are functionally equivalent to r5 family instances but with an instance-colocated NVMe SSD used by extstore for secondary storage. r5d is cost efficient for clusters with high data volume such that there are tangible improvements to hit rate as data is written to disk. Due to the slower disk (relative to i3 family instances), higher tail latency is expected.

Workload profile: Massive data volume, relaxed latency requirement

EC2 instance family: i3 and i3en

Rationale: i3 and i3en family instances ship with a fast and sizable instance-colocated disk, which tangibly increases extstore performance for workloads with a very high ratio of working data on disk relative to DRAM. Additionally, they offer comparable memory capacity to r5 series instances, which reduces extstore thrashing by maintaining a reasonable DRAM to disk usage ratio.

EC2 instance type distribution for the Pinterest memcached fleet

In particular, using extstore to expand storage capacity beyond DRAM into a local NVMe flash disk tier increases per-instance storage efficiency by up to several orders of magnitude, and it reduces the associated cluster cost footprint proportionally. EC2’s storage-optimized instance types provide locally attached solid state drives capable of high random IOPS and R/W throughput, allowing onboarding of extstore use cases with massive data volumes and high request throughput without compromising tail latency.

The introduction of different shapes of storage-optimized EC2 instance types to the fleet (in particular, the lower-tier variants of the i3en instance family containing multiple independent disks per instance) further drives down costs while offering improvements in I/O performance and cost efficiency. Pinterest configures these instances with Linux software RAID at level RAID0 to combine multiple hardware block devices into a single, logical disk for userspace consumption. By striping reads and writes fairly across two disks, RAID0 doubles maximum theoretical I/O throughput with a best-case two-fold reduction in effective disk response time at the cost of a doubled MTTF. This increased hardware performance for extstore at the expense of an increased theoretical failure rate is a highly worthwhile tradeoff. Operating workloads on a public cloud necessitates designing infrastructure to be ephemeral cattle, capable of self-healing in the event of instance failures. A topology control plane for mcrouter automatically and gracefully responds to unexpected changes in server capacity; instance loss is a non-issue.

Compute

Approximately half of all caching workloads at Pinterest are compute-bound (i.e. purely request throughput-bound). Successful optimizations in compute efficiency translate into the ability to downsize clusters without compromising serving capacity.

More precisely, compute efficiency for memcached is defined as the additional rate of requests that can be serviced by a single instance for each percentage point increase in instance CPU usage, without increasing request latency. In simpler terms, an optimization that improves compute efficiency is one that allows memcached to serve a higher request rate at lower CPU usage, without changing request latency characteristics.

At Pinterest, most workloads (including the distributed cache fleet) run on dedicated EC2 virtual machines. Many historical efficiency improvements stem from optimizations in the hardware layer itself, like migrating to different instance families or upgrading to newer generations of existing instance types. However, operating workloads on dedicated (virtualized) machines offers unique opportunities for optimizations at the hardware-software boundary.

Memcached is somewhat unique among stateful data systems at Pinterest in that it is the exclusive primary workload, with a static set of long-lived worker threads, on every EC2 instance on which it is deployed. This is in contrast to database workloads which might have, for example, multiple colocated processes for decoupled storage and serving layers. To this end, one simple but highly effective optimization is tuning process scheduling in order to request the kernel prioritize CPU time for memcached at the expense of deliberately withholding CPU time from other processes on the host, like monitoring daemons. This involves running memcached under a real-time scheduling policy, SCHED_FIFO, with a high priority — instructing the kernel to, effectively, allow memcached to monopolize the CPU by preempting (and thus deliberately starving) all non-realtime processes whenever a memcached thread becomes runnable.

$ sudo chrt — — fifo <priority> memcached …

Example invocation of memcached under a SCHED_FIFO real-time scheduling policy

This one-line change, after rollout to all compute-bound clusters, drove client-side P99 latency down by anywhere between 10% and 40%, in addition to eliminating spurious spikes in P99 and P999 latency across the board. Additionally, it afforded the ability to raise the steady-state operation CPU usage ceiling by 20% without introducing latency regressions. Ultimately, this shaved close to 10% off memcached’s total fleet-wide cost footprint.

Week-over-week comparison of client-side P99 cache latency for one service while real-time scheduling was rolled out to its corresponding dedicated memcached cluster

Ratio of time spent by memcached waiting for execution by the kernel versus wall clock time, before and after real-time scheduling was enabled (data is collected from schedstat in the /proc filesystem)

Stabilization of spurious latency spikes after real-time scheduling was enabled on a canary host (red-colored series)

Networking

There are a few key dimensions when considering networking performance:

  • Data bandwidth, packet throughput, and TCP connection limits. EC2 imposes hard limits on per-instance PPS, aggregate bandwidth, and TCP connections (though only when deployed in a security group with TCP ingress rules). Excess usage beyond these limits is reported by the Elastic Network Adapter (ENA) driver and accessible via ethtool. Confusingly, EC2 also expresses total NIC bandwidth capabilities in terms of burst loads rather than steady-state loads, thus requiring some degree of trial-and-error to determine the practical bandwidth ceiling for workloads like memcached with predictable network characteristics.
  • Connection latency and reliability. Is there a way to make initial TCP connections to memcached faster and more reliable, especially under burst scenarios where thousands of clients are simultaneously establishing connections?
  • Overhead associated with transport-layer features like TLS. Is there a way to reduce the encryption/decryption compute overhead of TLS? Additionally, is there a way to reduce the cost of the initial setup cost (i.e. TLS handshake)?

From a cloud consumer’s perspective, EC2-enforced network limits can and should effectively be considered inherent hardware limitations. Unfortunately, there is no mechanism to work around these limits other than horizontally scaling the fleet to reduce per-instance usage.

In Pinterest’s caching architecture, mcrouter is a universal routing proxy and the single application-facing entry point into the distributed caching tier. Each mcrouter instance (effectively, every individual host in a service cluster) creates a statically sized, long-lived TCP connection pool to every individual memcached server in a cluster. Connection pool sizes are deterministically derived from the number of logical cores available on the host system, typically ranging from 8 to 72 for canonical instance types. This results in upwards of tens of thousands of active established TCP connections per server host, and easily over a million total connections per server cluster — necessitating a strategy for maintaining minimal connection latency and connection reliability at scale.

TCP Fast Open (TFO) is a mechanism for reducing the latency overhead of establishing a TCP connection by optimizing away one RTT in an otherwise costly TCP 3WHS (3-way handshake), while also allowing eager transmission of early data during the handshake itself. While originally intended for end users on unreliable home and mobile networks connecting to remote edge servers, TFO has demonstrated tangible improvements in connection reliability in a closed cloud environment as well. Implementing TFO support in memcached reduced average TCP connection durations of successive sessions by ~10%, most prominently in connections established across an Availability Zone boundary.

Packets exchanged between client and server during TFO cookie setup and a subsequent TFO-initiated session with early data

Separately, raising the sysctl parameter value for net.core.somaxconn and associated listen backlog size in the userspace listen(2) callsite in memcached improved burst connection availability for high-throughput clusters. Previously, deploying a new memcached binary would cause spikes in ECONNREFUSED server errors caused by exhausted server-side TCP accept queues driven by thundering herds of simultaneous inbound connections from thousands of client mcrouter instances. A more generous listen backlog threshold reduced per-server downtime and fixed the brief but frequent SLO violations whenever a shared tenancy cluster was deployed.

Lastly, TLS plays an important role for in-transit data encryption between memcached and mcrouter, and it is enabled for 100% of cache traffic within Pinterest in order to comply with sitewide authentication, authorization, and auditing policies. Even with hardware-accelerated cryptography, TLS adds non-trivial initial and steady-state overhead, due to a post-connect TLS handshake and application-layer encryption/decryption during network I/O, respectively. TLS session resumption, after implementation in memcached, reduced fleet-wide client-side connection timeout rates by allowing reuse of previously cached TLS sessions. One avenue for tackling steady-state overhead is kernel TLS (kTLS) — a mechanism to offload the TLS record layer from userspace to the kernel, implemented either in software or offloaded to supported dedicated NIC hardware for completely transparent inline data encryption/decryption. TLS session resumption was upstreamed by Pinterest to memcached and is available in version 1.6.3 onward; kTLS is an ongoing and relatively experimental optimization area.

Future work

Infrastructure optimization is a critical objective for Pinterest that ultimately drives a more delightful experience for Pinners while reducing our own cloud cost footprint. We look forward to continuing to explore avenues for improving cache performance and efficiency at all layers of the stack, from application clients and routing proxies to the servers themselves. In the near term, we intend to continue evaluation of software kernel TLS, explore compatibility of memcached with newer generations of EC2 instance types for improved price-to-performance characteristics, and application/proxy-side software optimizations like in-flight compression for improved storage efficiency. We hope to additionally build an end-to-end automated performance regression testing framework to track the impact of these optimizations over time.

Thanks to the entire Storage and Caching team at Pinterest for supporting this work, especially Ankita Girish Wagh and Lianghong Xu.

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Site Reliability Engineer, Infrastruc...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

The Site Reliability Engineering organization at Pinterest is accountable for ensuring overall Pinterest availability as well as enhancing Engineering teams’ capability to design, build and operate robust systems at scale

Pinterest’s applications and infrastructure that handle billions of monthly page views and petabytes of data as Pinterest continues to grow and scale. As a Pinterest SRE, you will design and build systems, platforms, tools, frameworks and methodologies to assure the reliability of our large-scale distributed systems.

What You’ll Do:

  • Develop software solutions to enable relailbity and operability of large scale distributed systems handling petabytes of data and serving 
  • Build a deep understanding of how Pinterst’s systems behave, scale, interact and fail, and use that insight to identity risks and opportunities for remediation
  • Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
  • Build meaningful, insightful and actionable SLIs
  • Automate critical portions of Pinterest’s engineering processes, to minimize risk and maximize the speed of innovation
  • Manage capacity and performance to help scale our infrastructure both on public and private clouds around the world

What We’re Looking For:

  • Strong knowledge of Linux/Unix/BSD internals and experience working with open source software (e.g. MySQL, Hadoop, Envoy, HAProxy, Nginx)
  • Experience with technologies such as ElasticSearch, ZooKeeper, HBase, Hadoop, Memcache and Kafka with a focus on reliability, automation, operability and performance
  • 2+ years of experience with programming languages (Python, Golang, Ruby, etc.)
  • Infrastructure as code a plus (e.g. Terraform, Puppet, Chef, Ansible, Salt, Fabric, Docker, etc)
  • Bonus points if experienced with deploying web apps to cloud infrastructure (AWS, etc.) and working with distributed, service-oriented architecture 

#LI-SG1

iOS Engineer, Metrics Quality
Toronto, ON, CA

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest is a metric-driven company, not only do we use metrics to make our decisions, but our state of the art machine learning depends on accurate metrics.  This role is dedicated to making sure that the metrics on the iOS client are accurate.  We are building the core of this team in Toronto, where you'll be working with San Francisco teammates on larger projects. 

What you’ll do

  • Create tooling and education allowing for client engineers to instrument metric logging effectively.
  • Work with data engineers to create, manage, and investigate early metric anomaly alerts.
  • Certify company critical metrics are accurate and ensure they will remain accurate with a variety of testing.
  • Creating new accurate metrics to allow a deeper understanding of the user’s behavior and needs.

What we’re looking for:

  • 4+ years of experience working in iOS development.
  • Strong experience in writing reliable, maintainable, and impactful code that will be used by many other engineers.
  • Enthusiasm for collaborating with a diverse array of partners, including product managers and engineers across a variety of platforms and teams, data engineers, release services, and analysts.
  • Enthusiasm for “behind the scenes” work that is a high priority for Pinterest, but will not be seen by users.
  • Strong communication skills and confidence working with developers in a different time zone (this team will frequently work with SF engineers).
  • Experience with relational databases and query languages.
  • Nice to have: Experience maintaining and improving UI tests.
  • Nice to have: Experience building ways to measure user behavior on iOS applications.
  • Nice to have: Experience evaluating user metrics to understand app usage.
  • Nice to have: Experience working with developers as customers to address the needs of engineers.

#LI-AG1

Fullstack Engineer, Growth Platform
Dublin, IE

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

We are looking for founding members of our Growth Engineering team in Dublin. You’ll be a pivotal part of the global Growth organization and its mission, while helping to build the inaugural branch of the Growth organization in Dublin. Specifically, you’ll join the Growth Platform team as a full stack engineer whose mission is to accelerate global growth by building tools, infrastructure, and platforms to improve user experience (such as tools to automate, personalize and simplify the building of user experiences across the world). You’ll get to work day-to-day with engineering and cross-functional teams while expanding your technical expertise on all sides of the projects. 

What you’ll do:

  • Impact & Mission Join a small, but rapidly-growing engineering group in Pinterest's largest international office to establish infrastructure and platforms to achieve and foster global growth
  • Pinterest Global Growth Build Pinterest growth, experience and marketing tools that increase our product relevance and understanding around the world
  • Feature & Platform Development Work across all parts of the stack and be flexible working with different technologies
    • Improve the experience framework backend to handle more types of use cases, while serving hundreds of millions of requests daily
    • Build new open-ended self-serve experiences that affect end-users and can be easily configured and deployed by marketing partners
    • Scale the usefulness of the experience framework tool by making it easier to use and understand by different types of users
    • Determine the forward direction for our Pinner Communications Platform which helps us target and reach out to creators, advertisers and pinners with personalized messaging
  • Collaboration 
    • Work in dynamic and diverse environments alongside engineering, product, international experts, marketing and growth teams
  • Evangelization across engineering to implement and adopt best practices to simplify our codebase and promote growth

What we’re looking for:

  • 3+ years of full stack development building successful products and/or systems.
  • Experience/familiarity with Javascript, Python, React, mysql, bash/scripting or similar
  • An interest in building tools for technical and non-technical users that improve the daily lives and effectiveness of these individuals
  • Self-driven and openness to learn quickly, explore, flexibility without being afraid to dig deep
iOS Engineer, Shopping Product
Toronto, ON, CA

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Shopping is at the core of Pinterest’s mission to help people create a life they love. The shopping product team at Pinterest is inventing a brand new, more visual and personalized shopping experience for 350M+ users worldwide. The team is responsible for inspiring Pinners to shop, helping them find the best product and providing seamless checkout experience. 

You’ll be responsible for building an iOS application that enables Pinners to create the life they love with product discovery and decision experiences that guide from inspiration to purchase. Working closely with the design team, you’ll build beautiful iOS shopping applications for phones and tablets.

What you'll do:

  • Build features to power the future of Shopping on Pinterest.
  • Contribute to and lead each step of the product development process, from ideation to implementation to release; from rapidly prototyping, running A/B tests, to architecting and building solutions that can scale to support millions of users.
  • Work with cross functional peers (PM, Design) to define the product roadmap
  • Analyze and visualize data to drive product insights and to inform our decisions.
  • Contribute best-in-class programming skills to develop highly innovative consumer-facing mobile products

What we're looking for:

  • 2+ years of industry iOS application development experience 
  • Experience in building consumer facing products on iOS platforms for a rapidly iterating product
  • Holistic knowledge and passion for the iOS platform
  • Strong command of data that could help improve the user experience
  • Strong communication skills and great product intuition

 

#LI-AG1

Verified by
Security Software Engineer
Tech Lead, Big Data Platform
Software Engineer
Talent Brand Manager
Sourcer
Software Engineer
You may also like