Scaling Kubernetes with Assurance at Pinterest

By Anson Qian | Software Engineer, Cloud Runtime


Introduction

It has been more than a year since we shared our Kubernetes Journey at Pinterest. Since then, we have delivered many features to facilitate customer adoption, ensure reliability and scalability, and build up operational experience and best practices.

In general, Kubernetes platform users gave positive feedback. Based on our user survey, the top three benefits shared by our users are reducing the burden of managing compute resources, better resource and failure isolation, and more flexible capacity management.

By the end of 2020, we orchestrated 35K+ pods on 2500+ nodes in our Kubernetes clusters — supporting a wide range of Pinterest businesses — and organic growth is still accelerating.

2020 in a Short Story

As user adoption grows, the variety and number of workloads increase. This requires the Kubernetes platform to be more scalable in order to keep up with the increasing load from workload management, pod scheduling and placement, and node allocation and deallocation. As more business-critical workloads onboard onto the Kubernetes platform, the expectations on platform reliability naturally rise to a new level.

A platform-wide outage did happen. In early 2020, one of our clusters experienced a sudden spike in pod creations (~3x above planned capacity), causing the cluster autoscaler to bring up 900 nodes to accommodate the demand. The kube-apiserver first experienced latency spikes and an increased error rate, and was then Out of Memory (OOM) killed due to its resource limit. Unbounded retries from kubelets resulted in a 7x jump in kube-apiserver load. The burst of writes caused etcd to reach its total data size limit and start rejecting all write requests, and the platform lost availability in terms of workload management. To mitigate the incident, we had to perform etcd operations such as compacting old revisions, defragmenting excess space, and disabling alarms to recover it. In addition, we had to temporarily scale up the Kubernetes master nodes that host kube-apiserver and etcd to relieve the resource constraints.

Figure 1: Kubernetes API Server Latency Spikes

Later in 2020, one of the infra components had a bug in its kube-apiserver integration that generated a spike of expensive queries (listing all pods and nodes) against kube-apiserver. This caused Kubernetes master node resource usage to spike, and kube-apiserver entered an OOMKilled state. Luckily, the problematic component was discovered and rolled back shortly afterwards, but during the incident the platform suffered from performance degradation, including delayed workload execution and stale status serving.

Figure 2: Kubernetes API Server OOMKilled

Getting Ready for Scale

We continue to reflect on our platform governance, resilience, and operability throughout our journey, especially when incidents happen and hit hard on our weakest spots. With a nimble team of limited engineering resources, we had to dig deep to find root causes, identify low-hanging fruit, and prioritize solutions based on return vs. cost. Our strategy for dealing with the complex Kubernetes ecosystem is to minimize divergence from what the community provides and to contribute back to the community, but never to rule out the option of writing our own in-house components.

Figure 3: Pinterest Kubernetes Platform Architecture (blue is in-house, green is open source)

Governance

Resource Quota Enforcement

Kubernetes already provides resource quota management to ensure no namespace can request or occupy unbounded resources in most dimensions: pods, CPU, memory, etc. As the incident above shows, a surge of pod creation in a single namespace can overload kube-apiserver and cause cascading failures. Keeping resource usage bounded in every namespace is key to ensuring stability.

One challenge we faced is that enforcing resource quotas in every namespace implicitly requires all pods and containers to have resource requests and limits specified. In the Pinterest Kubernetes platform, workloads in different namespaces are owned by different teams for different projects, and platform users configure their workloads via a Pinterest CRD. We achieved this by adding default resource requests and limits for all pods and containers in the CRD transformation layer. In addition, we reject any pod specification without resource requests and limits in the CRD validation layer.
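
As an illustration, here is a minimal sketch of this kind of defaulting using client-go types; the function name and default values are hypothetical, not our actual CRD transformation code:

```go
package mutation

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyDefaultResources is a hypothetical example of the kind of defaulting a
// CRD transformation layer can perform: any container missing CPU or memory
// requests/limits receives conservative defaults, so that namespace-level
// ResourceQuota can be enforced on every pod.
func applyDefaultResources(spec *corev1.PodSpec) {
	defaults := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	}
	for i := range spec.Containers {
		c := &spec.Containers[i]
		if c.Resources.Requests == nil {
			c.Resources.Requests = corev1.ResourceList{}
		}
		if c.Resources.Limits == nil {
			c.Resources.Limits = corev1.ResourceList{}
		}
		for name, qty := range defaults {
			if _, ok := c.Resources.Requests[name]; !ok {
				c.Resources.Requests[name] = qty
			}
			if _, ok := c.Resources.Limits[name]; !ok {
				c.Resources.Limits[name] = qty
			}
		}
	}
}
```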

Another challenge we overcame was streamlining quota management across teams and organizations. To safely enable resource quota enforcement, we look at historical resource usage, add 20% headroom on top of the peak value, and set that as the initial resource quota for every project. We created a cron job to monitor quota usage and send business-hours alerts to project-owning teams when their usage approaches the limit. This encourages project owners to do better capacity planning and request a resource quota change. Each resource quota change is manually reviewed and automatically deployed after sign-off.
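
For example, a hedged sketch of deriving the initial quota with the 20% headroom rule as a client-go ResourceQuota object (the function name, quota name, and inputs are illustrative, and only requests.cpu and requests.memory are shown):

```go
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// initialQuota builds a ResourceQuota from observed peak usage plus 20%
// headroom as the starting point for a project. The peak values would come
// from historical usage data.
func initialQuota(namespace string, peakCPUMilli, peakMemoryBytes int64) *corev1.ResourceQuota {
	withHeadroom := func(v int64) int64 { return v + v/5 } // peak + 20%
	return &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "project-quota", Namespace: namespace},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceRequestsCPU:    *resource.NewMilliQuantity(withHeadroom(peakCPUMilli), resource.DecimalSI),
				corev1.ResourceRequestsMemory: *resource.NewQuantity(withHeadroom(peakMemoryBytes), resource.BinarySI),
			},
		},
	}
}
```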

Client Access Enforcement

We require all KubeAPI clients to follow the best practices Kubernetes already provides:

Controller Framework

The controller framework provides a shareable cache for optimizing read operations, built on the informer-reflector-cache architecture. Informers are set up to list and watch objects of interest from the kube-apiserver. The reflector propagates object changes to the underlying cache and dispatches watched events to event handlers. Multiple components inside the same controller can register event handlers for OnAdd, OnUpdate, and OnDelete events from informers and fetch objects from the cache instead of from the kube-apiserver directly. This reduces the chance of making unnecessary and redundant calls.

Figure 4: Kubernetes Controller Framework
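
Below is a minimal sketch of this pattern using client-go's shared informer factory; the kubeconfig path, resync period, and namespace are placeholders:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a kubeconfig (path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// One shared informer lists and watches pods; every component then reads
	// from the resulting local cache instead of calling the kube-apiserver.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods()

	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("added:", obj.(*corev1.Pod).Name) },
		UpdateFunc: func(_, obj interface{}) { fmt.Println("updated:", obj.(*corev1.Pod).Name) },
		DeleteFunc: func(obj interface{}) { fmt.Println("deleted a pod") },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.Informer().HasSynced)

	// Reads are served by the lister from the cache, not by the kube-apiserver.
	pods, _ := podInformer.Lister().Pods("default").List(labels.Everything())
	fmt.Println("pods in default namespace:", len(pods))
}
```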

Rate Limiting

Kubernetes API clients are usually shared among different controllers, and API calls are made from different threads. Kubernetes ships its API client with a token bucket rate limiter that supports configurable QPS and burst values. API calls that burst beyond the threshold are throttled so that a single controller cannot saturate the kube-apiserver bandwidth.
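
A small sketch of configuring that client-side token bucket via client-go; the QPS and burst values are illustrative, not our production settings:

```go
package ratelimit

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// newRateLimitedClient shows how the token bucket rate limiter shipped with
// the Kubernetes API client is configured. The values below are examples.
func newRateLimitedClient(config *rest.Config) (*kubernetes.Clientset, error) {
	config.QPS = 20   // steady-state requests per second
	config.Burst = 40 // short bursts above QPS allowed by the token bucket
	// Equivalently, the rate limiter can be set explicitly:
	config.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(20, 40)
	return kubernetes.NewForConfig(config)
}
```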

Shared Cache

In addition to the built-in cache that comes with the controller framework, we added another informer-based write-through cache layer in the platform API. This prevents unnecessary read calls from hitting the kube-apiserver directly. Reusing the cache on the server side also avoids the need for thick clients in application code.

For kube-apiserver access from applications, we require all requests to go through the platform API, where we leverage the shared cache and assign security identities for access control and flow control. For kube-apiserver access from workload controllers, we require all controllers to be implemented on top of the controller framework with rate limiting.
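
As a rough illustration (our platform API is in-house, so the types below are hypothetical), a write-through store might send writes to the kube-apiserver while serving reads from an informer-backed lister:

```go
package platformapi

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	listers "k8s.io/client-go/listers/core/v1"
)

// PodStore sketches the write-through idea: writes go to the kube-apiserver,
// while reads are served from an informer-backed lister so that application
// traffic never hits the kube-apiserver directly.
type PodStore struct {
	client kubernetes.Interface
	lister listers.PodLister // backed by a shared informer cache
}

// Create writes through to the kube-apiserver; the informer then observes the
// resulting watch event and refreshes the local cache.
func (s *PodStore) Create(ctx context.Context, pod *corev1.Pod) (*corev1.Pod, error) {
	return s.client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
}

// Get serves reads from the local cache instead of the kube-apiserver.
func (s *PodStore) Get(namespace, name string) (*corev1.Pod, error) {
	return s.lister.Pods(namespace).Get(name)
}
```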

Resilience

Hardening Kubelet

One key reason why the Kubernetes control plane entered cascading failure is that the legacy reflector implementation retried without bound when handling errors. Such imperfections are amplified when the API server is OOMKilled, which can easily cause reflectors across the cluster to retry in synchronized waves.

To resolve this issue, we worked closely with the community by reporting issues, discussing solutions, and finally getting PRs (1, 2) reviewed and merged. The idea is to add exponential backoff with jitter to the reflector's ListWatch retry logic, so the kubelet and other controllers will not hammer the kube-apiserver when it is overloaded and requests fail. This resilience improvement is useful in general, but we found it critical on the kubelet side as the number of nodes and pods in a Kubernetes cluster increases.
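
For illustration, here is a sketch of the general backoff-with-jitter idea using apimachinery's wait helpers; this is not the merged upstream patch, and the backoff parameters are arbitrary:

```go
package retry

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// retryListWatch retries a failing ListWatch-style call with exponential
// backoff plus jitter, so many kubelets and controllers do not retry in
// lockstep against an already overloaded kube-apiserver.
func retryListWatch(listAndWatch func() error) error {
	backoff := wait.Backoff{
		Duration: 800 * time.Millisecond, // initial delay
		Factor:   2.0,                    // exponential growth
		Jitter:   1.0,                    // up to 100% random jitter per step
		Steps:    8,                      // give up after this many attempts
		Cap:      30 * time.Second,       // upper bound on the delay
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := listAndWatch(); err != nil {
			fmt.Println("list/watch failed, backing off:", err)
			return false, nil // not done yet; retry after a jittered delay
		}
		return true, nil // success, stop retrying
	})
}
```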

Tuning Concurrent Requests

The more nodes we manage, the faster workloads are created and destroyed, and the larger the API call QPS the server needs to handle. We first increased the maximum concurrent API call settings for both mutating and non-mutating operations based on estimated workloads. These two settings enforce that the number of API calls processed concurrently does not exceed the configured limit, and therefore keep the CPU and memory consumption of kube-apiserver within bounds.

Inside Kubernetes' chain of API request handling, every request passes through a group of filters as the very first step. The filter chain is where the max-inflight limits are enforced. For API call bursts above the configured threshold, a "too many requests" (429) response is returned to clients to trigger proper retries. As future work, we plan to investigate the EventRateLimit feature further for more fine-grained admission control and better quality of service.
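
To illustrate the client side of this contract, here is a sketch in plain net/http of honoring 429 responses and the Retry-After header; it assumes a body-less request such as a list call, and client-go performs similar handling internally:

```go
package apiclient

import (
	"net/http"
	"strconv"
	"time"
)

// doWithRetry respects 429 throttling: it waits for the server-suggested
// Retry-After duration before retrying instead of hammering the endpoint.
func doWithRetry(client *http.Client, req *http.Request, maxAttempts int) (*http.Response, error) {
	for attempt := 1; ; attempt++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxAttempts {
			return resp, nil
		}
		resp.Body.Close()
		delay := time.Second
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				delay = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(delay) // back off before the next attempt
	}
}
```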

Caching More Histories

The watch cache is a mechanism inside kube-apiserver that caches past events of each resource type in a ring buffer, in order to serve watch calls starting from a particular resource version on a best-effort basis. The larger the caches, the more events can be retained in the server and the more likely it is that event streams can be served to clients seamlessly after broken connections. Given this, we also increased the target RAM size of kube-apiserver, which is internally translated into watch cache capacity based on heuristics, to serve more robust event streams. Kube-apiserver also provides ways to configure fine-grained, per-resource watch cache sizes, which can be further leveraged for specific caching requirements.

Figure 5: Kubernetes Watch Cache
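
From a client's perspective, watch cache capacity determines whether a watch can resume from an older resource version or must fall back to a full relist. Here is a hedged client-go sketch (error handling simplified; the "too old" condition may also arrive as an error event on the watch stream):

```go
package watcher

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchPodsFrom shows why watch cache capacity matters to clients: a watch
// started from an older resourceVersion can only be served if that version is
// still in the kube-apiserver ring buffer. Otherwise the server reports that
// the resource version is too old and the client must fall back to a relist.
func watchPodsFrom(ctx context.Context, client kubernetes.Interface, resourceVersion string) error {
	w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{ResourceVersion: resourceVersion})
	if err != nil {
		if apierrors.IsGone(err) || apierrors.IsResourceExpired(err) {
			return fmt.Errorf("resourceVersion %s no longer in watch cache, relist required: %w", resourceVersion, err)
		}
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Println("event:", event.Type)
	}
	return nil
}
```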

Operability

Observability

Aiming to reduce incident detection and mitigation time, we continuously devote effort to improving the observability of the Kubernetes control plane. The challenge is to balance failure coverage and signal sensitivity. For existing Kubernetes metrics, we triage and pick the important ones to monitor and/or alert on so we can identify issues more proactively. In addition, we instrument kube-apiserver to cover more detailed areas in order to quickly narrow down root causes. Finally, we tune alert statistics and thresholds to reduce noise and false alarms.

At a high level, we monitor kube-apiserver load by looking at QPS and concurrent requests, error rate, and request latency. We can break down traffic by resource type, request verb, and associated service account. For expensive traffic like listing, we also measure request payload by object count and byte size, since such calls can easily overload kube-apiserver even at low QPS. Lastly, we monitor etcd watch event processing QPS and the delayed processing count as important server performance indicators.

Figure 6: Kubernetes API calls by type

Debuggability

To better understand Kubernetes control plane performance and resource consumption, we also built an etcd data storage analysis tool using the boltdb library, with flamegraphs to visualize the data storage breakdown. The results of the data storage analysis provide insights that help platform users optimize their usage.

Figure 7: Etcd Data Usage Per Key Space
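
A simplified sketch of this kind of analysis with the bbolt library is shown below; it is not our internal tool, the database path is a placeholder, and it attributes bytes to /registry/<type> key spaces with a rough substring heuristic rather than decoding etcd's protobuf values:

```go
package main

import (
	"bytes"
	"fmt"

	bolt "go.etcd.io/bbolt"
)

// Walk every bucket in a copy of etcd's bbolt data file and sum stored value
// sizes per Kubernetes key space (e.g. /registry/pods, /registry/events).
func main() {
	db, err := bolt.Open("/path/to/etcd/member/snap/db", 0o400, &bolt.Options{ReadOnly: true})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	sizes := map[string]int{}
	err = db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(bucket []byte, b *bolt.Bucket) error {
			return b.ForEach(func(k, v []byte) error {
				space := "other"
				// Rough heuristic: look for an embedded /registry/<type> key.
				if i := bytes.Index(v, []byte("/registry/")); i >= 0 {
					rest := v[i:]
					if j := bytes.IndexByte(rest[len("/registry/"):], '/'); j >= 0 {
						space = string(rest[:len("/registry/")+j])
					}
				}
				sizes[space] += len(v)
				return nil
			})
		})
	})
	if err != nil {
		panic(err)
	}
	for space, total := range sizes {
		fmt.Printf("%-40s %d bytes\n", space, total)
	}
}
```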

In addition, we enabled golang profiling (pprof) and visualized the heap memory footprint. We were able to quickly identify the most resource-intensive code paths and request patterns, e.g., transforming response objects for list resource calls. Another big caveat we found as part of the kube-apiserver OOM investigation is that page cache used by kube-apiserver is counted towards the cgroup's memory limit, and anonymous memory usage can crowd out page cache in the same cgroup. So even if kube-apiserver has only 20GB of heap memory usage, the entire cgroup can see 200GB of memory usage and hit the limit. While the current kernel default is not to proactively reclaim assigned pages for efficient reuse, we are looking at setting up monitoring based on the memory.stat file and forcing the cgroup to reclaim as many pages as possible when memory usage approaches the limit.

Figure 8: Kubernetes API Server Memory Profiling
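
A small sketch of the memory.stat-based monitoring idea follows; it uses cgroup v1 field names, and the cgroup path is illustrative:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Read a cgroup v1 memory.stat file and split the footprint into anonymous
// memory vs. page cache, since both count against the cgroup memory limit.
func main() {
	f, err := os.Open("/sys/fs/cgroup/memory/kube-apiserver/memory.stat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	stats := map[string]uint64{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	fmt.Printf("anonymous (rss): %d bytes, page cache: %d bytes\n",
		stats["total_rss"], stats["total_cache"])
}
```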

Conclusion

With our governance, resilience, and operability efforts, we were able to significantly reduce sudden surges in compute resource usage and control plane bandwidth, and to ensure the stability and performance of the whole platform. The kube-apiserver QPS (mostly reads) was reduced by 90% after the optimization rollout (as shown in the graph below), which makes kube-apiserver usage more stable, efficient, and robust. The deep knowledge of Kubernetes internals and the additional insights we gained will enable the team to do a better job of system operation and cluster maintenance.

Figure 9: Kube-apiserver QPS Reduction After Optimization Rollout

Here are some key takeaways that will hopefully help on your next journey of solving Kubernetes scalability and reliability problems:

  1. Diagnose problems to get at their root causes. Focus on the “what is” before deciding “what to do about it.” The first step of solving problems is to understand what the bottleneck is and why. If you get to the root cause, you are halfway to the solution.
  2. It is almost always worthwhile to first look into small incremental improvements rather than immediately commit to radical architecture change. This is important, especially when you have a nimble team.
  3. Make data-driven decisions when you plan or prioritize investigations and fixes. The right telemetry can help you make better decisions about what to focus on and optimize first.
  4. Critical infrastructure components should be designed with resilience in mind. Distributed systems are subject to failures, and it is best to always prepare for the worst. Correct guardrails can help prevent cascading failures and minimize the blast radius.

Looking Forward

Federation

As our scale grows steadily, a single-cluster architecture has become insufficient to support the increasing number of workloads being onboarded. After ensuring an efficient and robust single-cluster environment, enabling our compute platform to scale horizontally is our next milestone. By leveraging a federation framework, we aim to plug new clusters into the environment with minimal operational overhead while keeping the platform interface stable for end users. Our federated cluster environment is currently under development, and we look forward to the additional possibilities it opens up once productized.

Capacity Planning

Our current approach to resource quota enforcement is a simplified and reactive form of capacity planning. As we onboard user workloads and system components, the platform dynamics change, and project-level or cluster-wide capacity limits can become outdated. We want to explore proactive capacity planning with forecasting based on historical data, growth trajectories, and a sophisticated capacity model that covers not only resource quotas but also API quotas. We expect more proactive and accurate capacity planning to prevent the platform from over-committing and under-delivering.

Acknowledgements

Many engineers at Pinterest helped scale the Kubernetes platform to keep up with business growth. Besides the Cloud Runtime team — June Liu, Harry Zhang, Suli Xu, Ming Zong, and Quentin Miao, who worked hard to build the scalable and stable compute platform we have today — Balaji Narayanan, Roberto Alcala, and Rodrigo Menezes, who lead our Site Reliability Engineering (SRE) effort, worked together with us to ensure a solid foundation for the compute platform. Kalim Moghul and Ryan Albrecht, who lead the Capacity Engineering effort, contributed to project identity management and system-level profiling. Cedric Staub and Jeremy Krach, who lead the Security Engineering effort, maintained a high standard so that our workloads can run securely on a multi-tenant platform. Lastly, our platform users Dinghang Yu, Karthik Anantha Padmanabhan, Petro Saviuk, Michael Benedict, Jasmine Qin, and many others provided a lot of useful feedback and requirements, and worked with us to make sustainable business growth happen.
