Faster Flink Adoption with Self-Service Diagnosis Tool at Pinterest

538
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

Fanshu Jiang & Lu Niu | Software Engineers, Stream Processing Platform Team


At Pinterest, stream data processing powers a wide range of real-time use cases. In recent years, the platform powered by Flink has proven to be of great value to the business by providing near real-time content activation and metrics reporting, with the potential to unlock more use cases in the future. However, to take advantage of that potential, we needed to address the issue of developer velocity.

It can take weeks to go from writing the first line of code to a stable data flow in production. Troubleshooting and tuning Flink jobs can be particularly time-consuming, due to the number of logs and metrics to investigate and the variety of configs available to tune. Sometimes, it requires a deep understanding of Flink internals to find the root cause of issues during development. This can not only affect developer velocity and create a subpar Flink onboarding experience, but also requires significant platform support, causing restrictions to scalability of streaming use cases.

To make investigation easier and faster, we built out a Flink diagnosis tool, DrSquirrel, to surface and aggregate job symptoms, provide insights into the root cause, and suggest a solution with actionable steps. The tool has resulted in significant productivity gains for developers and the platform team since its release.

What is challenging about Flink job troubleshooting?

Massive pool of scattered logs and metrics, only a few of which matter

For troubleshooting, engineers usually:

  • scroll through a wall of JM/TM logs from YARN UI
  • check dozens of job/server metric dashboards
  • search and verify job configs
  • click through the Flink Web UI job DAG to find details like checkpoint alignment, data skew and backpressure

However 90% of the stats we spend time on are either benign or simply unrelated to the root cause. Having a one-stop-shop that aggregates only useful information and surfaces only what matters to troubleshooting saves enormous amounts of time.

Here are the bad metrics, now what?

This is a commonly asked question once stakeholders identify bad metrics, because more reasoning is required to get the root cause. For example, checkpoint timeout could mean incorrect timeout configuration, but also could be a consequence of backpressure, slow s3 upload, bad GC, or data skew; Lost TaskManager logs could mean bad node, but oftentimes is a result of either heap or RocksDB statebackend OOM. It takes time to understand all that reasoning and thoroughly verify each possible cause. However, 80% of the issue-fixing follows a pattern. This made us wonder — as a platform team, should we analyze the stats programmatically and tell stakeholders what to tune without having them do the reasoning?

Troubleshooting doc is far from enough

We provide a troubleshooting doc to customers. However, with the growing number of troubleshooting use cases, the doc is getting too long to quickly spot the relevant diagnosis and instructions for an issue. Engineers also have to manually apply if-else diagnosis logic to determine the root cause. This has added much friction to self-serve diagnosis, and the reliance on the platform team for troubleshooting remains. Besides, the doc is not great at call-to-action whenever the platform pushes a new job health requirement. We realize that a better tool is needed to efficiently share troubleshooting takeaways and enforce cluster-wise job health requirements.

Dr. Squirrel, a self-service diagnosis tool for troubleshooting

Given the above challenges, we built out DrSquirrel — a diagnosis tool for fast issue detection and troubleshooting guidance designed to:

  • cut down the troubleshooting time from hours to minutes
  • reduce the tools developers need for investigations from many to one; and
  • lower the required Flink internal knowledge for troubleshooting from intermediate to little

In a nutshell, we aggregate useful information in one place, perform job health checks, flag unhealthy ones explicitly, and provide root cause analysis and actionable steps to help fix the issues. Let’s take a look at some feature highlights.

More efficient ways to view logs

For each job run, Dr. Squirrel highlights exceptions that directly trigger restarts (i.e. TaskManager lost, OOM) to help quickly find the relevant exceptions to focus on from a massive pool of logs. It also collects all warnings, errors, and info logs that contain a stack trace in separate sections. For each log, Dr. Squirrel checks the content to see if an error keyword can be found, then provides a link to our step-by-step solution in the troubleshooting guide.

Dr.Squirrel suggestion

All logs are searchable using the search bar. On top of that, Dr. Squirrel provides two ways to view logs more efficiently — Timeline view and Unique exception view. As shown below, the Timeline view allows you to view logs chronologically with class name and pre-populated ElasticSearch link if more details are needed.

Timeline view of logs

With one click, we can switch to the Unique Exception view, where the same exceptions are grouped in one row with metadata such as first, last, and total occurrence. This simplifies the process of identifying the most frequent exceptions.

Unique exception view

Job health at a glance

Dr. Squirrel provides a health check page that enables engineers, whether beginners or experts, to tell confidently whether the job is healthy. Instead of showing plain metric dashboards, Dr. Squirrel monitors each metric for 1 hour and flags explicitly if it passes our platform stability requirements. This is an efficient and scalable way for the platform team to communicate and enforce what is considered stable.

The health check page consists of multiple sections, each focusing on a different aspect of job health. Quick browsing through these sections is all needed to get a good idea of the overall job health:

  • Basic Job Stats section monitors basic stats such as throughput, rate of full restarts, checkpoint size/duration, consecutive checkpoint failure, max parallelism over the past 1h. When metrics fail the health check, they are marked as Failed and ranked at the top.

Basic Job stats section

  • Backpressured Tasks tracks the backpressure situation of each operator at fine granularity. No backpressure within a minute is visualized as a green square, otherwise a red square. 60 squares for each operator, representing the backpressure situation of the past 1 hour. This makes it easy to identify how frequently backpressure happens and which operator starts the earliest.

Backpressured Task section

  • GC Old Gen Time section has the same visualization as backpressure to provide an overview of whether the GC is occurring too often and could potentially affect throughput or checkpoint. With the same visualization, it becomes obvious whether GC and backpressure happen at the same time and whether GC may potentially cause backpressure.

GC old gen section

  • JobManager/TaskManager Memory Usage tracks the YARN container memory usage, which is the resident set size (RSS) memory of the Flink Java process we collect through daemon running on the worker nodes. RSS memory is more accurate because it includes all sections in the Flink memory model as well as memory that’s not tracked by Flink, such as JVM process stack, threads metadata, or memory allocated from user code through JNI. We mark the configured max JM/TM memory in the graph, as well as 90% usage threshold to help users quickly spot which containers are close to OOM.

JM/TM memory graph

  • CPU% Usage section surfaces the containers that use more CPU capacity than the vcores they are assigned to. This helps monitor and avoid “Noisy neighbor” issues in the multi-tenant Hadoop cluster. Very high CPU% usage could result in one user’s workload impacting the performance and stability of another user’s workload.

CPU% usage section

Effective configurations

Flink jobs can be configured at different levels, such as in-code configurations at execution level, job properties file, command line arguments at client level and flink-conf.yaml at system level. It’s not uncommon for engineers to configure the same parameter at different levels for testing or hotfixing. With the override hierarchy, it is not obvious what value is eventually taking effect. To address this issue, we built a configuration library that figures out effective configuration values that the job is running with and surfaces these configurations to Dr. Squirrel.

Queryable cluster-wise job healthiness

Provided with abundant job stats, Dr. Squirrel becomes a resource center to learn cluster-wise job healthiness and find insights into platform improvements. For example, what are the top 10 restart root causes or what percentage of jobs run into memory issues or backpressure.

Architecture

As seen in the features above, metrics and logs are gathered all into one place. To collect them in a scalable way, we added a MetricReporter and KafkaLog4jAppender to our Flink custom build to continuously send metrics and logs to kafka topics. The KafkaLog4jAppender also serves to filter out logs that matter to us — warnings, errors, and info logs that come with a stacktrace. Following that is FlinkJobWatcher — a Flink job that joins metrics and logs that come from the same job after a series of parsing and transformation. FlinkJobWatcher then creates a snapshot of job health every 5 min and sends it to the JobSnapshot Kafka topic.

The growing number of Flink use cases have been introducing massive amounts of logs and metrics. FlinkJobWatcher as a Flink job handles the increasing data scale perfectly and keeps the throughput on par with the number of use cases with easy parallelism tuning.

Our Flink custom build

Once the JobSnapshot is available, more data needs to be fetched and merged into the JobSnapshot. For this purpose, we built a RESTful service using dropwizard that keeps reading from the JobSnapshot topic and pulls external data via RPC. The external data sources include YARN ResourceManager to get static data such as username and launch time, Flink REST API to get configurations, an internal tool called Automated Canary Analysis(ACA) to compare time series metrics against a threshold with fine-grained criteria, and a couple of other internal tools that allow us to surface custom metrics like RSS memory and CPU% usage, which are collected from a daemon running on the worker nodes. A nice UI is also built out with React to make job health easy to explore.

Dr. Squirrel web service

Future Work

We will continue improving Dr. Squirrel with better job diagnosis capability to help us move one step closer to fully self-serve onboarding:

  • Capacity planning: monitor and evaluate throughput, usage of memory and vcores to find the most efficient resource settings.
  • Integration with CICD: we are running a CICD pipeline to automatically verify and push changes from dev to prod. Dr.Squirrel will be integrated with CICD to provide more confidence about the job health situation as CICD pushes out new changes.
  • Alert & notification: notify job owner or platform team with a health report summary.
  • Per-job cost estimate: show cost estimate of each job based on resource usage for budget planning and awareness.

Acknowledgment

Shoutout to Hannah Chen, Nishant More, and Bo Sun for their contributions to this project. Many thanks to Ping-Min Lin for setting up the initial UI work and Teja Thotapalli for the infra setup on the SRE side. We also want to thank Ang Zhang, Chunyan Wang, Dave Burgess for their support and all our customer teams for providing valuable feedback and troubleshooting scenarios to help us make the tool powerful.

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Engineer Manager, Content Knowledge S...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest helps people Discover and Do the things they love. We have more than 450M monthly active users who actively curate an ecosystem of more than 100B Pins on more than 1B boards, creating a rich human curated graph of immense value. 

Technically, we are building out an internet scale personalized recommendation engine in 22+ languages, which requires a deep understanding of the users and content on our platform. As engineer manager on the Content Knowledge Signal team, you’ll work on building 20+ content understanding signals based on Pinterest Knowledge Graph, which will make measurably positive impact on hundreds of millions of users with improved recommendation and featurization breakthroughs on almost all Pinterest product surfaces (Discovery, Shopping, Growth, Ads, etc). 

What you'll do:

  • Manage a horizontal team of talented and dedicated ML engineers to build the foundational content understanding and engagement features of our contents to be used across all Pinterest ecosystems
  • Utilize state of the art algorithms/industry best practice to build and improve content understanding signals 
  • Partner with other engineering teams and sales & marketing team to discover future opportunities to improve content recommendation on Pinterest
  • Hire new engineers to grow the team
  • Build ML models using text and visual information of a pin, identify the most relevant set of text annotations for that pin. These sets of highly relevant annotations are among the most important features used in more than 30 use cases within Pinterest, including key ranking models of Homefeed, Search and Ads.
  • Build ML models using text and images of the products, to understand their product categories (bags, shoes, shirts, etc) and their attributes (brand, color, style, etc). They are used to greatly improve relevance for product recommendation on major shopping surfaces. 
  • Build ML models to understand search queries, then use them, together with Pin level signals, to boost search relevance. 
  • Build graph based embedding as well as explicit annotation to represent the specialties of our native content creators, to improve creator and native content recommendation.
  • Build highly efficient and expandable data pipelines to understand engagement data at various entity levels. Such engagement signals are the major feature of the ranking models for our three main Discovery surfaces. 
  •  

What we're looking for:

  • 2+ years of industrial experience in ML team’s EM or TL for one or multiple of the following use cases with large scale: ads targeting, search and discovery, growth, content/user understanding
  • Hands-on experience working with ML algorithm development and productization.  
  • Experience working with PMs and XFN partners on E2E systems and moving business metrics

#TG1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Software Engineer, Machine Learning P...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

We are seeking a senior software engineer to build and boost Pinterest’s machine learning training and serving platforms and infrastructure. The candidate will work with different teams to design, build and improve our ML systems, including the model training computation platform, serving systems and model deployment systems.

What you'll do:

  • Design and build solutions to make the model training, serving and deployment process more efficient, more reliable, and less error-prone by human mistakes.
  • Design and build long term solutions to boost the model iteration velocity for machine learning engineers and data scientists.
  • Work extensively with ML engineers across Pinterest to understand their requirements, pain points, and build generalized solutions. Also work with partner teams to drive projects requiring cross-team coordination. 
  • Provide technical guidance and coaching to other junior engineers in the team.

What we're looking for:

  • Hands-on experience developing large-scale machine learning models in production, or experience working on the systems supporting onboarding large-scale machine learning models.
  • Ability to drive cross-team projects; Ability to understand our internal customers (ML practitioners), their common usage patterns and pain points.
  • Flexibility to work across different areas: tool building, model optimization, infrastructure optimization, large scale data processing pipelines, etc.
  • 5+ years of professional experience in software engineering.
  • Fluency in Python and either Java or Scala (Fluency in C++ for the MLS role).
  • Past tech lead experience is preferred, but not required. (Not necessary for the MLS role).

#LI-GB2

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Engineering Manager, Ads Engagement M...
San Francisco, CA, US; Palo Alto, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest is one of the fastest growing online ad platforms, and our success depends on mining rich user interest data that helps us connect users with highly relevant advertisers/products. We’re looking for an Engineering Manager with experience in machine learning, data mining, and information retrieval to lead a team that develops new data-driven techniques to show the most engaging and relevant promoted content to the users. You’ll be leading a world-class ML team that is growing quickly and laying the foundation for Pinterest’s business success.

What you’ll do:

  • Manage and grow the engineering team, providing technical vision and long-term roadmap
  • Design features and build large-scale machine learning models to improve ads engagement prediction
  • Effectively collaborate and partner with several cross functional teams to build the next generation of ads engagement models
  • Mentor and grow ML engineers to allow them to become experts in modeling/engagement prediction 

What we’re looking for:

  • Degree in Computer Science, Statistics or related field
  • Industry experience building production machine learning systems at scale, data mining, search, recommendations, and/or natural language processing
  • 1+ years of experience leading projects/ teams either as TL/ TLM/ EM
  • Cross-functional collaborator and strong communicator
  • Experience with ads domain is a big plus

#LI-SM4

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Engineering Manager, Ads Marketplace
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Within the Ads Quality team, we try to connect the dots between the aspirations of Pinners and the products offered by our partners. In this role, you will lead a team of engineers that is responsible for monetizing our Shopping and Creator surfaces. Using your strong analytical skill sets, a thorough understanding of auction mechanisms, and experience in managing an engineering team, you will advance the state of the art in Marketplace design and yield management.

What you’ll do:

  • Manage a team of engineers with a background in ML, backend development, economics, and data science to:
    • Monetize new surfaces effectively and responsibly 
    • Interface with our Product and Organic teams to understand requirements and build solutions that cater to our advertisers and users
    • Build models to enable scalable solutions for ad allocation, eligibility and pricing on the new surfaces
    • Hold a high standard for engineering excellence by building robust and future proof systems with an appreciation for simplicity and elegance
    • Identify gaps and opportunities as we expand and execute on closing those gaps effectively and in a timely manner
  • Work closely with Product on planning roadmap, set technical direction and deliver values
  • Coach and mentor team members and help them develop their career path and achieve their career goals

What we’re looking for:

  • Degree in Computer Science, Statistics, or related field
  • 2+ years of management experience
  • 5+ years of relevant experience
  • Background in computational advertising, econometrics, shopping
  • Strong industry experience in machine learning
  • Experience with ads domain is a big plus
  • Cross-functional collaborator and strong communicator

#LI-SM4

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Verified by
Tech Lead, Big Data Platform
Software Engineer
Talent Brand Manager
Sourcer
Software Engineer
Security Software Engineer
You may also like