Manas Realtime — Enabling Changes to Be Searchable in a Blink of an Eye

1,380
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Michael Mi | Tech Lead, Core Product Serving Infra


Manas, Pinterest’s in-house search engine, is a generic information retrieval platform. As we discussed in our previous post, Manas was designed as a search framework with high performance, availability, and scalability. Today, Manas powers search for the majority of Pinterest products, including Ads, Search, Homefeed, Related Pins, Visual, and Shopping.

One of the key metrics for a search system is the indexing latency, which is the time taken to update the search index to reflect changes. As we keep growing the system capabilities and onboarding new use cases, the ability to instantly index new documents has become more important. Manas already supports incremental indexing, which is able to provide indexing latency within the order of tens of minutes. Unfortunately, this can’t meet our growing business requirements from Ads and following feeds. We decided to build a new module within Manas to further reduce indexing latency to a fraction of a second.

In this blog post we describe the architecture of the system and its key challenges, and we provide details about the tradeoffs we made.

Challenges

New requirements come with new challenges. Here are several of the major challenges we faced.

Indexing Latency

The tiny batch approach, aka near-realtime, is the most popular choice for open source projects like Lucene, Vespa, etc. With this approach, the newly written document is not searchable until index commit is called. As a result, you need to make a tradeoff between indexing latency and throughput. Unfortunately, we can’t leverage this approach to reduce indexing latency to the order of seconds.

Index Refresh Ability

One of the drawbacks of realtime serving is the lack of index refresh agility. For a batch pipeline, it is trivial to rerun the indexing job to pick up all schema changes at once. However, when it comes to the realtime serving pipeline, an efficient index refresh support becomes complicated.

Scale-up for Constantly Changing Data

To avoid over-provisioning, auto-scaling was employed to adjust replicas based on the actual query load. If the index is immutable, it is relatively easy to bring up new replicas: you just need to copy the index to new nodes, and you are done. All of the difficulty lies in handling the constantly changing index: how to ensure that all replicas end up with the same index?

Error Recovery

Manas is a data-intensive service where each host may serve an index of up to several hundred GBs. Manas is also a stateful system; a bad binary could introduce data issues that rollbacks would not be able to fix. We needed to build a system that supports both fault tolerance and error recovery so that it is possible to recover from both binary bugs and data corruption.

Moving from Static to Realtime

Let’s take a brief look at the differences between conventional static serving and realtime serving. As shown in the above diagram, the major work for realtime serving is moving the indexing pipeline from offline to online.

For static serving, the indexes are generated offline with a batch workflow, and then they are copied to leaf for online serving. With batch workflows, due to the high framework overhead, it is barely possible to build a servable index within a fraction of a second. For realtime serving, instead of using an offline workflow, all writes are handled on the fly within the service. In addition, the realtime indexing pipeline handles writes in a way that generates the same index format as the static indexing pipeline, allowing us to reuse the entire index read logic. With this in mind, let’s continue the journey of understanding how realtime serving works.

Indexing Interface

Instead of directly using RPC, Kafka was employed as our high write throughput stream. Leaf servers continuously pull mutations to build incremental indexes. Turns out this decision dramatically simplified our system in multiple ways:

  • Data replications and write failures are taken care of by Kafka.
  • With the seek back ability, the Kafka queue also serves as WAL.
  • With a strict ordering guarantee in each partition, the system can blindly apply deletions without needing to worry about correctness.

Architecture Overview

Since the serving logic can be reused with a shared index format, we will focus on the indexing data flow.

Essentially, realtime Manas leaf is a LSM engine, which converts random IOs writes into sequential IOs and enables efficient serving for both read amplification and write amplification applications. As shown below, the whole indexing process consists of three critical steps. Let’s discuss them one by one.

Realtime Segment Build

We introduced realtime segments except for the existing static segments. As shown above, there are two types of realtime segments in the system: active realtime segments, and sealed realtime segments.

  • Active realtime segment, the only mutable component, is used to accumulate mutations (adds/deletes) pulled from Kafka. It’s worth pointing out that after a document is added into a realtime segment, it becomes searchable immediately after the document level commit.
  • Once the active realtime segment reaches a configurable threshold, it is sealed, becomes immutable, and is put into a flush queue. Meanwhile, a new active realtime segment is created to continue accumulating mutations.

In the case of a service restart, the realtime segments can be reconstructed by replaying messages from Kafka.

Index Flush

Index flush is the process of persisting in-memory data from a realtime segment into a compact index file. A flush is automatically triggered when a realtime segment gets sealed, and a flush can also be manually triggered using a debug command.

The index flush is a beneficial operator that guarantees data persistency so that we don’t need to reconstruct in-memory segments from scratch during restart. In addition, flushing reduces a segment’s memory footprint and improves serving efficiency with a compact immutable index.

Index Compaction

Over time, multiple generated small segments hurt serving performance. To overcome this, we introduced a background compaction thread to merge small segments into bigger ones. Since deletion operators just mark documents as deleted instead of physically deleting them, the compaction thread also persists these deleted/expired documents.

After each flush and compaction operator, a new index manifest consisting of all the static segments would be generated. Kafka offsets, used as checkpoints, are also added into each manifest. Based on the checkpoints, the service knows where to consume messages after a restart.

Detailed Design

In this section, we will cover several key areas in more detail. Let’s start with the most interesting part, the concurrency model.

Concurrency Model

The realtime segment, as aforementioned, is the only mutable component where we need to handle both read and write simultaneously. Unfortunately, the near-realtime approach employed by open source projects can’t meet our business requirement. Instead, we chose a different approach that enables us to commit a document immediately after adding into the index without waiting for an index flush. For the sake of performance, we employed a lock-free technique for data structures tailored to our usage. Now let’s open the box!

Realtime Segment

Each realtime segment consists of an inverted index and a forward index. The inverted index is logically a mapping from term to posting list, a list of document ids, used for retrieval. Meanwhile, the forward index stores an arbitrary binary blob used for full scoring and data fetching. Let’s only focus on realtime inverted index part, which is more interesting and challenging as compared to the forward index.

At a high level, the major difference between a realtime segment and a static segment is mutability. For the realtime inverted index, the map from term to posting list needs to be a concurrent one. This is well supported by open sources like folly’s concurrent hashmap. What we care about more is the internal representation for the posting list, which can support our concurrency model in an efficient way.

Append-only Vector

Usually, it is more efficient and easier to reason about a single-writer, multiple-readers model. We chose a similar data model as HDFS with an append-only lock-free data structure. Let’s take a closer look at how the reader and the writer interact with each other.

  • Writer appends doc id into the vector, then commits size to make it accessible to readers
  • Reader takes a snapshot up till committed size before accessing data

In order to avoid memory copying overhead as the posting list grows, internally we manage data as a list of buckets. We just need to add a new bucket without touching old ones when we run out of capacity. In addition, usually search engines use skip lists to speed up the skip operator. Thanks to this format, it is convenient to support a single-level skip list, which is good enough for realtime inverted index since the size of it is usually small.

Document Atomicity

Now with an append-only vector, we are able to achieve atomicity for a single posting list. However, a document can contain a list of terms, and we may end up returning unexpected documents with a partially updated index. To address this potential issue, we introduced a document level commit to guarantee document atomicity. In the serving pipeline, an additional filter is used to make sure only committed documents are returned.

Speaking of document atomicity, document updating is another scenario worth mentioning here. For each document update, we deliberately convert it to two operators: adding the new document, then deleting the old one from the index. Although each operator is atomic, together we can’t guarantee atomicity. We think it is ok to either return the old version or the new version in a very short time window, but nevertheless, we added dedupe logic into the serving pipeline to filter out the old one when both are returned.

Writes Scaling

One question that naturally comes up is that if your data structures only support the single-write and multiple-reads concurrent model, what if a single thread can’t handle all the writes in time? It does not seem like a good idea to blindly add more shards just to scale write throughput. While this is a valid concern, it has already been taken care of in our design.

The single-write and multiple-reads concurrent model used for data structures doesn’t mean we are not able to use multiple threads for writes. We planned to use the term-shard strategy to support writes with multiple threads. As shown in the above diagram, for a given document with a list of terms, each term would be always mapped to the fixed thread so that all data structures tailored for single-write and multiple-reads can be reused directly without any limitations.

Index Refresh

Index refresh ability is a critical feature for our products, enabling quick turnaround and improving dev velocity. Generally, two approaches can be used to refresh the index in an efficient way, backfilling on the fly and reinstating from the offline built index, respectively.

Backfilling Index

We provide the ability to backfill documents at a reasonable throughput. To avoid impacting production freshness, we need a separate stream for backfill traffic with a lower priority. As a result, it is possible that two versions of a document are present in both streams and the old version overrides the new one. To overcome this, we need to introduce a versioning mechanism and a conflict resolver in the realtime indexing pipeline to decide which one is more fresh.

Reinstating from Offline Built Index

Sometimes, backfilling at a given speed for a full dataset would be too time-consuming. Another quicker index refresh approach we support is building an index offline, and then reinstating from it, with a synchronization mechanism between the offline built index and Kafka stream.

Failover and Auto-scaling

From time to time, we need to bring up new instances for various reasons, like failover and auto-scaling, etc. For static serving, it is easy to start a new instance with an immutable index downloaded from the index store. However, it becomes complicated for realtime serving with a constantly changing index. How do we ensure that new instances have the same copy of the index as others eventually?

We decided to use leader-based replication, as shown in the above diagram. Our process would look like this:

  1. The leader periodically takes a new snapshot and uploads it to the durable index store
  2. New instances download the latest snapshot, by default, from the index store
  3. New instances resume consuming messages from Kafka based on the checkpoint from the snapshot index
  4. New instances start serving traffic once they have caught up

There are some key points in the design worth pointing out:

Leader Election

The only responsibility of the leader is to take snapshots and upload the index periodically. This means we can afford to have no leader or have multiple leaders for a short period of time, up to hours. Therefore, we have some flexibility in choosing a leader election algorithm. For simplicity, we chose to use our cluster maintenance job to select a leader statically, where we periodically check if we have a good leader.

Snapshot Upload

Usually, the new instance just connects to the leader to download the latest snapshot. In this approach, the snapshot downloading from new instances would potentially overload the leader, leading to cascading failures. Instead, we chose to upload snapshots periodically to the index store, trading space and freshness for stability. In addition, the uploaded snapshots are useful for error recovery, which will be covered shortly.

Error Recovery

As aforementioned, error recovery is another challenge for realtime serving system. There are some specific scenarios involving data corruption we need to handle.

Input Data Corruption

We use Kafka as our input write stream; unfortunately, those messages are immutable due to the fact that a producer can just append messages to it but can’t change content for existing ones. This means once the data corruption is introduced into Kafka messages, it is permanent. Thanks to the uploaded snapshots, we have the ability to rewind our index to the point without corruption, skip corrupted messages, and then consume new messages with the fix.

Binary Bugs Caused Data Corruption

Although we have a mature indexing validation pipeline for static clusters to guarantee no issues with the new index and the new binary before swapping in the new version, it is still possible that some bugs sneak into production. Fortunately, we can fix the issue by rolling back the binary or index. It becomes much harder for realtime serving where rolling back the binary can’t roll back the errors in the index. Using our snapshot uploading mechanism, we are able to rollback the binary together with a rewinded index and then replay messages from Kafka to fix errors in the index.

What’s next

As more scenarios are onboarded to Manas, we need to keep improving the system’s efficiency, scalability, and capability. Some interesting projects in our roadmap are as follows:

  • Cohosting static and realtime clusters to simplify our serving stack
  • Optimizing the system to support a large dataset
  • Building a generic embedding based retrieval to power advanced scenarios

Acknowledgments: This post summarizes several quarter’s work that involved multiple teams. Thanks to Tim Koh, Haibin Xie, George Wu, Sheng Chen, Jiacheng Hong and Zheng Liu for their countless contributions. Thanks to Mukund Narasimhan, Angela Sheu, Ang Xu, Chengcheng Hu and Dumitru Daniliuc for many meaningful discussions and feedbacks. Thanks to Roger Wang and Randall Keller for the great leadership.

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Senior Backend Engineer, User Underst...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

The user understanding team builds cutting edge user understanding models and systems to deeply understand the evolving interests, intents and tastes of our 450m+ users, which is one of the most essential ML components powering personalization of Pinterest products across Discovery (Homefeed, Search, Related Pins), Ads, Shopping and Growth. As a backend engineer, you will have the opportunity to help build the next generation large-scale user signal platform and potentially affect every surface of Pinterest with deeper user understanding.

What you’ll do:

  • As a backend engineer, design and build large scale systems that can process profile, activities and feedback from hundreds of millions of pinterest users.
  • Design and build systems / tools, e.g.,
  • Drive cross functional collaborations with partner teams adopting user signals and models, e.g. ads, Homefeed, shopping, search and etc.
  • Work with ML Engineers to manage large-scale ML models in production

What we’re looking for:

  • 4+ years of industry experience.
  • Expert in Python and Java or other static languages like Go or C++
  • Hands-on experience building complex backend systems leveraged by multiple clients.
  • Hands-on experience with big data technologies (e.g., Hadoop/Spark/Kafka) and scalable realtime systems that process stream data.
  • Nice to have:
    • Experience working with privacy sensitive data or GDPR compliance.
    • Basic knowledge of machine learning (or willing to learn!): feature extraction, training, model serving etc.

#LI-TG1

Engineer Manager, Content Knowledge S...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest helps people Discover and Do the things they love. We have more than 450M monthly active users who actively curate an ecosystem of more than 100B Pins on more than 1B boards, creating a rich human curated graph of immense value. 

Technically, we are building out an internet scale personalized recommendation engine in 22+ languages, which requires a deep understanding of the users and content on our platform. As engineer manager on the Content Knowledge Signal team, you’ll work on building 20+ content understanding signals based on Pinterest Knowledge Graph, which will make measurably positive impact on hundreds of millions of users with improved recommendation and featurization breakthroughs on almost all Pinterest product surfaces (Discovery, Shopping, Growth, Ads, etc). 

What you'll do:

  • Manage a horizontal team of talented and dedicated ML engineers to build the foundational content understanding and engagement features of our contents to be used across all Pinterest ecosystems
  • Utilize state of the art algorithms/industry best practice to build and improve content understanding signals 
  • Partner with other engineering teams and sales & marketing team to discover future opportunities to improve content recommendation on Pinterest
  • Hire new engineers to grow the team
  • Build ML models using text and visual information of a pin, identify the most relevant set of text annotations for that pin. These sets of highly relevant annotations are among the most important features used in more than 30 use cases within Pinterest, including key ranking models of Homefeed, Search and Ads.
  • Build ML models using text and images of the products, to understand their product categories (bags, shoes, shirts, etc) and their attributes (brand, color, style, etc). They are used to greatly improve relevance for product recommendation on major shopping surfaces. 
  • Build ML models to understand search queries, then use them, together with Pin level signals, to boost search relevance. 
  • Build graph based embedding as well as explicit annotation to represent the specialties of our native content creators, to improve creator and native content recommendation.
  • Build highly efficient and expandable data pipelines to understand engagement data at various entity levels. Such engagement signals are the major feature of the ranking models for our three main Discovery surfaces. 
  •  

What we're looking for:

  • 2+ years of industrial experience in ML team’s EM or TL for one or multiple of the following use cases with large scale: ads targeting, search and discovery, growth, content/user understanding
  • Hands-on experience working with ML algorithm development and productization.  
  • Experience working with PMs and XFN partners on E2E systems and moving business metrics

#TG1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Software Engineer, Machine Learning P...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

We are seeking a senior software engineer to build and boost Pinterest’s machine learning training and serving platforms and infrastructure. The candidate will work with different teams to design, build and improve our ML systems, including the model training computation platform, serving systems and model deployment systems.

What you'll do:

  • Design and build solutions to make the model training, serving and deployment process more efficient, more reliable, and less error-prone by human mistakes.
  • Design and build long term solutions to boost the model iteration velocity for machine learning engineers and data scientists.
  • Work extensively with ML engineers across Pinterest to understand their requirements, pain points, and build generalized solutions. Also work with partner teams to drive projects requiring cross-team coordination. 
  • Provide technical guidance and coaching to other junior engineers in the team.

What we're looking for:

  • Hands-on experience developing large-scale machine learning models in production, or experience working on the systems supporting onboarding large-scale machine learning models.
  • Ability to drive cross-team projects; Ability to understand our internal customers (ML practitioners), their common usage patterns and pain points.
  • Flexibility to work across different areas: tool building, model optimization, infrastructure optimization, large scale data processing pipelines, etc.
  • 5+ years of professional experience in software engineering.
  • Fluency in Python and either Java or Scala (Fluency in C++ for the MLS role).
  • Past tech lead experience is preferred, but not required. (Not necessary for the MLS role).

#LI-GB2

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Engineering Manager, Ads Engagement M...
San Francisco, CA, US; Palo Alto, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest is one of the fastest growing online ad platforms, and our success depends on mining rich user interest data that helps us connect users with highly relevant advertisers/products. We’re looking for an Engineering Manager with experience in machine learning, data mining, and information retrieval to lead a team that develops new data-driven techniques to show the most engaging and relevant promoted content to the users. You’ll be leading a world-class ML team that is growing quickly and laying the foundation for Pinterest’s business success.

What you’ll do:

  • Manage and grow the engineering team, providing technical vision and long-term roadmap
  • Design features and build large-scale machine learning models to improve ads engagement prediction
  • Effectively collaborate and partner with several cross functional teams to build the next generation of ads engagement models
  • Mentor and grow ML engineers to allow them to become experts in modeling/engagement prediction 

What we’re looking for:

  • Degree in Computer Science, Statistics or related field
  • Industry experience building production machine learning systems at scale, data mining, search, recommendations, and/or natural language processing
  • 1+ years of experience leading projects/ teams either as TL/ TLM/ EM
  • Cross-functional collaborator and strong communicator
  • Experience with ads domain is a big plus

#LI-SM4

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Verified by
Security Software Engineer
Tech Lead, Big Data Platform
Software Engineer
Talent Brand Manager
Sourcer
Software Engineer
You may also like