MemQ: An Efficient, Scalable Cloud Native PubSub System

396
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Ambud Sharma | Tech Lead and Engineering Manager, Logging Platform


The Logging Platform powers all data ingestion and transportation at Pinterest. At the heart of the Pinterest Logging Platform are Distributed PubSub systems that help our customers transport / buffer data and consume asynchronously.

In this blog we introduce MemQ (pronounced mem — queue), an efficient, scalable PubSub system developed for the cloud at Pinterest that has been powering Near Real-Time data transportation use cases for us since mid-2020 and complements Kafka while being up to 90% more cost efficient.

History

For nearly a decade, Pinterest has relied on Apache Kafka as the sole PubSub system. As Pinterest grew, so did the amount of data and the challenges around operating a very large scale distributed PubSub platform. Operating Apache Kafka at Scale gave us a great deal of insight on how to build a scalable PubSub system. Upon deep investigation of the operational and scalability challenges of our PubSub environment, we arrived at the following key takeaways:

  1. Not every dataset needs sub-second latency service, latency and cost should be inversely proportional (lower latency should cost more)
  2. Storage and Serving components of a PubSub system need to be separated to enable independent scalability based on resources.
  3. Ordering on read instead of write provides required flexibility for specific consumer use cases (different applications can have different for same dataset)
  4. Strict partition ordering is not necessary at Pinterest in most cases and often leads to scalability challenges.
  5. Rebalancing in Kafka is expensive, often results in performance degradation, and has a negative impact for customers on a saturated cluster.
  6. Running custom replication in a cloud environment is expensive.

In 2018, we experimented with a new type of PubSub system that would natively leverage the cloud. In 2019, we started formally exploring options on how to solve our PubSub scalability challenges and evaluated multiple PubSub technologies based on cost of operations as well as reengineering cost for existing technologies to meet the demands of Pinterest. We finally landed at the conclusion that we needed a PubSub technology that built on the learnings of Apache Kafka, Apache Pulsar, and Facebook LogDevice, and was built for the cloud.

MemQ is a new PubSub system that augments Kafka at Pinterest. It uses a decoupled storage and serving architecture similar to Apache Pulsar and Facebook Logdevice; however, it relies on a pluggable replicated storage layer i.e. Object Store / DFS / NFS for storing data. The net result is a PubSub system that:

  • Handles GB/s traffic
  • Independently scales, writes, and reads
  • Doesn’t require expensive rebalancing to handle traffic growth
  • Is 90% more cost effective than our Kafka footprint

Secret Sauce

The secret of MemQ is that it leverages micro-batching and immutable writes to create an architecture where the number of Input/output Operations Per Second (IOPS) necessary on the storage layer are dramatically reduced, allowing the cost effective use of a cloud native Object Store like Amazon S3. This approach is analogous to packet switching for networks (vs circuit switching, i.e. single large continuous storage of data such as kafka partition).

MemQ breaks the continuous stream of logs into blocks (objects), similar to ledgers in Pulsar but different in that they are written as objects and are immutable. The size of these “packets” / “objects,” known internally in MemQ as a Batch, play a role in determining the End-to-End (E2E) latency. The smaller the packets, the faster they can be written at the cost of more IOPS. MemQ therefore allows tunable E2E latency at the cost of higher IOPs. A key performance benefit of this architecture is enabling separation of read and write hardware dependending on the underlying storage layer, allowing writes and reads to scale independently as packets that can be spread across the storage layer.

This also eliminated the constraints experienced in Kafka where in order to recover a replica, a partition must be re-replicated from the beginning. In the case of MemQ, the underlying replicated storage only needs to recover the specific Batch whose replica counts were reduced due to faults in case of storage failures. However, since MemQ at Pinterest runs on Amazon S3, the recovery, sharding, and scaling of storage is handled by AWS without any manual intervention from Pinterest.

Components of MemQ

Client

MemQ client discovers the cluster using a seed node and then connects to the seed node to discover metadata and the Brokers hosting the TopicProcessors for a given Topic or, in case of the consumer, the address of the notification queue.

Broker

Similar to other PubSub systems, MemQ has the concept of a Broker. A MemQ Broker is a part of the cluster and is primarily responsible for handling metadata and write requests.

Note: read requests in MemQ can be handled directly by the Storage layer unless the read Brokers are used

Cluster Governor

The Governor is a leader in the MemQ cluster and is responsible for automated rebalancing and TopicProcessor assignments. Any Broker in the cluster can be elected a Governor, and it communicates with Brokers using Zookeeper, which is also used for Governor election.

The Governor makes assignment decisions using a pluggable assignment algorithm. The default one evaluates available capacity on a Broker to make allocation decisions. Governor also uses this capability to handle Broker failures and restore capacity for topics.

Topic & TopicProcessor

MemQ, similar to other PubSub systems, uses the logical concept of Topic. MemQ topics on a Broker are handled by a module called TopicProcessor. A Broker can host one or more TopicProcessors, where each TopicProcessor instance handles one topic. Topics have write and read partitions. The write partitions are used to create TopicProcessors (1:1 relation), and the read partitions are used to determine the level of parallelism needed by the consumer to process the data. The read partition count is equal to the number of partitions of the notification queue.

Storage

MemQ storage is made of two parts:

  1. Replicated Storage (Object Store / DFS)
  2. Notification Queue (Kafka, Pulsar, etc.)

1. Replicated Storage

MemQ allows for pluggable storage handlers. At present, we have implemented a storage handler for Amazon S3. Amazon S3 offers a cost effective solution for fault-tolerant, on-demand storage. The following prefix format on S3 is used by MemQ to create the high throughput and scalable storage layer:

s3://<bucketname>/<(a) 2 byte hash of first client request id in batch>/<(b) cluster>/topics/<topicname>

(a) = used for partitioning inside S3 to handle higher request rates if needed

(b) = name of MemQ cluster

Availability & Fault Tolerance

Since S3 is a highly available web scale object store, MemQ relies on its availability as the first line of defense. To accommodate for future S3 re-partitioning, MemQ adds a two-digit hex hash at the first level of prefix, creating 256 base prefixes that can, in theory, be handled by independent S3 partitions just to make it future proof.

Consistency

The consistency of the underlying storage layer determines the consistency characteristics of MemQ. In case of S3, every write (PUT) to S3 Standard is guaranteed to be replicated to at least three Availability Zones (AZs) before being acknowledged.

2. Notification Queue

The notification system is used by MemQ for delivering pointers to the consumer for the location of data. Currently, we use an external notification queue in the form of Kafka. Once data is written to the storage layer, the Storage handler generates a notification message recording the attributes of the write including its location, size, topic, etc. This information is used by the consumer to retrieve data (Batch) from the Storage layer. It’s also possible to enable MemQ Brokers to proxy Batches for consumers at the expense of efficiency. The notification queue also provides clustering / load balancing for the consumers.

MemQ Data Format

MemQ uses a custom storage / network transmission format for Messages and Batches.

The lowest unit of transmission in MemQ is called a LogMessage. This is similar to a Pulsar Message or Kafka ProducerRecord.

The wrappers on the LogMessage allow for the different levels of batching the MemQ does. Hierarchy of units:

  1. Batch (unit of persistence)
  2. Message (unit of producer upload)
  3. LogMessage (unit application interactions with)

Producing Data

A MemQ producer is responsible for sending data to Brokers. It uses an async dispatch model allowing for non-blocking sends to happen without the need to wait on acknowledgements.

This model was critical in order to hide the upload latencies for the underlying storage layers while maintaining storage level acknowledgements. This leads to the implementation of a custom MemQ protocol and client, as we couldn’t use existing PubSub protocols, which relied on synchronous acknowledgements. MemQ supports three types of acks: ack=0 (producer fire & forget), ack=1 (Broker received), and ack=all (storage received). With ack=all, the replication factor (RF) is determined by the underlying storage layer (e.g. in S3 Standard RF=3 [across three AZs]). In case acknowledgement fails, the MemQ producers can explicitly or implicitly trigger retries.

Storing Data

The MemQ Topic Processor is conceptually a RingBuffer. This virtual ring is subdivided into Batches, which allows simplified writes. Messages are enqueued into the currently available Batch as they arrive over the network until either the Batch is filled or a time-based trigger happens. Once a Batch is finalized, it is handed to the StorageHandler for upload to the Storage layer (like S3). If the upload is successful, a notification is sent via the Notification Queue along with the acknowledgements (ack) for the individual Messages in the Batch to their respective Producers using the AckHandler if the producers requested acks.

Consuming Data

MemQ consumer allows applications to read data from MemQ. The consumer uses the Broker metadata APIs to discover pointers to the Notification Queue. We expose a poll-based interface to the application where each poll request returns an Iterator of LogMessages to allow reading all LogMessages in a Batch. These Batches are discovered using the Notification Queue and retrieved directly from the Storage layer.

Other Features

Data Loss Detection: Migrating workloads from Kafka to MemQ required strict validation on data loss. As a result, MemQ has a built in auditing system that enables efficiently tracking E2E delivery of each Message and publishing metrics in near real-time.

Batch & Streaming Unification: Since MemQ uses an externalized storage system, it enables the opportunity to provide support for running direct batch processing on raw MemQ data without needing to transform it to other formats. This allows users to perform ad hoc inspection on MemQ without major concerns around seek performance as long as the storage layer can separately scale reads and writes. Depending on the storage engine, MemQ consumers can perform concurrent fetches to enable much faster backfills for certain streaming cases.

Performance

Latency

MemQ supports both size and time-based flushes to the storage layer, enabling a hard limit on max tail latencies in addition to several optimizations to curb the jitter. So far we are able to achieve a p99 E2E latency of 30s with AWS S3 Storage and are actively working on improving MemQ latencies, which increases the number of use cases that can be migrated from Kafka to MemQ.

Cost

MemQ on S3 Standard has proven to be up to 90% cheaper (avg ~80%) than an equivalent Kafka deployment with three replicas across three AZs using i3 instances. These savings come from several factors like:

  • reduction in IOPS

  • removal of ordering constraints

  • decoupling of compute and storage

  • reduced replication cost due to elimination of compute hardware

  • relaxation of latency constraints

Scalability

MemQ with S3 scales on-demand depending on write and read throughput requirements. The MemQ Governor performs real-time rebalancing to ensure sufficient write capacity is available as long as compute can be provisioned. The Brokers scale linearly by adding additional Brokers and updating traffic capacity requirements. The read partitions are manually updated if the consumer requires additional parallelism to process the data.

At Pinterest, we run MemQ directly on EC2 and scale clusters depending on traffic and new use case requirements.

Future Work

We are actively working on the following areas:

  • Reducing E2E latencies (<5s) for MemQ to power more use cases

  • Enabling native integrations with Streaming & Batch systems

  • Key ordering on read

Conclusion

MemQ provides a flexible, low cost, cloud native approach to PubSub. MemQ today powers collection and transport of all ML training data at Pinterest. We are actively researching expanding it to other datasets and further optimizing latencies. In addition to solving PubSub, MemQ storage can expose the ability to use PubSub data for batch processing without major performance impacts enabling low latency batch processing.

Stay tuned for additional blogs about how we optimized MemQ internals to handle scalability challenges and the open source release of MemQ.

Acknowledgements

Building MemQ would not have been possible without the unwavering support of Dave Burgess and Chunyan Wang. Also a huge thanks to Ping-Min Lin who has been a key driver of bug fixes and performance optimizations in MemQ that enabled large scale production rollout.

Lastly thanks to Saurabh Joshi, Se Won Jang, Chen Chen, Divye Kapoor, Yiran Zhao, Shu Zhang and the Logging team for enabling MemQ rollouts.

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Video Platform Engineer
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Video is becoming the most important content format on Pinterest ecosystem. This role will act as an architect for Pinterest video platform, which responsible for the whole lifecycle of a video from uploading, transcoding, delivery and playback. The video architect will oversee Pinterest video platform strategy, owns the direction of what will be our next strategic investment to strengthen our video platform, and land the strategy into major initiatives towards the directions.

What you'll do: 

  • Lead the optimization and improvement in video codec efficiency, encoder rate control, transcode speed, video pre/post-processing and error resilience.
  • Improve end-to-end video experiences on lossy networks in various user scenarios.
  • Identify various opportunities to optimize in video codec, pipeline, error resilience.
  • Define the video optimization roadmap for both low-end and high-end network and devices.
  • Lead the definition and implementation of media processing pipeline.

What we're looking for: 

  • Experience with AWS Elemental
  • Solid knowledge in modern video codecs such as H.264, H.265, VP8/VP9 and AV1. 
  • Deep understanding of adaptive streaming technology especially HLS and MPEG-DASH.
  • Experience in architecting end to end video streaming infrastructure.
  • Experience in building media upload and transcoding pipelines.
  • Proficient in FFmpeg command line tools and libraries.
  • Familiar with popular client side media frameworks such as AVFoundation, Exoplayer, HLS.js, and etc.
  • Experience with streaming quality optimization on mobile devices.
  • Experience collaborating cross-functionally between groups with different video technologies and pipelines.

#LI-EA1

Senior Software Engineer, Data Privacy
Dublin, IE

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

The Data Privacy Engineering team builds platforms and works with engineers across Pinterest to help ensure our handling of customer and partner data meets or exceeds their expectations of privacy and security.  We’re a small, and growing, team based in Dublin.  We own three major engineering projects with company-wide impact: expanding and onboarding teams doing big data processing to a new fine-grained data access platform, tracking how data moves and evolves through our systems, and ensuring data is always handled appropriately.  As a Senior Engineer, you’ll take a driving role on one of these projects and responsibility for working with internal teams to understand their needs, designing solutions, and collaborating with teams in Dublin and the US to successfully execute on your plans.  Your work will help ensure the safety of our users’ and partners’ data and help Pinterest be a source of inspiration for millions of users.

What you’ll do:

  • Consult with engineers, product designers, and security experts to design data-handling solutions
  • Review code and designs from across the company to guide teams to secure and private solutions
  • Onboard customers onto platforms and refine our tools to streamline these processes
  • Mentor and coach engineers and grow your technical leadership skills, with engineers in Dublin and other offices.
  • Grow your engineering skills as you work with a range of open-source technologies and engineers across the company, and code across Pinterest’s stack in a variety of languages

What we’re looking for:

  • 5+ years of experience building enterprise-scale backend services in an object-oriented programing language (Java preferred)
  • Experience mentoring junior engineers and driving an engineering culture
  • The ability to drive ambiguous projects to successful outcomes independently
  • Understanding of big-data processing concepts
  • Experience with data querying and analytics techniques
  • Strong advocacy for the customer and their privacy

#LI-KL1

Software Engineer, Key Value Systems
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest brings millions of Pinners the inspiration to create a life they love for everything; whether that be tonight’s dinner, next summer’s vacation, or a dream house down the road. Our Key Value Systems team is responsible for building and owning the systems that store and serve data that powers Pinterest's business-critical applications. These applications range from user-facing features all the way to being integral components of our machine learning processing systems. The mission of the team is to provide storage and serving systems that are not only highly scalable, performant, and reliable, but also a delight to use. Our systems enable our product engineers to move fast and build awesome features rapidly on top of them.

What you’ll do

  • Build, own, and improve Pinterest's next generation key-value platform that will store petabytes of data, handle tens of millions of QPS, and serve hundreds of use cases powering almost all of Pinterest's business-critical applications
  • Contribute to open-source databases like RocksDB and Rocksplicator
  • Own, improve, and contribute to the main key-value storage platform, streaming write architectures using Kafka, and additional derivative
  • RocksDB-based distributed systems
  • Continually improve operability, scalability, efficiency, performance, and reliability of our storage solutions

What we’re looking for:

  • Deep expertise on online distributed storage and key-value stores at consumer Internet scale
  • Strong ability to work cross-functionally with product teams and with the storage SRE/DBA team
  • Fluent in C/C++ and Java
  • Good communication skills and an excellent team player

#LI-KL1

Head of Ads Delivery Engineering
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest is on a mission to help millions of people across the globe to find the inspiration to create a life they love. Within the Ads Quality team, we try to connect the dots between the aspirations of pinners and the products offered by our partners. 

You will lead an ML centric organization that is responsible for the optimization of the ads delivery funnel and Ads marketplace at Pinterest. Using your strong analytical skill sets, thorough understanding of machine learning, online auctions and experience in managing an engineering team you’ll advance the state of the art in ML and auction theory while at the same time unlock Pinterest’s monetization potential.  In short, this is a unique position, where you’ll get the freedom to work across the organization to bring together pinners and partners in this unique marketplace.

What you’ll do: 

  • Manage the ads delivery engineering organization, consisting of managers and engineers with a background in ML, backend development, economics and data science
  • Develop and execute a vision for ads marketplace and ads delivery funnel
  • Build strong XFN relationships with peers in Ads Quality, Monetization and the larger engineering organization, as well as with XFN partners in Product, Data Science, Finance and Sales

What we’re looking for:

  • MSc. or Ph.D. degree in Economics, Statistics, Computer Science or related field
  • 10+ years of relevant industry experience
  • 5+ years of management experience
  • XFN collaborator and a strong communicator
  • Hands-on experience building large-scale ML systems and/or Ads domain knowledge
  • Strong mathematical skills with knowledge of statistical models (RL, DNN)

#LI-TG1

Verified by
Security Software Engineer
Tech Lead, Big Data Platform
Software Engineer
Talent Brand Manager
Sourcer
Software Engineer
You may also like