Pinterest Visual Signals Infrastructure: Evolution from Lambda to Kappa Architecture

Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Ankit Patel | Software engineer, Content Acquisition and Media Platform

With the growing need for machine learning signals from Pinterest’s huge visual dataset, we decided to take a closer look at our infrastructure that produces and serves these signals. A few parameters we were particularly interested in were signal availability, infra complexity and cost optimization, tech integration, developer velocity, and monitoring. In this post, we will describe our journey from a Lambda architecture to the new real-time signals infrastructure inspired by Kappa architecture.


In order to understand the existing visual signals infrastructure, we need to understand some of the basic content processing systems at Pinterest. Pinterest’s Content Acquisition and Media Platform, formerly known as Video and Image Platform (VIP), is responsible for ingesting, processing, and serving all of Pinterest’s content on every surface of the application. We ingest media at a massive scale every single day. This post will not go into details about the ingestion and serving part, and it will mostly focus on the processing part, as that is where most of the magic happens.

Media is ingested through 50 different pipelines. Pipelines are namespaces in VIP systems, e.g. Pinner uploaded content, crawled images, shopping images, video keyframes, user profile images, etc. Each pipeline maps to custom media processing configurations tailored for the use case it serves.

When we started building our visual signals processing infrastructure, we reused this existing namespace philosophy and partitioned our visual signals around pipelines as well. Pinterest’s homegrown signal processing and serving tech stack is called Galaxy. Namespaces keyed by different entity IDs in Galaxy are called Joins or Galaxies. Each VIP pipeline is mapped to its equivalent Galaxy. Sometimes we combine multiple closely related pipelines into a single Galaxy.

Lambda Architecture

Until recently, we used the Lambda architecture illustrated below to compute visual signals from our media content.

As you can see in the diagram above, there are 2 modes to this architecture: online and offline. Let’s discuss the offline architecture first.

  1. A nightly workflow (a Hadoop-based MapReduce job) runs on a schedule. It calculates the delta of unique image signatures for that day and creates a new partition.
  2. We then create batches of image signatures and enqueue PinLater jobs for processing these batches.
  3. We spin up a GPU cluster with the ML models needed for the signal computation and then start processing the PinLater jobs. This is an expensive cluster.
  4. We download the images from Amazon S3, run model inference on them, and store the output in S3. We then spin down the GPU cluster to optimize EC2 spend.
  5. The signal is generated in a delimited bytes format with Protobuf encoded values.
  6. We kick off a workflow that transforms this output into the equivalent Apache Thrift encoding, because Pinterest relies heavily on Thrift as its wire and storage format for data.
  7. Thrift output is stored in a Parquet columnar Hive table. Downstream batch clients consume this output.
  8. To support real-time RPC clients, a final workflow uploads the Hive partition to Rockstore, a key-value database.
  9. Galaxy Signal Service provides an RPC API to look up these signals from Rockstore for our online clients.
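
Steps 1 and 2 of the offline flow can be sketched roughly as follows. This is a minimal illustration rather than the actual workflow code: the function names, the in-memory set of already-processed signatures, and the batch size are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Set


def delta_signatures(todays: Iterable[str], already_processed: Set[str]) -> List[str]:
    """Return today's unique image signatures that have no signal yet, in first-seen order."""
    seen: Set[str] = set()
    delta: List[str] = []
    for signature in todays:
        if signature in already_processed or signature in seen:
            continue  # skip signatures handled on earlier days or repeated today
        seen.add(signature)
        delta.append(signature)
    return delta


def make_batches(signatures: List[str], batch_size: int) -> Iterator[List[str]]:
    """Split the delta into fixed-size batches; each batch would become one queued job."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(signatures), batch_size):
        yield signatures[start:start + batch_size]


if __name__ == "__main__":
    processed = {"sig_a", "sig_b"}  # hypothetical signatures from earlier partitions
    today = ["sig_a", "sig_c", "sig_d", "sig_c", "sig_e"]
    for batch in make_batches(delta_signatures(today, processed), batch_size=2):
        print(batch)
```

Because the whole delta is computed once per night, a failure anywhere in the chain delays every signature in that partition, which is one reason the concerns listed next arise.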

As you can see, there are a lot of moving pieces in the above design. Even though these processes matured over time and became more stable, VIP team engineers faced frequent issues while on call. Some of the biggest concerns were:

  • Workflows depending on other workflows
  • Consumers have to wait 24 hours for the signals to be available
  • Application logic issues are extremely hard to debug
  • Granular retries don’t exist; it’s all or nothing
  • Delays due to the nature of these systems

As we continued to build additional features powered by these machine learning signals, producing the signals faster became a priority across the company. That is where the online mode comes into the picture:

  1. We create a PinLater job for each image signature in real time.
  2. In this job, we call the machine learning model inference service running on a GPU cluster to calculate the signal, and store the result directly in the key-value Rockstore database. Galaxy Signal Service (GSS) then serves this signal.
  3. We then publish an event through Kafka to notify the downstream consumers about the signal availability.
  4. Downstream clients consume this event and fetch the signal from GSS in real time.
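
The four online steps can be sketched with in-memory stand-ins. Everything named here is an assumption for illustration: `SignalStore` stands in for the key-value store, a `deque` stands in for the Kafka topic, and `fake_infer` stands in for the GPU inference service. None of these are the real PinLater, Rockstore, or GSS APIs.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Signal = List[float]


class SignalStore:
    """In-memory stand-in for the key-value store behind the lookup service."""

    def __init__(self) -> None:
        self._data: Dict[str, Signal] = {}

    def put(self, signature: str, signal: Signal) -> None:
        self._data[signature] = signal

    def get(self, signature: str) -> Signal:
        return self._data[signature]


def process_signature(signature: str, infer: Callable[[str], Signal],
                      store: SignalStore, events: Deque[str]) -> None:
    """One job per image signature: infer, store, then announce availability."""
    signal = infer(signature)      # step 2: call the inference service
    store.put(signature, signal)   # step 2: persist before announcing
    events.append(signature)       # step 3: publish the availability event


def consume_events(events: Deque[str], store: SignalStore) -> Dict[str, Signal]:
    """Step 4: a downstream client reads each event and fetches the signal."""
    fetched: Dict[str, Signal] = {}
    while events:
        signature = events.popleft()
        fetched[signature] = store.get(signature)
    return fetched


def fake_infer(signature: str) -> Signal:
    """Placeholder model: returns the signature length as a one-element vector."""
    return [float(len(signature))]
```

Storing the signal before publishing the event matters in this sketch: a consumer that reacts to the event immediately will always find the value it was told about. Each signature is also its own unit of work, so a failed job can be retried without touching the others.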

VIP wasn’t the only team at Pinterest attracted to this new paradigm of signal processing. Other teams were also excited about having their signals calculated in real time. It provided more visibility into signal generation than the black box of Hadoop-based workflows. This approach came with multiple benefits:

  • Better developer experience
  • Easier to debug, test, and deploy changes
  • Granular retries
  • Low latency signals

However, it came with its share of cons as well. The main ones were:

  • Complex operations, like group bys and signal joins, are not possible with a simple event-driven processing framework like PinLater. We needed a robust stream processing framework like Apache Flink.
  • The extra cost of running a duplicate GPU cluster that processes the same pins as the batch pipeline on a continuous basis.
  • There was no shared infrastructure to address common needs.
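
To see why joins are awkward in a job-per-event model, consider what a signal join needs: state that persists across events and is keyed by the entity ID. The sketch below shows that idea in plain Python. It is not Flink code, and the signal names used in the usage example ("embedding", "category") are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, str, object]  # (signal_name, image_signature, value)


def join_signals(events: Iterable[Event], left: str,
                 right: str) -> Iterator[Tuple[str, object, object]]:
    """Emit (signature, left_value, right_value) once both signals have arrived for a key."""
    pending: Dict[str, Dict[str, object]] = {}  # keyed state: signature -> signals seen so far
    for name, signature, value in events:
        if name not in (left, right):
            continue  # ignore signals that are not part of this join
        seen = pending.setdefault(signature, {})
        seen[name] = value
        if left in seen and right in seen:
            yield signature, seen[left], seen[right]
            del pending[signature]  # release state once the pair is emitted


if __name__ == "__main__":
    stream = [
        ("embedding", "s1", [0.1]),
        ("category", "s2", "food"),
        ("category", "s1", "travel"),
        ("embedding", "s2", [0.2]),
    ]
    for row in join_signals(stream, left="embedding", right="category"):
        print(row)
```

An independent job that sees only one event has nowhere to keep `pending`. A stream processor manages this kind of keyed state, along with its fault tolerance, on the developer's behalf.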

Kappa Architecture

Finally, the Signal Platform team at Pinterest saw an opportunity to address these concerns for all signal developers and build the next version of the signal development infrastructure, called “Near-real-time Galaxy,” or simply NRTG. The technologies of choice were Apache Kafka and Apache Flink. Flink seemed specifically designed to address the concerns mentioned above. Given that the Signal Platform team already had a framework for building signals on batch technology using Galaxy Dataflow APIs, extending it to also work on stream technology was arguably the best way forward, avoiding massive amounts of refactoring, rewriting, or reinventing.

The VIP team decided to be among the early adopters of this initiative. Our media signals are some of the most upstream in the whole tree of signals at Pinterest, so they were naturally the easiest to onboard onto a platform that was still being built. We scoped out the signals we wanted to experiment with and got started on this mission.

The gist of it is very simple: you write a simple job that computes the signal in streaming mode on Apache Flink. NRTG makes this process easy and quick by leveraging standard design patterns and annotations.

Based on the annotations, the NRTG framework does most of the heavy lifting, hidden away from the signal developers. The configs and the mapping to underlying native Flink are managed by this middleware layer. This makes signal development extremely fast, because developers are already familiar with these annotations and do not need to learn Flink in detail. The Xenon (Flink) platform team at Pinterest provides the infrastructure capabilities to deploy and maintain Flink applications.
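
The post does not show NRTG's actual annotations, so the following is only a hypothetical sketch of the general pattern: a declarative marker records the metadata a middleware layer would need, and the developer writes just the computation. The `signal` decorator, its parameters, the registry, and the toy `image_brightness` signal are all invented for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry that a framework could read to generate the streaming job.
SIGNAL_REGISTRY: Dict[str, Dict[str, Any]] = {}


def signal(name: str, key: str, source_topic: str) -> Callable:
    """Hypothetical annotation: records what a framework would need to wire up a job."""
    def register(fn: Callable) -> Callable:
        SIGNAL_REGISTRY[name] = {"key": key, "source_topic": source_topic, "compute": fn}
        return fn
    return register


@signal(name="image_brightness", key="image_signature", source_topic="image_ingested")
def image_brightness(pixels: List[int]) -> float:
    """Toy signal: mean pixel intensity."""
    return sum(pixels) / len(pixels)


def run_signal(name: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """What a middleware layer might do per record: look up the signal and apply it."""
    spec = SIGNAL_REGISTRY[name]
    return {spec["key"]: record[spec["key"]], name: spec["compute"](record["payload"])}
```

The design point is the separation of concerns: the signal author declares a name, a key, and a source, while the runtime mapping (here the trivial `run_signal`) stays in the framework and can change without touching signal code.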

Once we had onboarded to NRTG, we wanted to turn down our existing batch workflow setup. However, a number of consumers read these signals in batch, and we had to provide a solution that would work for them from our streaming pipelines. To simulate a daily Hive table for our signals, we wrote a simplified workflow that takes a daily dump of our KVStore and transforms it into the existing Hive output. No GPU computation is required to recompute the signal values: the data is computed in streaming, moved to S3 via a daily dump, and filtered into the correct format. This not only saved GPU cost but also trimmed a design of complex chained workflows down to a simple data transformation job. With Flink SQL in active development, we expect to eventually migrate the offline portion from Spark/Hadoop to Flink entirely.
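
The daily transformation can be pictured as a pure filter-and-reshape step with no model inference involved. The sketch below makes several assumptions that the post does not state: that each dumped record carries a `key`, a `value`, and an `updated_at` epoch timestamp, that the daily partition keeps only records written on that day, and that the output uses a `dt` partition column.

```python
from datetime import date, datetime, timezone
from typing import Any, Dict, Iterable, List


def dump_to_partition(dump: Iterable[Dict[str, Any]], day: date) -> List[Dict[str, Any]]:
    """Keep records whose signal was written on `day` (UTC) and reshape them as table rows."""
    rows: List[Dict[str, Any]] = []
    for record in dump:
        written = datetime.fromtimestamp(record["updated_at"], tz=timezone.utc).date()
        if written != day:
            continue  # belongs to another day's partition
        rows.append({
            "image_signature": record["key"],
            "signal": record["value"],  # already computed in streaming; no GPU work here
            "dt": day.isoformat(),
        })
    return rows
```

Since the function only reads values that already exist, it can be rerun for any day without touching the GPU cluster.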


Migrating to this new fast-signals infrastructure is the beginning of a great future for signal generation at Pinterest. It allows signal developers to build signals quickly with a much gentler learning curve, and the underlying Flink capabilities also support advanced signal designs. Batch backfill support is still a work in progress in NRTG, and signal producers need to adapt their outputs to avoid disrupting their consumers, but the benefits still outweigh the costs of duplication in the existing Lambda infrastructure. The NRTG team already has end-to-end support on its roadmap, with Hive integration provided as part of the framework. Bringing the end-to-end lifecycle of a signal under one platform would greatly benefit innovation and the productizing of ideas across teams at Pinterest. The migration has reduced our infra complexity, and we are able to optimize costs on GPU and other compute resources. We expect other teams at Pinterest to follow in the same footsteps and boost their developer velocity by moving to the simpler and more robust architecture outlined in this post.

This project is a joint effort across multiple teams at Pinterest: Video & Image Platform (VIP), Near-real-time Galaxy (NRTG), Xenon, Hermes, Rockstore, and Visual Search.
