Open Sourcing Querybook, Pinterest’s Collaborative Big Data Hub

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

An efficient big data solution for an increasingly remote-working world.

By Charlie Gu | Tech Lead, Analytics Platform, Lena Ryoo | Software Engineer, Analytics Platform, and Justin Mejorada-Pier | Engineering Manager, Analytics Platform

With more than 300 billion Pins, Pinterest is powering an ever-growing and unique dataset that maps interests, ideas, and intent. As a data-driven company, Pinterest uses data insights and analysis to make product decisions and evaluations to improve the Pinner experience for more than 450 million monthly users. To continuously make these improvements, especially in an increasingly remote environment, it’s more important than ever for teams to be able to compose queries, create analyses, and collaborate with one another. Today we’re taking Querybook, our solution for more efficient and collaborative big data access, and open sourcing it for the community.

The common starting point for any analysis at Pinterest is an ad-hoc query that gets executed on a SparkSQL, Hive, or Presto cluster, or on any SQLAlchemy-compatible engine. We built Querybook to provide a responsive and simple web UI for such analysis so data scientists, product managers, and engineers can discover the right data, compose their queries, and share their findings. In this post, we’ll discuss the motivation to build Querybook, its features, architecture, and our work to open source the project.

The Journey

The proposal to build Querybook started in 2017 as an intern project. At that time, we used a vendor-supplied web application as the query UI. Users often complained about that tool’s UI, speed and stability, lack of visualizations, and difficulty in sharing. Before long, we realized there was a great opportunity to build a better querying interface.

We started to interview data scientists and engineers about their workflows while scoping out technical details. We soon realized that most were organizing their queries outside of the official tool, and many used apps like Evernote. Although Jupyter had a notebook user experience, its requirement to use Python/R and its lack of table metadata integration deterred many users. Based on this finding, our team decided Querybook’s query interface would be a document where users can compose queries and write analyses all in one place, with the power of collocated metadata and the simplicity of a note-taking app.

Released internally in March 2018, Querybook became the official solution to query big data at Pinterest. Today, Querybook averages 500 daily active users and 7k query runs per day. With an internal user rating of 8.1/10, it’s one of the highest-rated internal tools at Pinterest.

Feature Highlights

Figure 1. Querybook’s Doc UI

When users first visit Querybook, they’ll quickly notice its distinctive DataDoc interface. This is the primary place for users to query and analyze. Each DataDoc is composed of a list of cells, each of which can be one of three types: text, query, or chart.

  • The text cell comes with built-in rich-text support for users to jot down their ideas or insights.
  • The query cell is used to compose and execute queries.
  • The chart cell is used to create visualizations based on execution results.

Similar to Google Docs, when users are granted access to a DataDoc, they can collaborate with each other in real time.

With the intuitive charting UI, users can easily turn a DataDoc into an illustrative dashboard. You can choose from different visualization options, such as time series, pie charts, scatter plots, and more. You can then connect your visualization to the results of any query on your DataDoc and post-process them with sorting and aggregation as needed. To automatically update these charts, you can use the scheduling options and select your desired cadence. The scheduler can notify users of success or failure. Combined with the templating option powered by Jinja, creating a live-updating DataDoc is very quick.

Scheduling and visualization features aren’t intended to replace tools such as Airflow or Superset. Rather, these features provide users a simple and fast way to experiment with their queries and iterate on them. Often, Pinterest engineers use Querybook as the first step to compose queries before creating production-level workflows and dashboards.

Last but not least, Querybook comes with an automated query analytics system. Every query executed gets analyzed to extract metadata such as referenced tables and query runners. Querybook uses this information to automatically update its data schema and search ranking, as well as to show a table’s frequent users and query examples. The more queries, the more documented the tables become.
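To make the idea concrete, here is a minimal sketch of how referenced tables and frequent users could be derived from a query log. It is not Querybook’s actual implementation: the regex, the function names, and the sample log are all assumptions made for illustration, and a real system would more likely rely on a proper SQL parser.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Matches the identifier that follows FROM or JOIN, e.g. "FROM db.table".
# Illustration only: this ignores subqueries, CTEs, and quoted identifiers.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def extract_tables(query: str) -> List[str]:
    """Return the distinct table names referenced in a query, in order."""
    seen: List[str] = []
    for name in TABLE_PATTERN.findall(query):
        name = name.lower()
        if name not in seen:
            seen.append(name)
    return seen


def frequent_users(query_log: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Map each table to a counter of the users who queried it."""
    usage: Dict[str, Counter] = {}
    for user, query in query_log:
        for table in extract_tables(query):
            usage.setdefault(table, Counter())[user] += 1
    return usage


# Hypothetical query log of (user, query) pairs.
log = [
    ("alice", "SELECT * FROM analytics.pins p JOIN analytics.boards b ON p.board_id = b.id"),
    ("bob", "select count(*) from analytics.pins"),
    ("alice", "SELECT id FROM analytics.pins WHERE created_at > '2021-01-01'"),
]
usage = frequent_users(log)
print(usage["analytics.pins"].most_common())  # [('alice', 2), ('bob', 1)]
```

Every executed query adds to the counters, which is one way to picture how tables become better documented as usage grows.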

Architecture

Figure 2. Overview of Querybook’s architecture

To understand how Querybook works, we’ll walk through the process of composing and executing a query.

  1. The first step is to create a DataDoc and write the query in a cell. While the user types, the user’s query gets streamed to the server via Socket.IO.
  2. The server then pushes the delta to all users reading that DataDoc via Redis. At the same time, the server saves the updated DataDoc in the database and creates an async job for the worker to update the DataDoc content in Elasticsearch. This allows the DataDoc to be searched later.
  3. Once the query is written, the user can execute the query by clicking the run button. The server then creates a record in the database and inserts a query job into the Redis task queue. The worker receives the task and sends the query to the query engine (Presto, Hive, SparkSQL, or any SQLAlchemy-compatible engine). While the query is running, the worker pushes live updates to the UI via Socket.IO.
  4. When the execution is completed, the worker loads the query result and uploads it in batches to a configurable storage service (e.g. S3). Finally, the browser gets notified of the query completion and makes a request to the server to load the query result and display it to the user.
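The execution part of this flow (steps 3 and 4) can be sketched in a few lines. This is a simplified model rather than Querybook’s code: an in-process `queue.Queue` stands in for the Redis task queue, a dictionary stands in for the storage service, a callback stands in for the Socket.IO status updates, and `fake_engine` is a hypothetical query engine.

```python
import queue
import threading
from typing import Callable, Dict, Iterator, List

# Stand-ins (assumptions for this sketch): an in-process queue replaces the
# Redis task queue, and a dict replaces a storage service such as S3.
task_queue: "queue.Queue" = queue.Queue()
result_store: Dict[int, List[List[tuple]]] = {}
BATCH_SIZE = 2


def fake_engine(query: str) -> Iterator[tuple]:
    """Hypothetical query engine that yields five result rows."""
    for i in range(5):
        yield (i, query)


def upload_batch(execution_id: int, batch: List[tuple]) -> None:
    """Append one batch of rows to the stored result of an execution."""
    result_store.setdefault(execution_id, []).append(list(batch))


def worker(notify: Callable[[int, str], None]) -> None:
    """Pull jobs off the queue, run them, and upload results in batches."""
    while True:
        job = task_queue.get()
        if job is None:  # sentinel value: stop the worker
            break
        execution_id, query = job
        notify(execution_id, "RUNNING")
        batch: List[tuple] = []
        for row in fake_engine(query):
            batch.append(row)
            if len(batch) == BATCH_SIZE:
                upload_batch(execution_id, batch)
                batch = []
        if batch:  # upload the final, partially filled batch
            upload_batch(execution_id, batch)
        notify(execution_id, "DONE")


# The "server" side: record status events and enqueue one query job.
events: List[tuple] = []
thread = threading.Thread(
    target=worker, args=(lambda eid, status: events.append((eid, status)),)
)
thread.start()
task_queue.put((1, "SELECT 1"))
task_queue.put(None)
thread.join()

print(events)                              # [(1, 'RUNNING'), (1, 'DONE')]
print([len(b) for b in result_store[1]])   # [2, 2, 1]
```

Decoupling the server from the worker through a queue keeps the web request short: the server only records the job, and the long-running work happens elsewhere.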

For brevity, this section focuses on only one user flow of Querybook. However, it covers all of the infrastructure used. Querybook allows some of it to be customized. For example, you can choose to upload execution results to S3, Google Cloud Storage, or a local file. In addition, MySQL can be swapped with any SQLAlchemy-compatible database such as Postgres.
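One common way to make a storage layer swappable is to hide it behind a small interface. The sketch below illustrates that pattern under our own assumptions: the class names, the `STORES` registry, and the CSV layout are hypothetical and do not reflect Querybook’s actual classes. Only a local-file backend is implemented; S3 or Google Cloud Storage backends would implement the same two methods.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Type


class ResultStore(ABC):
    """Hypothetical interface that every storage backend implements."""

    @abstractmethod
    def upload(self, key: str, rows: Iterable[List[str]]) -> None:
        """Append a batch of rows to the result stored under key."""

    @abstractmethod
    def read(self, key: str) -> List[List[str]]:
        """Return all rows stored under key."""


class LocalFileStore(ResultStore):
    """Backend that keeps each result as a CSV file in a directory."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _path(self, key: str) -> str:
        return os.path.join(self.root, f"{key}.csv")

    def upload(self, key: str, rows: Iterable[List[str]]) -> None:
        # Open in append mode so later batches extend the same file.
        with open(self._path(key), "a", newline="") as f:
            csv.writer(f).writerows(rows)

    def read(self, key: str) -> List[List[str]]:
        with open(self._path(key), newline="") as f:
            return list(csv.reader(f))


# Registry keyed by a config value; other backends would be added here.
STORES: Dict[str, Type[ResultStore]] = {"file": LocalFileStore}


def make_store(kind: str, **kwargs) -> ResultStore:
    """Build the backend selected by configuration."""
    return STORES[kind](**kwargs)


store = make_store("file", root=tempfile.mkdtemp())
store.upload("execution_1", [["id", "name"], ["1", "pin"]])
store.upload("execution_1", [["2", "board"]])  # second batch appends
rows = store.read("execution_1")
print(rows)  # [['id', 'name'], ['1', 'pin'], ['2', 'board']]
```

Because callers only depend on `upload` and `read`, switching backends becomes a configuration change rather than a code change.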

The Path To Open Source

After noticing the success that Querybook had internally, we decided to open source it. One challenge we bumped into was how to make it generic while preserving some of the Pinterest-specific integrations. For this, we decided to have a two-layer organization through a plugin system and to add an Admin UI.

The Admin UI lets companies configure Querybook’s query engines, table metadata ingestion, and access permissions from a single friendly interface. Previously, these configurations were done inside configuration files and required a code change as well as a deployment to be reflected. With this new UI, admins can make live Querybook changes without going through code or config files.

Figure 3. The Admin UI

The plugin system integrates Querybook with the internal systems at Pinterest by utilizing Python’s importlib. With the plugin system, developers can configure auth, customize query engines, and implement exporters to internal sites. Customized behaviors provided by the plugin system allow Querybook to be optimized for users’ workflows at Pinterest while ensuring the open-source version remains generic for the public.
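The general pattern behind an importlib-based plugin system is to try importing an optional module and fall back to defaults when it is absent. The following is a minimal sketch of that pattern only; the module name `hypothetical_exporter_plugin`, the `EXPORTERS` attribute, and the helper function are invented for illustration and are not Querybook’s real plugin names.

```python
import importlib
import sys
import types
from typing import Any

# Defaults shipped with the generic, open-source build (hypothetical values).
DEFAULT_EXPORTERS = ["csv_download"]


def load_plugin_attr(module_name: str, attr: str, default: Any) -> Any:
    """Return attr from an optional plugin module, or default if missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # No plugin installed: keep the generic behavior.
        return default
    return getattr(module, attr, default)


# 1. Without a plugin on the import path, the defaults are used.
without_plugin = load_plugin_attr(
    "hypothetical_exporter_plugin", "EXPORTERS", DEFAULT_EXPORTERS
)
print(without_plugin)  # ['csv_download']

# 2. Simulate a company installing its own plugin module. A real deployment
#    would put an actual package on the Python path instead.
plugin = types.ModuleType("hypothetical_exporter_plugin")
plugin.EXPORTERS = ["csv_download", "internal_wiki"]
sys.modules["hypothetical_exporter_plugin"] = plugin

with_plugin = load_plugin_attr(
    "hypothetical_exporter_plugin", "EXPORTERS", DEFAULT_EXPORTERS
)
print(with_plugin)  # ['csv_download', 'internal_wiki']
```

With this arrangement the core code never imports company-specific modules directly, so the same codebase runs with or without the plugin package installed.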

You can check out more of Querybook’s features and its documentation on Querybook.org, and you can reach us at querybook@pinterest.com.

Acknowledgments: We want to thank the following engineers who have made contributions to Querybook: Lauren Mitchell, Langston Dziko, Mohak Nahta, and Franklin Shiao. We also thank Chunyan Wang, Dave Burgess, and David Chaiken for their critical advice and support.
