How Pinterest Fights Spam Using Machine Learning

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Vishwakarma Singh | Trust and Safety Machine Learning Lead


Hundreds of millions of people regularly visit Pinterest to visually discover inspiring ideas among billions of Pins. Inspiration is a high bar and we must be vigilant in ensuring that Pinners don’t see spam, harmful content or misinformation. To enforce our community policies and maintain an inspiring environment, we use the latest in machine learning technology to build automated systems that swiftly detect and act against both spammy content and spammers.

Our anti-spam system combines reactive and proactive components to counter adversarial abusers, that is, users who intentionally try to evade the system. The proactive system consists of sophisticated machine learning models, whereas the reactive system includes both rules executed in a real-time rules engine and lightweight machine learning models. We iterate on these models at regular intervals, adding new data and exploring new techniques, to maintain or improve their performance over time.

One tactic malicious actors use is misusing a Pin’s image and linking it to a malicious external website. Our models detect spam vectors, like Pin links, as well as users engaging in spammy behaviors. We quickly limit distribution of Pins with spam links and take direct action against users identified with high confidence to be engaging in spammy behavior. To limit false positives, we manually review users identified with low confidence. We notify users of our actions to maintain transparency and give them the option to appeal our decision.
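The split between direct action and manual review can be pictured as a simple routing step on a model's score. The following is a minimal sketch, not Pinterest's implementation: the `Action` names, the `route_user` function, and both thresholds are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Action(Enum):
    ENFORCE = "enforce"        # act directly and notify the user
    MANUAL_REVIEW = "review"   # queue for a human reviewer
    NO_ACTION = "no_action"    # leave the user alone


# Hypothetical thresholds; real values would be tuned against review outcomes.
HIGH_CONFIDENCE = 0.95
REVIEW_FLOOR = 0.60


def route_user(spam_score: float) -> Action:
    """Map a spam score in [0, 1] to an enforcement path."""
    if not 0.0 <= spam_score <= 1.0:
        raise ValueError("spam_score must be in [0, 1]")
    if spam_score >= HIGH_CONFIDENCE:
        return Action.ENFORCE
    if spam_score >= REVIEW_FLOOR:
        return Action.MANUAL_REVIEW
    return Action.NO_ACTION
```

Keeping a review band below the enforcement threshold trades reviewer time for fewer false positives, which matches the goal stated above.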

Machine Learning Models

Spam Domain Model

We proactively identify spam Pin links using a Deep Neural Network classifier (shown in Figure 1). To maximize impact, our model learns to classify a domain as spam rather than an individual link, and we apply the same enforcement to all Pins with links belonging to the same domain. This model is trained interactively on manually labeled domains to achieve a higher recall and lower false positive rate. We use features created from links, web page text and media, user-domain interactions, and user behavior as inputs. For each domain, we sample links and webpages to create features. We split links into semantic tokens and use only frequent tokens as features. We analyze outlying patterns in user actions over time to create behavioral features. Batch inference for this model runs periodically at scale in a PySpark job using TensorFlow, Spark SQL, and a UDF.

Figure 1. Deep Neural Network for domain classification
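To make the link-token features concrete, here is a minimal sketch of splitting sampled links into tokens and keeping only the frequent ones. It is an illustration under stated assumptions: the regular-expression split is a simple stand-in for the semantic tokenization described above, and the function names and the `min_count` cutoff are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def tokenize_link(link: str) -> list[str]:
    """Split a URL's path and query into lowercase alphanumeric tokens."""
    parts = urlsplit(link)
    text = f"{parts.path} {parts.query}".lower()
    return re.findall(r"[a-z0-9]+", text)


def frequent_tokens(links: list[str], min_count: int = 2) -> set[str]:
    """Return tokens that appear in at least `min_count` of the links."""
    counts: Counter = Counter()
    for link in links:
        counts.update(set(tokenize_link(link)))  # count each token once per link
    return {token for token, n in counts.items() if n >= min_count}


def link_features(link: str, vocabulary: set[str]) -> dict[str, int]:
    """Bag-of-tokens counts restricted to the frequent-token vocabulary."""
    return dict(Counter(t for t in tokenize_link(link) if t in vocabulary))


sampled_links = [
    "http://example.com/win/free-prize?id=1",
    "http://example.com/free/prize/claim",
    "http://example.com/about",
]
vocabulary = frequent_tokens(sampled_links)
```

Restricting features to frequent tokens keeps the input vocabulary small and drops one-off strings such as random identifiers.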

Spam User Model

Identifying users engaging in spam activities is the ultimate solution for fighting spam, but it is extremely hard to achieve. We leverage both supervised and unsupervised models to build an effective spam user identification system.

Classification Model

Our spam user classification model is a Deep Neural Network (shown in Figure 2) and is part of our proactive system. It is trained using synthetically labeled data generated with minimal human supervision to ensure quality. We use features created from user attributes and their past behaviors as inputs. We also use user-domain interaction as an input, summarized for each user as a distribution of domain scores, where the domain scores are reused from the spam domain model. Batch inference for this model runs periodically in a PySpark job using TensorFlow, Spark SQL, and a UDF to score millions of Pinners.

Figure 2. Deep Neural Network for user classification
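One way to summarize a user's domain scores as a fixed-length input is a normalized histogram. The sketch below assumes scores lie in [0, 1] and uses equal-width bins; the function name and the bin count are hypothetical, and the source does not say which summary Pinterest actually uses.

```python
def domain_score_histogram(scores: list[float], num_bins: int = 5) -> list[float]:
    """Summarize a user's domain spam scores as a normalized histogram.

    Returns `num_bins` fractions that sum to 1, or all zeros when the
    user has no scored domains.
    """
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    counts = [0] * num_bins
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be in [0, 1]")
        # A score of exactly 1.0 falls into the last bin.
        index = min(int(score * num_bins), num_bins - 1)
        counts[index] += 1
    if not scores:
        return [0.0] * num_bins
    return [count / len(scores) for count in counts]
```

For example, a user whose Pins link to two low-scoring and two high-scoring domains gets half of the mass in the first bin and half in the last, whatever the number of domains involved.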

Clustering

We have developed lightweight clustering models for early detection of suspicious users and bots. This technique also addresses a gap in our classification models, which are unaware of emerging patterns unless retrained with fresh labeled data. We cluster users on attributes that isolate suspicious groups with high accuracy. Experts identify these attributes by exploring the behavior of suspicious users and the resources they use to create spammy content. This model is implemented using PySpark and Spark SQL and runs daily.
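A very simple form of this idea is to group users who share the same values for a few expert-chosen attributes and flag unusually large groups. The sketch below is plain Python rather than PySpark, and exact-match grouping is only a stand-in for the clustering described above. The attribute names (`signup_ip_block`, `user_agent`) and the `min_size` cutoff are hypothetical.

```python
from collections import defaultdict


def cluster_by_attributes(users, keys, min_size=3):
    """Group users by shared attribute values and keep the large groups.

    `users` is a list of dicts that each contain "user_id" plus the
    attributes named in `keys`. Returns {signature: [user_id, ...]}.
    """
    clusters = defaultdict(list)
    for user in users:
        signature = tuple(user.get(key) for key in keys)
        clusters[signature].append(user["user_id"])
    return {sig: ids for sig, ids in clusters.items() if len(ids) >= min_size}


users = [
    {"user_id": 1, "signup_ip_block": "10.0.0", "user_agent": "bot/1.0"},
    {"user_id": 2, "signup_ip_block": "10.0.0", "user_agent": "bot/1.0"},
    {"user_id": 3, "signup_ip_block": "10.0.0", "user_agent": "bot/1.0"},
    {"user_id": 4, "signup_ip_block": "192.168.1", "user_agent": "browser/9"},
]
suspicious = cluster_by_attributes(users, ["signup_ip_block", "user_agent"])
```

Because it needs no labels, a grouping like this can surface a new pattern before a classifier has been retrained on it.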

Spam User-Domain Model

Interactions of users with domains are explicitly captured by a heterogeneous bipartite graph, as shown in Figure 3. We represent users and domains as nodes in the graph and create an edge between a user and a domain if the user has created or saved a Pin with the domain’s link. This graph facilitates simultaneous identification of spam users and domains using semi-supervised learning. We use a small set of labeled users and domains to run a label propagation algorithm and learn scores for the unlabeled users and domains. We implement this iterative algorithm in Spark and run it periodically.

Figure 3. Bipartite graph of users and domains for label propagation
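The label propagation step can be sketched in a few lines of plain Python. This is one common variant, not necessarily the one Pinterest runs in Spark: labeled nodes are clamped to their seed scores, and each unlabeled node repeatedly takes the mean score of its neighbors. The function name, the 0.5 starting score for unlabeled nodes, and the fixed iteration count are assumptions for illustration.

```python
from collections import defaultdict


def propagate_labels(edges, seed_scores, iterations=20):
    """Propagate spam scores over a user-domain bipartite graph.

    `edges` is a list of (user_id, domain) pairs. `seed_scores` maps
    labeled nodes, keyed as ("user", id) or ("domain", name), to a
    score in [0, 1] where 1 means spam. Returns a score for every node.
    """
    neighbors = defaultdict(set)
    for user_id, domain in edges:
        user_node, domain_node = ("user", user_id), ("domain", domain)
        neighbors[user_node].add(domain_node)
        neighbors[domain_node].add(user_node)

    # Unlabeled nodes start at 0.5, an uninformative prior (assumption).
    scores = {node: seed_scores.get(node, 0.5) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seed_scores:
                updated[node] = seed_scores[node]  # clamp labeled nodes
            else:
                updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
        scores = updated
    return scores


edges = [("u1", "d1"), ("u2", "d1"), ("u2", "d2"), ("u3", "d2")]
seeds = {("user", "u1"): 1.0, ("user", "u3"): 0.0}
scores = propagate_labels(edges, seeds)
```

In this toy graph, domain `d1` is linked to the known spammer `u1` and ends up with a higher score than `d2`, which is linked to the known-good user `u3`.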

Measurement

We measure spam prevalence on Pinterest by counting impressions of Pins that either have spam links or were created by users engaging in spammy activities. We periodically sample and manually review both impressed Pins and users. We scaled our measurement by first sampling and reviewing highly impressed head domains and then extending coverage to tail domains over time. These samples are used both to measure overall spam prevalence and to train our machine learning models.
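If head and tail domains are sampled at different rates, one reasonable way to combine the reviews is an impression-weighted (stratified) estimate. The sketch below assumes that setup. The `Stratum` fields, the function name, and the example numbers are hypothetical, and the source does not state which estimator is actually used.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    total_impressions: int  # all impressions in this stratum for the period
    reviewed: int           # sampled impressions that were manually reviewed
    spam: int               # reviewed impressions labeled as spam


def spam_prevalence(strata: list[Stratum]) -> float:
    """Estimate the share of impressions that are spam across all strata."""
    total = sum(s.total_impressions for s in strata)
    if total == 0:
        raise ValueError("no impressions to measure")
    weighted_spam = 0.0
    for s in strata:
        if s.reviewed == 0:
            raise ValueError(f"stratum {s.name!r} has no reviewed samples")
        # Scale the sampled spam rate up to the stratum's impression volume.
        weighted_spam += s.total_impressions * (s.spam / s.reviewed)
    return weighted_spam / total


strata = [
    Stratum("head", total_impressions=900, reviewed=100, spam=1),
    Stratum("tail", total_impressions=100, reviewed=50, spam=5),
]
prevalence = spam_prevalence(strata)
```

Weighting by impressions keeps a heavily sampled but low-traffic tail from dominating the overall number.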

Conclusion

Pinterest’s mission is to bring everyone the inspiration to create a life they love. We strive to protect our Pinners’ experiences by swiftly and appropriately acting against malicious users and spam content identified by our array of machine learning models. We plan to keep investing in our community guidelines and technology to address emerging challenges and bring the best experience to our millions of valued users.

Acknowledgements

Thanks to Yuanfang Song, Omkar Panhalkar, Rundong Liu, Qinglong Zeng, Attila Dobi, Abhijit Mahabal, Alok Singhal, Maisy Samuelson, and the rest of the Trust and Safety team for their contributions in developing machine learning models for spam! Thanks to Harry Shamansky for helping with the publication of the blog post!
