Pinterest Druid Holiday Load Testing

535
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Isabel Tallam | Senior Software Engineer; Jian Wang | Senior Software Engineer; Jiaqi Gu| Senior Software Engineer; Yi Yang | Senior Software Engineer; and Kapil Bajaj | Engineering Manager, Real-time Analytics team


Like many companies, Pinterest sees an increase in traffic in the last three months of the year. We need to make sure our systems are ready for this increase in traffic so we don’t run into any unexpected problems. This is especially important as Pinners come to Pinterest at this time for holiday planning and shopping. Therefore, we do a yearly exercise of testing our systems with additional load. During this time, we verify that our systems are able to handle the expected traffic increase. On Druid we look at several checks to verify:

  • Queries: We make sure the service is able to handle the expected increase in QPS while at the same time supporting the P99 Latency SLA our clients need.
  • Ingestion: We verify that the real-time ingestion is able to handle the increase in data.
  • Increase in Data size: We confirm that the storage system has sufficient capacity to handle the increased data volume.

In this post, we’ll provide details about how we run the holiday load test and verify Druid is able to handle the expected increases mentioned above.

Pinterest traffic increases as users look for inspiration for holidays.

How We Run Load Tests

As mentioned above, the areas our teams focus on are:

  • Can the system handle increased query traffic?
  • Can the system handle the increase in data ingestion?
  • Can the system handle the increase in data volume?

Can the System Handle Increased Query Traffic?

Testing query traffic and SLA is a main goal during holiday load testing. We have two different options for load testing in our Druid system. The first option generates queries based on the current data set in the Druid data and then runs these queries in Druid. The other option captures real production queries and re-runs these queries in Druid. Both of these options have their advantages and disadvantages.

Sample Versus Production Queries

The first option — using generated queries — is fairly simple to run anytime and does not require preparation like capturing queries. However, this type of testing may not accurately show how the system will behave in production scenarios. A real production query may look different and touch different data, query types, and timeframes than what is tested using generated queries. Additionally, any corner cases would be ignored in this type of testing.

The second option has the advantage of having real production queries that would be very similar to what we expect to see during any future traffic. The disadvantage here, however, is that setting up the tests is more involved, as production queries need to be captured and potentially need to be updated to match the new timeline when holiday testing is performed. In Druid, running the same query today versus one week from today may give different latency results, as data will move through different host stages in which data is supported by faster high-memory hosts in the first days/weeks versus slower disk stages for older data.

We decided to move ahead with real production queries because one of our priorities was to replicate production use cases as closely as possible. We made use of a Druid native feature that automatically logs any query that is being sent to a Druid broker host (broker hosts handle all the query work in a Druid cluster).

Test Environment Setup

Holiday testing is not done in the production environment, as this could adversely impact the production traffic. However, the test needs an environment setup as similar to the production environment as possible. Therefore, we created a copy of the production environment that is short-lived and solely used for testing. To test query traffic, the only stages required are brokers, historical stages, and coordinators. We have several tiers of historical stages in the production environment and we replicated the same setup in the test environment as well. We also made sure to use the same host machine types, configurations, pool size, etc.

The data we used for testing was copied over from production. We used a simple MySQL dump to create a copy of all the segments stored in the production environment. Once the dump is added to the MySQL instance in the test environment, the coordinator will automatically trigger the data to be replicated in the historical stages of the test environment.

Before initiating the copy, however, we needed to identify what data is required. This will depend on the client team and on the timeframe their queries request. In some cases, it may not be necessary to copy all data, but only the most recent days, weeks, or months.

Test environment is set up with the same configuration and hosts as Prod environment.

Our test system first connects to the broker hosts on the test environment, then loads the queries from the log file and sends them to the broker hosts. We use a multi-threaded implementation to increase the QPS being sent to the broker nodes. First, we run tests to identify how many threads are needed as a baseline that matches production traffic — for example, 300 QPS. Based on that, we can define how many threads we need to use for testing expected holiday traffic (two, three, or more times the standard traffic).

In our use case, we had loaded the data received up to a specific date (e.g. October 1st). At this point, we were re-running the captured log files on the same date or the day before, to match production behavior. Our test script also was able to update the time frame in a query to match either the current time or a predefined time to allow running any log file and translating it to the data available on the test environment.

Evaluating the Results

To determine the health of our system, we used our existing metrics to compare QPS and P99 latency on brokers and historical nodes, as well as determining system health via indicators like CPU usage of the brokers. These metrics help us identify any bottlenecks.

Query response time with normal traffic and 2x increase on basic system setup.

Typical bottlenecks can include the historical nodes or the broker nodes.

The historical nodes may show a higher latency for increased QPS, which will in turn increase the overall latency. To resolve this, we would add mirror hosts and increase the number of replicas of the data to support better latency under higher load. This step is something that will take time to implement, as hosts need to be added and data needs to be loaded, which can take several hours depending on the data size. Therefore, this is something that should be completed before traffic increases on the production system.

If the broker nodes are no longer able to handle the incoming query traffic, the size of the broker pool needs to be increased. If this is seen in the test environment, or even the production environment, it is much faster to increase the pool size and can potentially be done ad-hoc as well.

Testing with an increased data size on the test environment helps us determine which steps are needed to support the expected holiday traffic changes. We can make these configuration changes in advance, and we can make the support team aware of changes and of the maximum traffic the system is able to handle within the specified SLA (QPS and P99 latency requirements from the client teams).

Can the system handle the increase in data ingestion?

Testing the capacity for real-time data ingestion is similar to testing query performance. It is possible to start with making an estimate of the supported ingestion rate based on the dimensions/cardinality of the ingested data. However, this is only a guideline, and for some high-priority use cases it is a good idea to test early on.

We set up a test environment that has the same capacity, configuration, etc. as the production environment. However, in this step, some help from client teams may be required as we also need to test with increased data from the ingestion source like Kafka topic.

When reviewing the ingestion test, we focused on several key metrics. The ingestion lag should be low, and the number of both successful and rejected events (due to rejection window exceeded) should be closely similar to comparable values in the production environment. We also include validation of ingested data and general system health of overlord and middle manager stages — the stages handling ingestion of real time data.

Sample metrics for successfully ingested events, rejected events and kafka ingestion lag.

Sample metrics for successfully ingested events, rejected events and kafka ingestion lag.

Sample metrics for successfully ingested events, rejected events and kafka ingestion lag.

Can the system handle the increase in data volume?

Evaluating if the system can handle the increase in data volume is probably the simplest and quickest check, though just as important as the previous steps. For this, we take a look at the coordinator UI: here we can see all historical stages, the pool size, and at what capacity they are currently running. Once clients provide details on the expected increase in data volume, it is a fairly simple process to calculate the amount of additional data that needs to be stored over the holiday period and potentially some period after that.

The space is at a healthy percentage (~70%) allowing for some growth.

Results

In the tests we ran this year, we found that our historical stages are in a very good state and are able to handle the additional traffic expected during the holiday time. We did see, however, that the broker pool may need some additional hosts if traffic meets a certain threshold. We have been sure to keep this communication visible with the client teams and support teams so team members are aware and know that the pool size may need to be increased.

Learnings

Timing is very critical with holiday testing. This project has a fixed end date by which all changes need to be completed in the systems before any traffic increases, and the teams need to make sure to have all the pieces in place before results are due. As is true of many projects as well as this one, we need to leave additional buffer time for unexpected changes in timeline and requirements.

Druid is a backend service, which is not always top of mind for many client teams as long as it is performing well. Therefore, is it a good idea to reach out to client teams before testing starts to get their estimation of expected Holiday traffic increases. Some of our clients reached out to us on their own; however, the due date for any capacity increase requests to governance teams would have already passed. In these cases, or where client teams are not sure yet, it is a good practice to make a general estimation on traffic increase and start testing with those numbers.

Keeping track of holiday planning and applied changes for each year is also a good practice. Having a history of changes every year and keeping track of the actual increase versus the original estimates made beforehand will help to make educated estimates on what traffic increases may be expected in the following year.

Knowing the details on the capacity of brokers and historical stages before the holiday updates makes it easier for teams to evaluate what capacities to reduce the clusters after the holidays as well as considering organic growth on a per-month basis.

Future Work

In this year’s use case, we chose the option of capturing broker logs to retrieve the queries we wanted to re-play back to Druid. This option worked for us at this time, though we are planning to look into other options for capturing queries going forward. The log files option works well for a one-off need, but it would be useful to have continuous logging of queries and storing these in Druid. This can help with debugging issues and identifying high-latency queries that may need some tweaking to get performance improvements.

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Machine Learning Engineer
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

With more than 400 million users around the world and 300 billion ideas saved, Pinterest Machine Learning engineers build personalized experiences to help Pinners create a life they love. With just over 3,000 global employees, our teams are small, mighty, and still growing. At Pinterest, you’ll experience hands-on access to an incredible vault of data and contribute large-scale recommendation systems in ways you won’t find anywhere else.

What you’ll do:

  • Build cutting edge technology using the latest advances in deep learning and machine learning to personalize Pinterest
  • Partner closely with teams across Pinterest to experiment and improve ML models for various product surfaces (Homefeed, Ads, Growth, Shopping, and Search), while gaining knowledge of how ML works in different areas
  • Use data driven methods and leverage the unique properties of our data to improve candidates retrieval
  • Work in a high-impact environment with quick experimentation and product launches
  • Keeping up with industry trends in recommendation systems 

 

What we’re looking for:

  • 2+ years of industry experience applying machine learning methods (e.g., user modeling, personalization, recommender systems, search, ranking, natural language processing, reinforcement learning, and graph representation learning)
  • End-to-end hands-on experience with building data processing pipelines, large scale machine learning systems, and big data technologies (e.g., Hadoop/Spark)
  • Nice to have:
    • M.S. or PhD in Machine Learning or related areas
    • Publications at top ML conferences
    • Expertise in scalable realtime systems that process stream data
    • Passion for applied ML and the Pinterest product

 

#LI-HYBRID
#LI-LA1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

iOS Engineer, Product
San Francisco, CA, US; New York City, NY, US; Portland, OR, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

We are looking for inquisitive, well-rounded iOS engineers to join our Product engineering teams. Working closely with product managers, designers, and backend engineers, you’ll play an important role in enabling the newest technologies and experiences. You will build robust frameworks & features. You will empower both developers and Pinners alike. You’ll have the opportunity to find creative solutions to thought-provoking problems. Even better, because we covet the kind of courageous thinking that’s required in order for big bets and smart risks to pay off, you’ll be invited to create and drive new initiatives, seeing them from inception through to technical design, implementation, and release.

What you’ll do:

  • Build out Pinner-facing frontend features in iOS to power the future of inspiration on Pinterest
  • Contribute to and lead each step of the product development process, from ideation to implementation to release; from rapidly prototyping, running A/B tests, to architecting and building solutions that can scale to support millions of users
  • Partner with design, product, and backend teams to build end to end functionality
  • Put on your Pinner hat to suggest new product ideas and features
  • Employ automated testing to build features with a high degree of technical quality, taking responsibility for the components and features you develop
  • Grow as an engineer by working with world-class peers on varied and high impact projects

What we’re looking for:

  • Deep understanding of iOS development and best practices in Objective C and/or Swift, e.g. xCode, app states, memory management, etc
  • 2+ years of industry iOS application development experience, building consumer or business facing products
  • Experience in following best practices in writing reliable and maintainable code that may be used by many other engineers
  • Ability to keep up-to-date with new technologies to understand what should be incorporated
  • Strong collaboration and communication skills

Product iOS Engineering teams: 

Creator Incentives 

Home Product

Native Publishing

Search Product

Social Growth

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

iOS Engineer, Product Excellence
Mexico City, MEX

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

On the Product Quality team you partner closely with designers, product leaders and engineers across the company to tackle initiatives that span across all of our features and embed with feature teams to increase adoption of best practices. 

What you’ll do:

  • Drive key initiatives that span across all of our product surfaces e.g. redesign or improving accessibility
  • Improve feature quality by embedding with the feature teams to increase adoption of our reusable components and implement best practices
  • Own and improve the chrome of the app including navigation in between features

What we’re looking for:

  • Deep knowledge of iOS development either in Objective-C or Swift
  • Knowledge with iOS design principles and accessibility best practices
  • Experience partnering with designers to implement design
  • Experience in understanding large code bases, including techniques to help keep them maintainable
  • Experience with A/B experiments and data analysis
  • Ok with ambiguity and self-driving ideas & initiatives
  • Excited to jump into different areas of the codebase 
  • Strong collaboration and communication skills

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Machine Learning Engineer, Core Engi...
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

With more than 400 million users around the world and 300 billion ideas saved, Pinterest Machine Learning engineers build personalized experiences to help Pinners create a life they love. With just over 3,000 global employees, our teams are small, mighty, and still growing. At Pinterest, you’ll experience hands-on access to an incredible vault of data and contribute large-scale recommendation systems in ways you won’t find anywhere else.

What you’ll do:

  • Build cutting edge technology using the latest advances in deep learning and machine learning to personalize Pinterest
  • Partner closely with teams across Pinterest to experiment and improve ML models for various product surfaces (Homefeed, Ads, Growth, Shopping, and Search), while gaining knowledge of how ML works in different areas
  • Use data driven methods and leverage the unique properties of our data to improve candidates retrieval
  • Work in a high-impact environment with quick experimentation and product launches
  • Keeping up with industry trends in recommendation systems 

 

What we’re looking for:

  • 2+ years of industry experience applying machine learning methods (e.g., user modeling, personalization, recommender systems, search, ranking, natural language processing, reinforcement learning, and graph representation learning)
  • End-to-end hands-on experience with building data processing pipelines, large scale machine learning systems, and big data technologies (e.g., Hadoop/Spark)
  • Nice to have:
    • M.S. or PhD in Machine Learning or related areas
    • Publications at top ML conferences
    • Expertise in scalable realtime systems that process stream data
    • Passion for applied ML and the Pinterest product

 

#LI-HYBRID
#LI-LA1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Verified by
Software Engineer
Sourcer
Software Engineer
Talent Brand Manager
Tech Lead, Big Data Platform
Security Software Engineer
You may also like