Scaling PostgreSQL at Thumbtack: Load Balancing And Health Checks


By Marco Almeida, Site Reliability Engineer at Thumbtack.


Introduction

Running PostgreSQL on a single primary (master) node is simple and convenient. There is a single source of truth, one instance to handle all reads and writes, one target for all clients to connect to, and only a single configuration file to maintain. However, such a setup usually does not last forever. As traffic increases, so does the number of concurrent reads and writes; the read/write ratio may become too high for a single node to serve; a fast and reliable recovery plan needs to exist; and the list goes on.

No single approach solves all possible scaling challenges, but there are quite a few options for scaling PostgreSQL depending on the requirements. When the read/write ratio is high enough, there is a fairly straightforward scaling strategy: set up secondary PostgreSQL nodes (replicas) that stream data from the primary node (master), and split SQL traffic by sending all writes (INSERT, DELETE, UPDATE, UPSERT) to the single master node and all reads (SELECT) to the replicas. There can be many replicas, so this strategy scales better the higher the read/write ratio is. Replicas are also valuable for disaster recovery, as one of them can be promoted to master in the event of a failure.
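The read/write split described above can be sketched in a few lines. This is an illustration only, not code from our stack: the `QueryRouter` class and the host names are hypothetical, and classifying a statement by its first keyword is a simplifying assumption (a `SELECT ... FOR UPDATE`, or a `SELECT` that calls a function with side effects, would still need to go to the master).

```python
import itertools

# Statements starting with one of these keywords are treated as reads.
# Assumption for this sketch: everything else is a write.
READ_ONLY_KEYWORDS = {"SELECT", "SHOW"}


class QueryRouter:
    """Send reads to replicas in round-robin order and writes to the master."""

    def __init__(self, master, replicas):
        self.master = master
        # With no replicas configured, reads fall back to the master.
        self._replicas = itertools.cycle(replicas or [master])

    def route(self, sql):
        stripped = sql.lstrip()
        first_word = stripped.split(None, 1)[0].upper() if stripped else ""
        if first_word in READ_ONLY_KEYWORDS:
            return next(self._replicas)
        return self.master


# Hypothetical host names, for illustration only.
router = QueryRouter("master:5432", ["replica-1:5432", "replica-2:5432"])
```

In practice the round-robin step is usually delegated to a load balancer in front of the replicas, so clients only need to know two endpoints: one for writes and one for reads.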

Context

In 2014, Thumbtack was running PostgreSQL 9.1 on two servers: a basic master-slave setup leveraging PostgreSQL’s built-in streaming replication. Our infrastructure consisted of a few dozen physical machines on SoftLayer running RHEL 5, and we were using HAProxy with Keepalived for load balancing. The future, already being planned for, would be powered by EC2 instances on AWS, running Debian 7 behind Elastic Load Balancers.

As traffic grew, we knew we would need to scale out PostgreSQL further. Thumbtack’s SQL traffic was (and still is) quite read-intensive, with less than 3% of all queries being executed on the master node. This was good news as it meant we could scale out by sending SELECT statements to a cluster of read-only replicas and leaving the master alone to process DML commands.

In order to implement this properly, we would need:

  • an arbitrary number of read-only replicas behind a load balancer;
  • a load balancer that is not itself a single point of failure;
  • a way of performing health checks on each server, executed from the load balancer, so that nodes are automatically taken out of rotation when they fail and put back when they recover;
  • support for both the SoftLayer and AWS environments during the transition period.

Replication, high-availability, and load-balancing

We knew what we wanted the infrastructure to look like from a high-level perspective and had the tools available to implement almost all of it on both providers (Fig. 1).

Figure 1: Thumbtack Postgres architecture

One critical detail, however, was far from being a solved problem: health checks.

A basic ping on port 5432 was not enough. Performance and replication lag were (and still are!) very important factors to us: if a given replica is lagging behind by more than N seconds (where N varies with the database and the cluster we’re connecting to), we prefer not to use it until it recovers, as it would otherwise serve stale reads.
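To make the lag rule concrete, here is a minimal sketch of the decision. The SQL uses PostgreSQL's built-in `pg_last_xact_replay_timestamp()` function; the cluster names, the thresholds, and the `replica_is_fresh` helper are hypothetical and only illustrate that N differs per cluster.

```python
# Seconds elapsed since the last transaction was replayed on a replica.
# pg_last_xact_replay_timestamp() returns NULL when nothing has been replayed
# (for example on a server that is not in recovery).
REPLICATION_LAG_SQL = (
    "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
)

# Hypothetical per-cluster values of N, in seconds.
MAX_LAG_SECONDS_BY_CLUSTER = {
    "web": 5.0,
    "analytics": 300.0,
}


def replica_is_fresh(lag_seconds, max_lag_seconds):
    """Return True if a replica is fresh enough to serve reads.

    lag_seconds is the result of REPLICATION_LAG_SQL, or None if the
    query returned NULL; an unknown lag is treated as unhealthy.
    """
    if lag_seconds is None:
        return False
    return lag_seconds <= max_lag_seconds
```

One caveat with this way of measuring lag: if the master receives no writes for a while, the replayed timestamp stops advancing and the computed lag grows even though the replica is fully caught up.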

Custom health checks

Not having found an open source tool that implemented powerful enough health checks for PostgreSQL, we decided to write our own. These were the requirements:

  1. Work equally well on both environments: RHEL 5/HAProxy on SoftLayer and Debian 7/ELBs on AWS
  2. Check basic TCP connectivity, on an arbitrary port, with a configurable timeout
  3. Check server availability by running a test query with a time limit — if a server is under load, it may be responding to TCP but not able to process a simple query (SELECT 1). We need to distinguish between these two scenarios, and potentially take different actions
  4. Check replication lag (time elapsed since the last transaction was replayed)
  5. Support custom health checks in the form of SQL queries — extensible and future-proof
  6. Low memory footprint — avoid “stealing” memory from PostgreSQL
  7. Minimal list of external dependencies
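Requirement 2 is the simplest of these to illustrate. The sketch below shows a TCP connectivity check with a configurable timeout using only the Python standard library; the `tcp_check` name and its default timeout are assumptions made for this example.

```python
import socket


def tcp_check(host, port, timeout_seconds=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False
```

A successful result only proves that something is accepting connections on that port, which is why requirement 3 asks for a separate test query.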

A web service, exposing a simple HTTP endpoint, would work in any environment and easily be able to test TCP connectivity. Simple queries and testing replication lag are just a special case of running arbitrary SQL queries as a health check, so we just focused on this one and implemented the others as a form of syntactic sugar.
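The "syntactic sugar" idea can be sketched as follows: a generic check is an SQL query plus a comparison against an expected value, and the built-in checks are just preconfigured instances of it. This is a hypothetical illustration, not the actual configuration format or API of our tool; `run_query` stands in for whatever executes the SQL and returns a single value.

```python
import operator
from dataclasses import dataclass

COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


@dataclass(frozen=True)
class SqlCheck:
    """A health check: run a query, compare its single result to a value."""

    query: str
    comparator: str
    expected: float

    def passes(self, run_query):
        # run_query is a caller-supplied function (an assumption of this
        # sketch) that executes the SQL and returns one scalar, or None.
        try:
            value = run_query(self.query)
        except Exception:
            return False  # a failed or timed-out query fails the check
        if value is None:
            return False
        return COMPARATORS[self.comparator](value, self.expected)


def simple_query_check():
    """Sugar for 'the server can answer SELECT 1'."""
    return SqlCheck("SELECT 1", "=", 1)


def replication_lag_check(max_lag_seconds):
    """Sugar for 'replication lag is at most max_lag_seconds'."""
    return SqlCheck(
        "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())",
        "<=",
        max_lag_seconds,
    )
```

With this shape, adding a new kind of check means writing a query and a comparison rather than new code.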

Programming languages

One important decision for delivering a platform-independent solution with a low memory footprint and minimal dependencies was the choice of programming language. We considered a few, from Python (there was already a reasonably large Python code base at Thumbtack) to Go (we were taking our first steps with it), and even Rust (too immature at the time).

We ended up writing it in C. It was easy to meet all requirements with only one external dependency (for implementing the web server), there were no challenges running it on any of the Linux distributions we were maintaining, and it was arguably the implementation with the smallest memory footprint given the choices above.

The final result

We named the project pgDoctor and made it publicly available on our GitHub repository. It uses libmicrohttpd to implement a very simple web service that listens on port 8071, logs to the local7 syslog facility (configurable), and provides a reasonably rich set of configuration parameters. The behavior is quite simple: an HTTP GET request to port 8071 returns 200 if all checks pass, 500 otherwise. All errors are logged.
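The 200/500 contract is easy to reproduce. The sketch below is not pgDoctor (which is written in C); it is a Python stand-in that shows the same behavior with the standard library. The function names are hypothetical, and each check is assumed to be a zero-argument callable returning True or False.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def status_code(checks):
    """Return 200 if every check passes, 500 otherwise.

    A check that raises an exception counts as a failure.
    """
    for check in checks:
        try:
            if not check():
                return 500
        except Exception:
            return 500
    return 200


def make_handler(checks):
    """Build a request handler that runs the checks on every GET."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            code = status_code(checks)
            self.send_response(code)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"OK\n" if code == 200 else b"FAIL\n")

    return HealthHandler


def serve(checks, port=8071):
    """Serve health checks until interrupted (blocks the calling thread)."""
    HTTPServer(("", port), make_handler(checks)).serve_forever()
```

Calling `serve([...])` with a list of checks starts the endpoint. A load balancer configured with an HTTP health check against that port then removes the node whenever the response is not 200.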

pgDoctor has been running flawlessly on all our PostgreSQL replicas for roughly 3 years now, having gone through two major upgrades (9.1 → 9.4 → 9.6). As of now, there are 18 streaming replicas, all running pgDoctor alongside PostgreSQL, and distributed among 4 clusters. Each cluster supports different use cases and requires slightly different health checks.

PostgreSQL replicas are sometimes taken out of rotation. The most common reasons are temporary high replication lag or some transient issue with the underlying EC2 instance. As expected, they are added back to the cluster without any intervention once normality is restored and the health checks succeed.

Figure 2 shows a diagram of (a downsized version of) our production environment:

  • Three availability zones;
  • One master node and two hot-standby instances in different availability zones;
  • Three clusters of read-only replicas, streaming from the master, each with its own load balancer;
  • Several clients, in all availability zones, reading from one or more clusters and writing to the master.

Figure 2: Thumbtack Postgres architecture (downsized production environment)

Does this sound interesting? There is a lot more to be done. Join Thumbtack and help us build, scale, and operate a high reliability service!

Related work

http://www.severalnines.com/mysql-load-balancing-haproxy-tutorial#issues
https://www.digitalocean.com/community/tutorials/how-to-use-haproxy-to-set-up-mysql-load-balancing--3


Originally posted on Thumbtack Engineering
