By Harlow Ward, Developer and Co-founder at Clearbit.


Clearbit builds Business Intelligence APIs. Our suite of APIs is focused on Lead Enrichment and Automated Research.

[Image: Clearbit lookup example]

Our goal is to help modern businesses make better data-driven decisions. Our platform aggregates data from hundreds of public sources and packages it up into beautifully hand-crafted JSON payloads.

Customers use our APIs to:

  • Give their sales team more information on customers, leads, and prospects.
  • Integrate and surface person/company data to the end-users of their systems.
  • Underwrite transactions and reduce fraud.

Outside of our paid products, we also love releasing free products. These bite-sized APIs are hyper-focused on helping designers and developers enhance the user experience of their tools and systems.

A few of these freebies include:


Engineering at Clearbit

Our engineering team consists of three developers: Alex MacCaw (also our fearless CEO), Rob Holland, and myself.

We are a small dev team, and that means we all wear a lot of hats. Day-to-day, it’s not uncommon to jump between frontend HTML/JS/CSS, API design, service administration, DB administration, infrastructure management, and of course a little customer support.


Services Everywhere

We made the decision early on to build a microservice-first architecture. This means our system is composed of lots of tiny Single Responsibility Services (SRS anyone?).

In general these services are written in Ruby, leverage Sinatra to expose JSON endpoints, and use RSpec to verify accuracy. Each service maintains its own datastore; depending on the service's needs we’ll typically choose from Amazon RDS, Amazon DynamoDB, or hosted Elasticsearch with Found.
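
As a rough sketch of the shape (an illustrative toy, not one of our real services; CompanyStore is a hypothetical stand-in for whichever datastore the service owns):

    # service.rb -- a toy Sinatra service exposing a JSON endpoint.
    # CompanyStore is a hypothetical stand-in for the service's datastore.
    require 'sinatra'
    require 'json'

    get '/companies/:domain' do
      content_type :json
      company = CompanyStore.find(params[:domain])
      halt 404, { error: 'unknown domain' }.to_json unless company
      company.to_json
    end

    # spec/service_spec.rb -- exercised with RSpec and rack-test.
    require 'rack/test'

    describe 'company endpoint' do
      include Rack::Test::Methods

      def app
        Sinatra::Application
      end

      it 'returns 404 for unknown domains' do
        get '/companies/nope.example'
        expect(last_response.status).to eq(404)
      end
    end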

There are some great arguments to be made about a MonolithFirst architecture. However, in our case, we felt our data boundaries were reasonably clear from the beginning, and this allowed us to make a few low-risk bets around building and running a microservice-first architecture. So far so good!

Our web services fall into two categories:

  1. External (publicly accessible, authenticated via API keys).
  2. Internal (accessible within VPC, locked down to specific security groups).

At any given time we’re running 70+ different internal services across a cluster of 18 machines. Our external (customer-facing) APIs are serving upwards of 2 million requests per day, and that number is rapidly increasing.


Early Days

When working with a microservice architecture it's difficult to overstate how important it is for a developer to be able to quickly push a new web service.

Our initial architecture was built on Amazon EC2 and leveraged dokku-alt (a Docker powered mini-Heroku) to manage deployments.

Dokku-alt covered our basic requirements:

  • Git-based deploys.
  • Managing ENV vars outside of config files.
  • Ability to rollback in case of emergency.

However, as the number of servers grew some shortcomings of dokku-alt began to emerge. This was no fault of dokku-alt; we were just outgrowing our architecture.

As we added more machines the problems compounded. The per-machine configuration management we had initially loved quickly became unsustainable. On top of that, running git push production master simultaneously to every box in the cluster made for some nerve-racking deploys.

The state of our deployment system was beginning to take a toll on the team's productivity. It was time to make a change. We collectively decided to explore our options.


Current Stack

As our infrastructure grew, our deployment requirements also evolved:

  • Distributed configuration management.
  • Git push to only one repository.
  • Blue/Green style deploys.

After looking into solutions like Deis and Flynn, we decided we'd feel happier with something with simpler semantics. We were attracted to Fleet because of its simplicity and flexibility, and the reputation of the CoreOS team.

Coordinating configuration between machines became a breeze with etcd. Now, when our deployer app builds a new Docker container, we can inject environment variables from etcd directly into the container.
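
The mechanics look roughly like this (a simplified sketch against etcd's v2 HTTP API; the /config/<app> key layout and the docker run wiring are hypothetical):

    # Sketch: read an app's config keys from etcd (v2 HTTP API) and turn them
    # into `docker run -e` flags. The /config/<app> layout is a made-up example.
    require 'net/http'
    require 'json'

    def env_flags_for(app)
      uri   = URI("http://127.0.0.1:2379/v2/keys/config/#{app}?recursive=true")
      body  = JSON.parse(Net::HTTP.get(uri))
      nodes = body.fetch('node', {}).fetch('nodes', [])
      nodes.map do |node|
        name = File.basename(node['key']) # "/config/myapp/DB_URL" => "DB_URL"
        "-e #{name}=#{node['value']}"
      end.join(' ')
    end

    # e.g. system("docker run #{env_flags_for('myapp')} myapp:latest")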

From there, we use Fleet to distribute the units across our cluster of servers. We’ve found fleet-ui super handy for visualizing how units are spread over the machines.
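
For reference, a fleet unit template for one of these services looks roughly like the following (the service name and ports are made up):

    # myapp@.service -- an illustrative fleet unit template.
    [Unit]
    Description=myapp HTTP service
    After=docker.service
    Requires=docker.service

    [Service]
    ExecStartPre=-/usr/bin/docker rm -f myapp-%i
    ExecStart=/usr/bin/docker run --name myapp-%i -p 500%i:5000 myapp:latest
    ExecStop=/usr/bin/docker stop myapp-%i

    [X-Fleet]
    Conflicts=myapp@*.service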


[Image: fleet-ui unit dashboard]


To keep our operational expenses down, we have a static pool of on-demand EC2 instances running the etcd quorum, HAProxy, and several of the HTTP front ends. On top of that, we leverage a dynamic pool of EC2 Spot Instances to handle the dynamic nature of our workloads during times of extremely high throughput.

Word to the wise: don’t use Spot Instances as part of your etcd quorum. When someone else bids higher than the current Spot Price (and they will), the Spot Instances will disappear without warning.


Monitoring

It’s hard to overstate how important it’s been for us to have a deep, instantly available understanding of the current state of all our services.

Starting from the outside, we use Runscope to continually ping and analyze responses from our services. It’s been instrumental in verifying and maintaining the APIs with dynamic date versioning.

Digging a level deeper, we use Librato for measuring and monitoring lower-level system behaviour. We’re diligent about creating alerts that will notify the team if anything seems awry.

Sentry notifies us immediately via Slack and email if any of our services are throwing errors. We’re big believers in the broken windows theory, and try to keep Sentry as clean as possible.

Finally, we use SumoLogic as our log aggregation platform. We run Sumo Collectors on each of our hosts. SumoLogic is our last line of defense for spotting inconsistent system behaviour and debugging historical issues.


Looking Forward

We have a private contrib repo with a handful of Rack middlewares that are shared across our services. These middlewares dramatically cut down on duplicated code around authentication, authorization, rate limiting, and IP restrictions.
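
To give a flavor, a stripped-down version of the authentication piece might look like this (a sketch only; ApiKeys.valid? is a hypothetical stand-in for our real key lookup):

    # Simplified Rack middleware for API key auth (illustrative sketch;
    # ApiKeys.valid? is a hypothetical stand-in for the real key store).
    require 'rack'
    require 'json'

    class ApiKeyAuth
      def initialize(app)
        @app = app
      end

      def call(env)
        request = Rack::Request.new(env)
        key = request.params['key'] ||
              env['HTTP_AUTHORIZATION'].to_s[/Bearer (.+)/, 1]
        return unauthorized unless key && ApiKeys.valid?(key)
        @app.call(env)
      end

      private

      def unauthorized
        [401, { 'Content-Type' => 'application/json' },
         [{ error: 'invalid API key' }.to_json]]
      end
    end

    # config.ru, in each service:
    #   use ApiKeyAuth
    #   run Sinatra::Application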

In general, the shared middleware approach has worked well for us. However, as we look to the future and the team continues to experiment with new languages, the Ruby middlewares can’t follow us into a polyglot system.

Our goal is to push this shared logic out of the services and into the proxy layer (possibly with the help of VulcanD, Kong, or some custom HAProxy foo).

If you have made a transition like this before, or have an elegant idea of how to somersault this hurdle, I’d love to buy you a beverage. harlow@clearbit.com



Comments
timchunght

Just finished the article. Great read! How are you guys able to retrieve data instantaneously for the Person API? I understand that you can cache the Company API, but how are you caching the Person API, as there will be billions of users online across all social networks?

harlow

timchunght great question! When we don't have an email address (or company domain) in cache we use our robust worker platform to do live lookups. We leverage Sidekiq Pro to process millions of Lookup Jobs per day. We can typically finish a Person Lookup in under 3 seconds by leveraging the massive concurrency that Sidekiq affords us.
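
The shape of one of those jobs is roughly the following (the class names are illustrative, not our actual code):

    # Rough shape of a lookup job. PersonLookupJob, Enrichment, and
    # PersonStore are hypothetical names, not Clearbit's actual classes.
    require 'sidekiq'

    class PersonLookupJob
      include Sidekiq::Worker
      sidekiq_options queue: :lookups

      def perform(email)
        person = Enrichment.lookup(email) # fans out to the upstream sources
        PersonStore.save(email, person) if person
      end
    end

    # Enqueued from the API process:
    #   PersonLookupJob.perform_async('alex@clearbit.com')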

timchunght

Great answer! Thank you for your response. I am a fanboy of Sidekiq and I enjoy its speed and ease of use. What are some places you guys look for personal info? Are they just APIs, or do you do some social network scraping? Thanks.

tedyoung

Nice run-through, thanks for writing it up. Curious what other languages you're looking at and where they'll fit in your environment?

Also, I've run into somewhat similar things around securing and limiting APIs, though on a much smaller scale (internal corporate use), using nginx and Repose (https://github.com/rackerlabs/repose).

harlow

We have a couple of things written in Node.js, and we've prototyped a few of the services in Golang.

REPOSE looks interesting. Thanks for the link, will need to give that a look too.

barkerja

This is a great writeup, thank you for sharing! One question though: using a REST API for communication between your services, how do you ensure deliverability/durability (in the case of a network issue, service failure, etc.)?

harlow

Most of the inter-service communication happens from the background jobs.

For each of those jobs we'll ask the following questions:

  • Does it make sense to retry the HTTP request?
  • How many retries?
  • Can the current work complete without a successful response?

We've written some error handling middleware for Sidekiq that allows us to specify what to do in each of the possible error cases.
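
A trimmed-down sketch of the idea (HttpUnavailable and the :optional_http option are hypothetical, for illustration):

    # Sidekiq server middleware for per-job HTTP error policies.
    # HttpUnavailable and :optional_http are made-up examples.
    class HttpErrorHandling
      def call(worker, job, queue)
        yield
      rescue HttpUnavailable
        # Jobs that declared the HTTP call optional finish anyway; everything
        # else re-raises so Sidekiq's retry machinery kicks in.
        raise unless worker.class.get_sidekiq_options['optional_http']
      end
    end

    Sidekiq.configure_server do |config|
      config.server_middleware do |chain|
        chain.add HttpErrorHandling
      end
    end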

ccerrato147

Great article. Soon I'll be working on a project that will require scaling to millions of requests per day (if we're lucky), and I'll have this article as a guide for using microservices.

Thanks!

ioRekz

I'm curious about the deployer app. Could you tell a bit more and explain how it connects to the rest of the architecture?

Thanks

robholland

Our deployer app is a git server with a repo per application. To deploy a new version of an application we git push; the deployer app builds a Docker image of the code, pushes it to our Docker registry tagged with the git SHA, and then uses an in-house tool called fluster to start any required containers for the new version of the app based on that image.

nobozo

"because of it's simplicity" -> "because of its simplicity"


