By Paul Liu, DevOps Developer at 500px. You can find him here on StackShare and on Twitter.


Background

500px is an online community for premium photography. Millions of users all over the world share, discover, sell and buy the most beautiful images. We value design, simplicity in the code, and getting stuff done.

I'm a DevOps Developer at 500px working on the Platform Team. I work on stuff like backend systems, monitoring, configuration management, and deployment and automation. Prior to joining 500px, I spent many years as a sysadmin working in the insurance industry.

Engineering at 500px

The 500px Engineering team is split into four groups: the Web team, the Mobile team, the QA/Release Engineering team and my team, the Platform team, which is a combined team that handles building our API and backend services as well as technical operations and infrastructure management.

Our teams are highly cross-functional and boundaries are fairly loose, so engineers wind up moving around a lot between teams to work on specific projects. This helps us spread knowledge and prevent siloing. There is also a very tight communications loop between engineering, product, design and the customer excellence teams, which helps to keep us honest, agile, and focused on delivering the right things.


A diverse group of people


General Architecture

The architecture of 500px can be thought of as a large Ruby on Rails monolith surrounded by a constellation of microservices. The Rails monolith serves the main 500px web application and the 500px API, which powers the web app, mobile apps and all our third party API users. The various microservices provide specific functionality of the platform to the monolith, and also serve some API endpoints directly.

The Rails monolith is a fairly typical stack: the App and API servers serve requests with Unicorn, fronted by Nginx. We have clusters of these Rails servers running behind either HAProxy or LVS load balancers, and the monolith's primary datastores are MySQL, MongoDB, Redis and Memcached. We also have a bunch of Sidekiq servers for background task processing. We currently host all of this on bare metal servers in a datacenter.

The microservices are a bit more interesting. We have about ten of them at the moment, each centered around providing an isolated and distinct business capability to the platform. Some of our microservices are:

  • Search related services, built on Elasticsearch
  • Content ingestion services, in front of S3
  • User feeds and activity streams, built on Roshi and AWS Kinesis
  • A dynamic image resizing and watermarking service
  • Specialized API frontends for our web and mobile applications

We run our microservices in Amazon EC2 and in our datacenter environment. They are mostly written in Go, though there are a couple of outliers which use Node.js or Sinatra. However, regardless of the language in use, we try to make all of our microservices good 12-factor apps, which helps to reduce the complexity of deployments and configuration management. All of our services are behind an HAProxy or an ELB.


Hard at work at 500px


The microservices pattern is great because it allows us to abstract away complex behaviour and domain-specific knowledge behind APIs and move it out of the monolith. Front-end product teams consuming these services only have to know about the service's API, and service maintainers are free to change anything about their service (as long as they maintain that API!). For example, you can query the search service without having to know a thing about Elasticsearch. This flexibility has proven to be extremely powerful for us as we evolve our platform because it lets us try out new technologies and new techniques in a safe and isolated way. If you are curious about implementing microservices yourself, former 500pxer and overall rad dude Paul Osman gave a great talk at QConSF last year about how and why we did it. My face is on one of the slides, but that is only one of the many reasons why this talk is awesome.

Image Processing

Probably the most interesting set of microservices we run at 500px are the ones having to do with serving images and image processing. Every month we ingest millions of high resolution photos from our community and we serve hundreds of terabytes of image traffic from our primary CDN, Edgecast. Last month we did about 569TB of total data transfer, with 95th percentile bandwidth of about 2308Mbps. People really like looking at cool pictures of stuff!
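To put those two numbers side by side, the monthly transfer can be converted into an average bit rate. The sketch below assumes a 30-day month and decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second); neither assumption is stated in the figures above. Under those assumptions 569 TB per month averages out to roughly 1,756 Mbps, so the 95th percentile figure of 2308 Mbps sits at about 1.3 times the average.

```go
package main

import "fmt"

// AverageMbps converts a transfer volume in terabytes over a number of days
// into an average bit rate in megabits per second.
// Assumes decimal units: 1 TB = 1e12 bytes, 1 Mbps = 1e6 bits/s.
func AverageMbps(terabytes float64, days int) float64 {
	bits := terabytes * 1e12 * 8
	seconds := float64(days) * 24 * 60 * 60
	return bits / seconds / 1e6
}

func main() {
	avg := AverageMbps(569, 30) // 30-day month is an assumption
	fmt.Printf("average: %.0f Mbps, p95/average: %.2f\n", avg, 2308/avg)
}
```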


A representative cool picture. Also, 500px's home town, Toronto


To ingest and serve all these images we run a set of three microservices in EC2, all built around S3, which is where we store all of our images. All three of these services are written in Go. We really like using Go for these cases because it allows us to write small, fast, and concurrent services, which means we can host them on fewer machines and keep our hosting costs under control.

The first microservice users encounter when they upload a photo is one we call the Media Service. The Media Service is fairly simple: it accepts the user's upload, does some housekeeping stuff, persists to S3, and then finally enqueues a task into RabbitMQ for further processing.
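The accept, persist, enqueue flow can be sketched in a few lines of Go. This is not the Media Service itself: the `ObjectStore` and `TaskQueue` interfaces, the in-memory stand-ins for S3 and RabbitMQ, the content-hash key scheme, and the JSON task format are all hypothetical, chosen so the example runs without any external services.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// ObjectStore is a hypothetical stand-in for S3.
type ObjectStore interface {
	Put(key string, data []byte) error
}

// TaskQueue is a hypothetical stand-in for a RabbitMQ publisher.
type TaskQueue interface {
	Publish(body []byte) error
}

// ConvertTask is the (assumed) message handed to the next service.
type ConvertTask struct {
	Key string `json:"key"`
}

// In-memory implementations so the sketch runs on its own.
type memStore map[string][]byte

func (m memStore) Put(key string, data []byte) error { m[key] = data; return nil }

type memQueue struct{ tasks [][]byte }

func (q *memQueue) Publish(body []byte) error { q.tasks = append(q.tasks, body); return nil }

// HandleUpload validates the upload, persists the original, then enqueues a task.
// The task is only published after the store write succeeds, so a consumer
// never sees a key that does not exist yet.
func HandleUpload(store ObjectStore, queue TaskQueue, data []byte) (string, error) {
	if len(data) == 0 {
		return "", errors.New("empty upload")
	}
	sum := sha256.Sum256(data)
	key := "originals/" + hex.EncodeToString(sum[:])
	if err := store.Put(key, data); err != nil {
		return "", fmt.Errorf("persist original: %w", err)
	}
	body, err := json.Marshal(ConvertTask{Key: key})
	if err != nil {
		return "", fmt.Errorf("encode task: %w", err)
	}
	if err := queue.Publish(body); err != nil {
		return "", fmt.Errorf("enqueue task: %w", err)
	}
	return key, nil
}

func main() {
	store, queue := memStore{}, &memQueue{}
	key, err := HandleUpload(store, queue, []byte("fake image bytes"))
	fmt.Println(key, err, len(queue.tasks))
}
```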

Next, consuming those tasks off of RabbitMQ is another service called (creatively) the Converter Service. The Converter Service downloads the original image from S3, does a bunch of image processing to generate various-sized thumbnails, and then saves these static conversions back to S3. We then use these conversions in lots of places around our site and for our mobile apps.
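One small piece of generating fixed thumbnail sizes is working out the target dimensions for each one. The sketch below shows one common way to do that: scale the image so its longest side fits a limit, keep the aspect ratio, and never upscale. The function name `FitWithin` and the size list are hypothetical, and whether the Converter Service sizes images this way is an assumption.

```go
package main

import "fmt"

// FitWithin returns the dimensions of a w x h image scaled down so that its
// longest side is at most maxSide, preserving aspect ratio.
// Images that already fit are returned unchanged (no upscaling).
func FitWithin(w, h, maxSide int) (int, int) {
	if w <= 0 || h <= 0 || maxSide <= 0 {
		return 0, 0
	}
	longest := w
	if h > longest {
		longest = h
	}
	if longest <= maxSide {
		return w, h
	}
	// Integer scaling rounded to the nearest pixel, never below 1.
	scale := func(v int) int {
		s := (v*maxSide + longest/2) / longest
		if s < 1 {
			s = 1
		}
		return s
	}
	return scale(w), scale(h)
}

func main() {
	// Hypothetical set of thumbnail sizes for one 6000x4000 original.
	for _, size := range []int{256, 1024, 2048} {
		w, h := FitWithin(6000, 4000, size)
		fmt.Printf("max %d -> %dx%d\n", size, w, h)
	}
}
```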

So far this probably isn't very surprising for a photo-sharing website, and, for a while, these two services did everything we needed -- we simply set the S3 bucket containing the resulting thumbnails as the origin for our CDN. However, as the site continued to grow, we found this solution was pretty costly and space inefficient, as well as not very flexible when new products required new sizes.

To solve this problem, we recently built what we creatively call our Resizer Service (yes, we tend to choose descriptive names for these things). This new service now acts as the CDN origin and dynamically generates any size or format of image we need using the S3 original. It can also watermark the image with a logo and apply photographer attribution, which is reassuring to our community.

The Resizer Service is fairly high throughput, with the cluster handling about 1000 requests per second during peak times. Doing all this resizing and watermarking is pretty compute-intensive, so it's a bit of a challenge to keep response times reasonable when the load is high. We've worked really hard on this problem, and at peak traffic we're able to maintain a 95th percentile response time that is below 180ms. We do this through the use of a really cool, really fast image processing library called VIPS, aggressive caching, and by optimizing like crazy. Outside of peak hours, we can usually get below 150ms.

And we're not done with this problem yet! There are almost certainly more optimizations to be found, and we hope to keep pushing those response times down further and further in the future.

Workflow

We use GitHub and practice continuous integration for all of our primary codebases.

For the Rails monolith, we use Semaphore and Code Climate. We use a standard RSpec setup for unit testing, and a smaller set of Capybara/Selenium tests for integration testing. Fellow 500pxer and professional cool guy Devon Noel de Tilly has written at length about how we use those tools, so I won't try to outdo him -- just go check it out.

For our Go microservices, we use Travis CI to run tests and to create Debian packages as build artifacts. Travis uploads these packages to S3, and then another system pulls them down, signs them, and imports them into our private Apt repository. We use FPM to create packages, and Aptly to manage our repos. Lately, though, I've been trying out packagecloud.io and I really like it so far, so we may be changing how we do this in the near future.

For deployments, we use a combination of tools. At the lowest level we use Ansible and Capistrano for deploys and Chef for configuration management. At a higher level, we've really embraced chatops at 500px, so we've scripted the use of those tools into our beloved and loyal Hubot friend, BMO.


Headquarters in snowy Toronto


Anyone at 500px can easily deploy the site or a microservice with a simple chat message like bmo deploy <this thing>. BMO goes out, deploys the thing, and then posts a log back into the chat. It's a simple, easy mechanism that has done wonders to increase visibility and reduce complexity around deploys. We use Slack, which is where you interact with BMO, and it makes everything really nicely searchable. If you want to find a log or if you forget how to do something, all you have to do is search the chat. Magical.

Other Important Apps

We monitor everything with New Relic, Datadog, ELK (Elasticsearch, Logstash and Kibana), and good old Nagios. We send all our emails with Mandrill and Mailchimp and we process payments with Stripe and PayPal. To help us make decisions, we use Amazon's Elastic MapReduce and Redshift, as well as Periscope.io. We use Slack, Asana, Everhour, and Google Apps to keep everyone in sync. And when things go wrong, we've got PagerDuty and Statuspage.io to help us out and to communicate with our users.

The Future, Conan?

Right now I'm experimenting with running our microservice constellation in Docker containers for local dev (docker-compose up), with an eye to running them in production in the future. We've got a CI pipeline working with Travis and Docker Hub, and I'm really excited by the potential of cloud container services like Joyent Triton and Amazon ECS. As we build more and more microservices and expand the stack, we're also looking at service discovery tools like Consul and task frameworks like Mesos to make our system scale harder, better, and faster.


A long and winding road


More Faster Is More Better

We're expanding quickly and hiring for all kinds of positions. We're looking for DevOps types, Backend and Frontend developers, Mobile developers (both Android and iOS), UX designers, and salespeople. We build cool stuff, we're passionate, and we're flying by the seat of our pants at breakneck speed, building the best thing we know how to make. If you like doing awesome cool stuff and you aren't afraid to get your hands dirty, come join us.


Jobs at 500px
Android or iOS Developer
Associate Product Manager
Data Analyst
Engineering Manager






Comments
sergiotapia sergiotapia

You think image sharing and you think it's super simple to just plop an S3 over your web app. Reading this showed me a lot of things I haven't even considered! Really interesting stuff, thanks for sharing.

7
10 months ago
petervandenabeele petervandenabeele

I see you are using Kinesis. How would you compare it with setting up a Kafka cluster (re: performance, investment)?

2
11 months ago
cdmicacc cdmicacc

Hi Peter,

I can't speak to performance, as we never actually ran Kafka in production, but we chose to use Kinesis because we didn't want to manage our own Kafka (& thus Zookeeper) cluster, at least for now. 500px is run by a fairly small group of engineers, so we try to use cloud services, like AWS Kinesis, when it makes sense from a cost vs time investment perspective. Of course, at some future point we may reevaluate this decision and move onto an internally-managed solution such as Kafka.

1
10 months ago
LiranCohen LiranCohen

Are you guys looking to eventually leave Rails?

1
11 months ago
cdmicacc cdmicacc

I'm the Director of Platform at 500px. We're not planning on leaving Rails any time soon, but we are slowly moving parts of the monolith into new, separate microservices, some of which will be in Go, but others may be in Sinatra or Rails -- we'll choose the right technology for the service in each case.

2
11 months ago
jasdeepsingh jasdeepsingh

What's the benefit of using VIPS over something like ImageMagick?

1
11 months ago
cdmicacc cdmicacc

Speed. Our resizer service needs to respond within about 200 milliseconds with an image, and our users tend to upload very large photos (sometimes hundreds of megabytes in size). ImageMagick/GraphicsMagick unfortunately are just not performant enough to meet this goal for these large images.

Of course, VIPS has its own drawbacks -- it took us quite a while to determine the right configuration of filters and resize operations to produce a good quality image, something that comes pretty much built-in with IM or GM -- but its performance made the investment worthwhile.

2
11 months ago
jcupitt jcupitt

It depends on the processing you are doing, but on this benchmark VIPS is four times faster than ImageMagick and needs only 1/20th of the memory.

2
10 months ago
kmartinez kmartinez

Wikimedia and booking.com found the same - it's great to see all our hard work on optimising VIPS (25yrs!) has been worth it. Ironically we started with speeding up the processing of one "big" image on a $10k workstation and now it is used to process millions of images as fast as possible.

2
10 months ago
ltregan ltregan

MongoDB has decent features for searching based on attributes. Since you also use Elasticsearch, could you explain at what point you needed to introduce it into the stack?

Thanks,

1
8 months ago
stevesun21 stevesun21

Nice post. I adopted the same idea when I designed ImageS3, which lets you run image processing as a microservice for all your projects. It also uses Amazon S3 for hosting image files.

BTW, it is an open source project: images3.com

0
11 months ago
ghprod ghprod

Unbelievable, beyond my knowledge. I didn't know that image sharing could be so complicated. Cheers guys :)

0
about 1 year ago
dhay06 dhay06

I said the same thing.. :)

0
5 months ago

