How We Built A Blazing Fast Audio/Video Platform With Just A Team Of 4

Instant team audio built into your desktop

By Tom Moor, Co-founder at Speak.



Speak is a team communication tool. We pride ourselves on being the simplest and fastest way to get an audio and video connection with your colleagues. In particular we’re focused on the use case of remote teams and how we can recreate the social and collaborative benefits of the office in a digital domain.

Our startup's hypothesis was that if you provide a very fast way for teams to communicate then they will talk more often and end up being more efficient, and hopefully happy! With the architecture below we aimed for audio connection times in the order of 200ms, which the brain generally perceives as ‘instant’, and we have been able to achieve this.

We were founded in San Francisco in 2013, but the founders are from England and Missouri. We practice what we preach, though, and our team has always been distributed, with other team members based on the US east coast and in the UK.

I’m the technical co-founder, and on a day-to-day basis I split my time roughly evenly between working directly on the product (usually focused on the frontend codebase) and other founder-y duties like high-level strategy, getting exposure for the product, hiring and admin.

Prior to Speak and our previous product, Sqwiggle, I worked at Buffer for two years and built the platform as we scaled from just a few thousand users up to the first million. I previously wrote about some of our early scaling challenges at Buffer here.


Prior to Speak we built another real-time application called Sqwiggle. At Sqwiggle we started in the same way that many startups get off the ground these days: with what essentially amounted to a large monolithic Ruby on Rails application. In fact, we actually had three apps:

  • A REST API built on Rails API
  • A Rails app for the public website and account management
  • A Rails app as an admin portal

The Problem

Although there was some degree of separation at the application instance level, in real terms the architecture was still fairly monolithic. All three applications shared code via RubyGems (and, at one point in time, Git submodules) as well as common resources such as databases and caching layers.

For many services this wouldn’t be an issue. However, as we began to scale up and usage of the platform increased, we found ourselves facing race conditions more and more often. We soon realised that we were bending the framework and technologies out of their usual webpage/refresh domain and trying to make them work for a latency-sensitive real-time application, something that they simply weren’t designed for.

As an example, we’d have users clicking twice to start a conversation, and the two requests would be routed to different servers operating on the same database rows within milliseconds of each other. We ended up requiring locking all over the codebase and saw many bugs caused by race conditions, with data being in unexpected states.


In our case we were lucky enough to have the opportunity to start again with our architecture and incorporate the lessons learnt from over two years working on similar problems. With Speak, we took the SOA (Service-Oriented Architecture) approach, largely inspired by a talk given by Chris Richardson. We have also chosen to use a technique called event sourcing.

Event sourcing is a relatively simple idea, but it profoundly changes the way one thinks about building a web-based application. Rather than storing the state as is, you record how the state got to be where it is. This means you can replay all state changes and get to a specific point in time or the present state of the world - a huge advantage when running a real-time system where hundreds of changes often take place in milliseconds.
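
To make the replay idea concrete, here is a minimal sketch in Go. It is an illustration only, not Speak's event store: the event names, the fields and the in-memory log are all assumptions made for the example, and the log is assumed to be ordered by sequence number.

```go
package main

import "fmt"

// Event is one recorded state change. The event types and fields are
// hypothetical; they are not taken from Speak's actual event store.
type Event struct {
	Seq  int    // position in the log; events are replayed in this order
	Type string // "call_started", "user_joined", "user_left" or "call_ended"
	Call string
	User string
}

// State is a projection of the log: which users are currently in which call.
type State map[string]map[string]bool

// Apply folds a single event into the state.
func (s State) Apply(e Event) {
	switch e.Type {
	case "call_started":
		s[e.Call] = map[string]bool{}
	case "user_joined":
		if s[e.Call] == nil {
			s[e.Call] = map[string]bool{}
		}
		s[e.Call][e.User] = true
	case "user_left":
		delete(s[e.Call], e.User) // deleting from a nil map is a no-op
	case "call_ended":
		delete(s, e.Call)
	}
}

// Replay rebuilds the state from scratch by applying every event with
// Seq <= upTo. It assumes the log is sorted by Seq. Passing the last
// sequence number yields the present state.
func Replay(log []Event, upTo int) State {
	s := State{}
	for _, e := range log {
		if e.Seq > upTo {
			break
		}
		s.Apply(e)
	}
	return s
}

func main() {
	log := []Event{
		{1, "call_started", "standup", ""},
		{2, "user_joined", "standup", "alice"},
		{3, "user_joined", "standup", "bob"},
		{4, "user_left", "standup", "alice"},
	}
	fmt.Println(len(Replay(log, 3)["standup"])) // participants after event 3: 2
	fmt.Println(len(Replay(log, 4)["standup"])) // participants now: 1
}
```

Nothing is ever updated in place: the log only grows, and any past state can be recovered by choosing a smaller `upTo`.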

Here’s how the Speak SOA architecture is currently broken down:

  • A user authentication service, written in Ruby that exposes a JSON HTTP API for signing in and creating Speak accounts. This service uses Postgres as a datastore for passwords and tokens.
  • An event store written in Go and backed by MongoDB; for this we really just want fast writes and sequential reads.
  • A websocket service written in Go. This really just holds onto a websocket connection with each client and selectively forwards messages through into our internal message queue, RabbitMQ.
  • A service written in Ruby that handles the business logic of connecting and authorising calls, user profiles, and other miscellaneous business logic. We plan on splitting this service up further as we grow. The local cache here is MongoDB.
  • A software MCU written in Go and C that manages WebRTC connections to clients and mixing of audio streams to create ad-hoc audio conferences.
  • An admin service, written in Ruby, that allows us to inspect data, keep track of metrics and edit customer information for support purposes.

[Figure: Speak Architecture]

All of these microservices are hosted on EC2, with the more latency-sensitive services, such as audio mixing and forwarding, distributed across different AWS regions and routed using Route53’s latency-based DNS routing. We chose EC2 at this time mainly because our team has experience of working with its quirks, and we wanted to keep time spent directly on ops as low as possible while keeping the ability to scale up quickly in a pinch without having to change providers.

We use RabbitMQ almost exclusively as the internal messaging system between microservices, although synchronous responses are needed in a couple of spots, and for these we use HTTP. We chose RabbitMQ because of its speed (we’re talking hundreds of thousands of messages/second) and its ability to run on any cloud provider or server setup. Each individual service gets the events it cares about by listening to specific exchanges and has its own database where appropriate; we use a mixture of Postgres, MongoDB and Redis for these.
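
The "listen to the exchanges you care about" pattern can be sketched with a small in-memory stand-in. To be clear, this is not the RabbitMQ client API and not Speak's code: the `Broker` type, its methods and the exchange names are hypothetical, and the sketch only mimics fanout-style delivery within a single goroutine.

```go
package main

import "fmt"

// Broker is an in-memory stand-in for a message broker. It is NOT the
// RabbitMQ client; it only illustrates delivery from exchanges to queues.
// It is not safe for concurrent use.
type Broker struct {
	bindings map[string][]chan string // exchange name -> bound queues
}

// NewBroker returns a broker with no exchanges or queues.
func NewBroker() *Broker {
	return &Broker{bindings: map[string][]chan string{}}
}

// Bind creates a queue for one service and attaches it to an exchange.
func (b *Broker) Bind(exchange string) <-chan string {
	q := make(chan string, 16) // buffered so Publish does not block in this demo
	b.bindings[exchange] = append(b.bindings[exchange], q)
	return q
}

// Publish copies msg to every queue bound to the exchange and returns how
// many queues received it. With no bound queues the message is dropped.
func (b *Broker) Publish(exchange, msg string) int {
	for _, q := range b.bindings[exchange] {
		q <- msg
	}
	return len(b.bindings[exchange])
}

func main() {
	b := NewBroker()
	calls := b.Bind("calls")    // e.g. a call-handling service
	store := b.Bind("calls")    // e.g. an event store recording everything
	_ = b.Bind("profiles")      // a service that only cares about profiles

	delivered := b.Publish("calls", "call_requested")
	fmt.Println(delivered, <-calls, <-store)
}
```

Each service owns its queue, so a slow consumer delays only itself, and publishers never need to know who is listening.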

The desktop application itself is actually written mainly in JavaScript and uses a project by GitHub called Electron to hook into the system where needed. We’ve been able to achieve some really crazy stuff with this in a short amount of time!

Development & Deployment

We run everything locally in Docker for development; those of us on Mac also need to use boot2docker and Vagrant to enable this. Onboarding new folks into our team is obviously much quicker using this setup, and we also get a lot of day-to-day wins when the setup changes - which tends to be quite often in the early stages!

Right now we can (and have to) run the entire stack locally to work efficiently. This is quite the load on our development machines so we’ll soon be making this more intelligent, perhaps running more of the development media routing on AWS too.

We use CircleCI for continuous integration and deployment, automatically triggered from GitHub hooks when code is pushed and merged. After trying a couple of providers we found CircleCI to be the cleanest and quickest CI tool - they also have a generous free plan, which helps a lot when your company is just getting off the ground.

For issue tracking, we use GitHub Issues extensively for each microservice. Managing and tracking these can be tricky at times, so assigning issues to individuals is key. Although our product is largely closed source, we use GitHub Issues to allow our users to submit bug reports directly, which is great as they can then have an open discussion with our team and get informed when the bug is squashed!

Everything is monitored with New Relic and our logs are aggregated into Logentries. Sentry is used for tracking errors across all platforms - we enjoy the clean interface and GitHub integration here!

For provisioning, we use Ansible, which is dead-simple, and has the advantage of not requiring a central server to host configuration directives. Since we’re using microservices, being able to easily handle and test configuration changes is critical.

Upcoming Challenges

The biggest challenges we have are around our media routing and audio mixing infrastructure. Although we use Route53 to gather the ideal server for everyone in an organization, we must make sure that everyone in a call connects to the same server so their audio can be mixed together. As we have more and more teams distributed around the world, keeping performance and quality of calls high is a top priority, and this will inevitably take up a good portion of our small team’s time.
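
One way to think about the "same server for everyone" constraint is as a small optimisation problem. The sketch below is an assumption made for illustration, not a description of how Speak picks servers: given a measured latency from each participant to each candidate server, it picks the server with the lowest worst-case latency. `PickServer` and the sample figures are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// PickServer returns the server that minimises the worst latency (in ms)
// seen by any participant. latencies maps participant -> server -> ms.
// Servers lacking a measurement for some participant are skipped.
// The boolean is false when no server qualifies.
func PickServer(latencies map[string]map[string]int) (string, bool) {
	// Collect candidate servers, sorted so that ties resolve deterministically.
	seen := map[string]bool{}
	for _, perServer := range latencies {
		for s := range perServer {
			seen[s] = true
		}
	}
	servers := make([]string, 0, len(seen))
	for s := range seen {
		servers = append(servers, s)
	}
	sort.Strings(servers)

	best, bestWorst, found := "", 0, false
	for _, s := range servers {
		worst, complete := 0, true
		for _, perServer := range latencies {
			ms, ok := perServer[s]
			if !ok {
				complete = false
				break
			}
			if ms > worst {
				worst = ms
			}
		}
		if complete && (!found || worst < bestWorst) {
			best, bestWorst, found = s, worst, true
		}
	}
	return best, found
}

func main() {
	// Made-up measurements for a two-person call.
	latencies := map[string]map[string]int{
		"london":        {"us-east-1": 90, "eu-west-1": 15},
		"san-francisco": {"us-east-1": 70, "eu-west-1": 150},
	}
	server, ok := PickServer(latencies)
	fmt.Println(server, ok) // us-east-1 true
}
```

Minimising the worst case favours the participant with the poorest connection; minimising the average instead would be an equally reasonable choice with different trade-offs.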

Right now we are also manually managing instances in individual regions. Of course this isn’t particularly scalable; however, the plan is to start using Docker in production in the same way that we do in development and change over to an autoscaling system as soon as our scale demands it.
