Dubsmash: Scaling To 200 Million Users With 3 Engineers

By Tim Specht, Co-Founder and CTO, Dubsmash.

Dubsmash is the global internet phenomenon that lets you be anyone you want to be. By adding famous movie, television, and internet quotes to your camera, you can create fun, memorable, meaningful videos to share with the world. We’ve reached over 350 million installs, 6 billion videos created so far, and tens of thousands of daily downloads across 192 countries, with Jimmy Fallon, Rihanna (RIHANNA!), Cara Delevingne, and an unmanageable backlog of A-list celebrities (along with regular people like you and me) creating dubs and showing the world who they are. (If you haven't used it, check out our Instagram or just click the gif below to see how it works.) Dubsmash is redefining the relationship between people and the content they love. Oh, and it’s addictive as hell.

As Co-Founder and CTO my main focus is setting the technical direction and building the engineering team—making sure the product development is going as quickly and smoothly as possible to ensure we put a smile on our users’ faces on a daily basis.

The team

Our engineering team is intelligent, scrappy, and experimental. With only three engineers, we have to be! We have all the necessary skills and capabilities to build core applications in-house, from Android, iOS, and frontend development through backend engineering to infrastructure. While we usually try to work in cross-functional teams, we sometimes break out into platform-specific tasks when needed. We aim to be an intelligent version of scrappy and don’t try to reinvent the wheel: if someone out there offers a hosted solution to one of our problems with great technology under the hood, we are usually all up for it!

As the engineering team is still quite small, everyone wears multiple hats every day, tackling different problems. This is certainly a fun and entertaining challenge, and it means every single one of us gets to impact our huge user base in very direct and meaningful ways.

We just finished relocating the team from Berlin, Germany to New York. We came to the realization a few months ago that we weren’t just a mobile app, but a content-centric platform that needs to live across the web and other applications. You have to radically rethink how you do business, and how you build and architect your systems.

We’re currently 3 engineers and 12 people in total, but are looking to expand to 9 engineers over the next couple of months. Key positions include a VP Engineering, a Senior React Engineer, Product Managers, and iOS / Android Engineers. If you want to join our team, drop us an application!

How it all started

Dubsmash started off as a simple mobile application containing all the funny Quotes you could imagine. We consider a Quote a short, memorable, exciting part of a movie, TV show or any other part of the quotable internet. Since our initial launch we have transformed Dubsmash into an interactive video Quote database that enables you to interact with those Quotes, dub (i.e. lip-sync) them and, most importantly, share them with your friends! We are still very actively working on our product to make sure it’s the one-stop shop to discover, enjoy and share the funny little moments every one of us loves so much :)

The app in the beginning was simply downloading a JSON file from S3 containing the Quote metadata. This file was updated & uploaded to S3 by hand every time we had new content available; we would simply put in the URL to the sound file, the name of the Quote, and re-upload the file. We chose this really simple mechanism to avoid having to bootstrap a custom API to distribute the content to the clients. This turned out to be a great business decision as well, since we didn’t need to worry at all about any scaling issues in the beginning; this became an even better call a couple weeks after the initial launch.
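To make the mechanism concrete, here is a minimal sketch of what such a hand-maintained manifest and its update step could look like. The field names ("name", "sound_url"), the bucket URL, and the helper functions are illustrative assumptions, not the original file format.

```python
import json
from typing import Dict, List

# Hypothetical manifest, shaped like the hand-edited file described above.
# Field names and the URL are assumptions made for illustration only.
MANIFEST_JSON = """
{
  "quotes": [
    {"name": "Example quote",
     "sound_url": "https://example-bucket.s3.amazonaws.com/sounds/example.m4a"}
  ]
}
"""


def parse_manifest(raw: str) -> List[Dict[str, str]]:
    """Return the quotes in the manifest, skipping entries missing a field."""
    data = json.loads(raw)
    quotes = []
    for entry in data.get("quotes", []):
        if "name" in entry and "sound_url" in entry:
            quotes.append({"name": entry["name"], "sound_url": entry["sound_url"]})
    return quotes


def add_quote(raw: str, name: str, sound_url: str) -> str:
    """Mimic the manual update: append one quote and re-serialize the file."""
    data = json.loads(raw)
    data.setdefault("quotes", []).append({"name": name, "sound_url": sound_url})
    return json.dumps(data, indent=2)
```

The appeal of this design is that the only moving part is a static file: S3 serves it, so there is no API server to scale.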

Jenkins ran on every code push for both Android & iOS, executing custom-written shell scripts that used xcodebuild and gradle. We used Google Analytics to track user and market growth and Pushwoosh to send out push notifications by hand to promote new content. Even though we didn’t localize our pushes at all, we added custom tags to devices when registering with the service so we could easily target certain markets (e.g. send a push to German users only), which was totally sufficient at the time.

Since we were always big supporters of an “iterate quickly & learn things” mentality we didn’t even submit our mobile applications to the respective stores until we were tens of thousands of users in. We used our own simple web page to distribute the APK files directly and a separate Enterprise account for iOS devices to distribute the applications to our users (gotta love iOS 8!). Users would go on our own small, static HTML site, download the APK or IPA file and could run the application right away. Once we pushed updates we had a small alert popping up inside the application that would ask the users to update to the latest version automatically. This enabled us to ship updates multiple times a day and increase the speed of product iterations dramatically. Within minutes of the code being written, our users would already have it on their phones and start using it!

From prototype to 200 million users around the globe

Our stack has gone through quite an evolution since we shipped the first version two years ago. Not only has Dubsmash seen immense user growth (well over 350 million installations), but we have also added dozens of features to serve those users. Naturally, this forced us to invest in a more advanced architecture on both the frontend and the backend.

Overview

Dubsmash mobile applications are still written natively in Java on Android and Swift on iOS. After a transition phase of over a year having both Objective-C and Swift side-by-side in our iOS application, we made the switch to Swift-only a couple of months ago. Right after Swift was initially released by Apple in 2014 we tried it out—and were big fans of features like Optionals, Generics and Extensions right away! We found it to have a very nice learning curve and were able to quickly write our code in a much nicer, more concise, and safer way compared to Objective-C and wouldn’t want to miss it in our toolbox nowadays!

While we are actively observing the developments around Kotlin on Android, we haven’t made the switch yet and will most likely follow a hybrid approach for the next couple of months.

The backend systems consist of 10 services (and rising …), written primarily in Python with a couple in Go and Node.js. The Python services are written using Django and Django REST Framework, since we have collected quite a lot of domain knowledge over the years and Django has proven itself a reliable tool for quickly building out new services and features. We recently started to add Flask to our Python stack to support smaller services running on AWS Lambda using the (awesome) Zappa framework. While we could talk about our stack for days on end, some things are worth pointing out:

Infrastructure

A microservice architecture usually requires services to communicate with each other in certain situations; hence, we implemented both simple HTTP-based channels (e.g. internal APIs) and task-based communication channels on top of Celery and RabbitMQ for inter-service communication.

We started using Go as the language of choice for high-performance services a little over a year ago, and both our authentication services and our search service run on Go today. To build web applications we use React, Redux, Apollo, and GraphQL.

Since we deployed our very first lines of Python code more than 2 years ago, we have been happy users of Heroku: it lets us focus on building features rather than maintaining infrastructure, has super-easy scaling capabilities, and the support team is always happy to help (in the rare case you need them). We played with the thought of moving our computational needs over to bare EC2 instances or a container-management solution like Kubernetes a couple of times, but the added costs of maintaining this architecture and the ease of use of Heroku have kept us from moving forward so far.

Running independent services for the different needs of our features gives us the flexibility to choose whatever data storage is best for the given task. Over the years we have added a wide variety of storage systems to our stack, including Postgres (some hosted by Heroku, some by RDS) for storing relational data, DynamoDB for non-relational data like recommendations & user connections, and Redis to hold pre-aggregated data to speed up API endpoints. Since we started running Postgres ourselves on RDS instead of only using the managed offerings of Heroku, we've gained additional flexibility in scaling our application while reducing costs at the same time. We are also heavily testing Aurora in its Postgres-compatible version and will give the new release of Aurora Serverless a try!

Search

Although we were using Elasticsearch in the beginning to power our in-app search, we moved this part of our processing over to Algolia a couple of months ago; this has proven to be a fantastic choice, letting us build search-related features with more confidence and speed. Elasticsearch is only used for searching in internal tooling nowadays; hosting and running it reliably has been a task that took up too much time for us in the past and fine-tuning the results to reach a great user-experience was also never an easy task for us. With Algolia we can flexibly change ranking methods on the fly and can instead focus our time on fine-tuning the experience within our app.

Memcached is used in front of most of the API endpoints to cache responses in order to speed up response times and reduce server-costs on our side.
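The caching pattern described here is commonly implemented as "cache-aside": look up the response by key, and only build it when the cache misses. The sketch below illustrates that pattern under stated assumptions; the TTLCache class is a small in-process stand-in for Memcached, and all names and the TTL handling are hypothetical rather than taken from the Dubsmash codebase.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process stand-in for Memcached: get/set with an expiry time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries are dropped lazily on read
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def cached_response(cache: TTLCache, key: str, ttl_seconds: float,
                    build_response: Callable[[], Any]) -> Any:
    """Return the cached response for `key`, or build it once and cache it.

    Note: a response of None is treated as a miss and is never cached.
    """
    cached = cache.get(key)
    if cached is not None:
        return cached
    response = build_response()
    cache.set(key, response, ttl_seconds)
    return response
```

With a short TTL, an endpoint that is hit thousands of times per second only reaches the database once per expiry window, which is where the response-time and server-cost savings come from.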

Video upload

In the early days of features like My Dubs, which enable users to upload their Dubs onto our platform, uploads went directly through our API, which then stored the files in S3. We quickly saw that this approach was hurting our API performance big time. Since users usually have slower internet connections on their phones, uploading the file took up a huge percentage of the processing time on our end, forcing us to spin up way more machines than we actually needed. We have since moved to a multi-step, handshake-like upload process: the client requests a signed URL from our API and then uploads the file directly to S3. These files are then distributed, cached, and served back to other clients through CloudFront.
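The idea behind a signed upload URL can be sketched with a plain HMAC. This is a simplified illustration of the concept only: it is not AWS's actual signing scheme, and in practice you would ask the S3 SDK for a presigned URL instead. The secret, host name, and function names below are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical; real S3 signing uses AWS credentials


def _signature(key: str, expires_at: int, secret: bytes) -> str:
    """HMAC over the HTTP method, object key, and expiry timestamp."""
    message = f"PUT\n{key}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_upload_url(base_url: str, key: str, expires_at: int,
                    secret: bytes = SECRET) -> str:
    """API side: hand the client a URL valid for one key until `expires_at`."""
    query = urlencode({"expires": expires_at,
                       "signature": _signature(key, expires_at, secret)})
    return f"{base_url}/{key}?{query}"


def verify_upload_url(url: str, now: int, secret: bytes = SECRET) -> bool:
    """Storage side: accept the upload only if signature and expiry check out."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    key = parsed.path.lstrip("/")
    expected = _signature(key, expires_at, secret)
    return hmac.compare_digest(signature, expected) and now < expires_at
```

The API only does a cheap signing step, while the slow mobile upload ties up the storage service's connections rather than application servers.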

Notifications

Whenever we need to notify a user of something happening on our platform, whether it’s a personal push notification from one user to another, a new Dub, or a notification going out to millions of users at the same time that new content is available, we rely on AWS Lambda to do this task for us. When we started implementing this feature 2 years ago we were luckily able to get early access to the Lambda Beta and are still happy with the way things are running on there, especially given all the easy-to-set-up integrations with other AWS services. Lambda enables us to quickly send out millions of pushes within a couple of minutes by acting as a multiplexer in front of SNS. We simply call a first Lambda function with a batch of up to 300 push notifications to be sent; it splits the batch and calls a second Lambda function with 20 pushes per invocation, which in turn calls SNS to actually send out the push notifications. This multi-tier process of sending push notifications enables us to quickly adjust our sending volume while keeping costs & maintenance overhead on our side to a bare minimum.
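The two-tier fan-out can be sketched in a few lines. The batch sizes (300 and 20) come from the description above; the function names and the `publish` / `invoke_second_tier` callables are hypothetical stand-ins for the SNS call and the Lambda-to-Lambda invocation, so this is an illustration of the splitting logic rather than the production code.

```python
from typing import Any, Callable, List, Sequence

FIRST_TIER_BATCH = 300   # from the text: max pushes per first-tier invocation
SECOND_TIER_BATCH = 20   # from the text: pushes per second-tier invocation


def chunked(items: Sequence[Any], size: int) -> List[List[Any]]:
    """Split `items` into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def second_tier_handler(pushes: Sequence[Any],
                        publish: Callable[[Any], None]) -> int:
    """Send each push via `publish` (stand-in for the SNS call)."""
    for push in pushes:
        publish(push)
    return len(pushes)


def first_tier_handler(pushes: Sequence[Any],
                       invoke_second_tier: Callable[[List[Any]], Any]) -> int:
    """Split one batch into chunks and hand each to a second-tier invocation.

    Returns the number of second-tier invocations that were started.
    """
    if len(pushes) > FIRST_TIER_BATCH:
        raise ValueError("first tier accepts at most 300 pushes per call")
    chunks = chunked(pushes, SECOND_TIER_BATCH)
    for chunk in chunks:
        invoke_second_tier(chunk)
    return len(chunks)
```

A full first-tier batch of 300 pushes therefore fans out into 300 / 20 = 15 second-tier invocations, and sending volume scales by changing how many first-tier calls are made in parallel.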

Feeds and Trending

As already mentioned, Dubsmash's very small engineering team has always made a point of spending its resources on solving product questions rather than managing & running underlying infrastructure. We recently started using Stream for building activity feeds in various forms and shapes. Using Stream we are able to rapidly iterate on features like newsfeeds, trending feeds and more while making sure everything runs smoothly and snappily in the background. With their advanced ranking algorithms and their recent transition from Python to Go, we are able to change our feeds ranking on the fly and gauge user impact immediately!

Scaling in-house analytics to millions of events per minute

The engineering team at Dubsmash has always made it a top priority to ship features vs. spending too much time on running and maintaining infrastructure. While this approach benefits us hugely most of the time, we still need to revisit certain architectural decisions from time to time to make sure things continue flowing nicely through our systems.

In order to accurately measure & track user behaviour on our platform we moved over quickly from the initial solution using Google Analytics to a custom-built one due to resource & pricing concerns we had. While this does sound complicated, it’s as easy as clients sending JSON blobs of events to AWS Kinesis from where we use Lambda & SQS to batch and process incoming events and then ingest them into Google BigQuery. Once events are stored in BigQuery (which usually only takes a second from the time the client sends the data until it’s available), we can use almost-standard-SQL to simply query for data while Google makes sure that, even with terabytes of data being scanned, query times stay in the range of seconds rather than hours.

Before ingesting their data into the pipeline, our mobile clients aggregate events internally and, once a certain threshold is reached or the app goes to the background, send the events as a JSON blob into the stream. In the past we had workers running that continuously read from the stream, validated and post-processed the data, and then enqueued it for other workers to write to BigQuery. However, we discovered after some time that the custom Python implementation for those workers was dropping up to 5% of the events. This was mostly due to the nature of how reading happens with Kinesis: every stream has multiple shards (ours up to 50!) and each reading client uses a so-called shard iterator to keep track of where it was reading last. Since the machines could always crash, be recycled, or be scaled down, we needed to save those shard iterators in some serialized format to Redis and share them across machine and process boundaries. Since we had so many shards, every once in a while we would skip events and hence lose them.
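The client-side batching described at the start of this section can be illustrated with a small buffer. The real clients are native mobile apps, so this Python version is only a sketch of the logic; the class name, the threshold of 50, the payload shape, and the `send` callable are all assumptions.

```python
import json
from typing import Any, Callable, Dict, List, Optional


class EventBuffer:
    """Collect events and flush them as one JSON blob.

    A flush happens when `threshold` events have accumulated or when the
    app reports that it is going to the background.
    """

    def __init__(self, send: Callable[[str], None], threshold: int = 50) -> None:
        self._send = send            # stand-in for the call into the stream
        self._threshold = threshold  # hypothetical value
        self._events: List[Dict[str, Any]] = []

    def track(self, name: str, properties: Optional[Dict[str, Any]] = None) -> None:
        self._events.append({"name": name, "properties": properties or {}})
        if len(self._events) >= self._threshold:
            self.flush()

    def on_background(self) -> None:
        self.flush()

    def flush(self) -> None:
        if not self._events:
            return
        payload = json.dumps({"events": self._events})
        self._send(payload)
        # Clear only after a successful send, so a failed send keeps the events.
        self._events = []
```

Batching like this trades a little latency for far fewer network calls, which matters on mobile connections.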

In order to fix this behavior, we opted to switch the reading to AWS Lambda once it started supporting triggers directly from Kinesis. We were hoping that this would not only fix the lost-event issue but also significantly simplify the infrastructure needed on our side to read from the stream. We implemented the Lambda-based approach in such a way that Lambda functions are automatically triggered for incoming records, pre-aggregate events, and write them to SQS, from which we then read them and persist the events to BigQuery. While this approach had a couple of bumps in the road, like re-triggering functions asynchronously to keep up with the stream and finding proper batch sizes, we finally managed to get it running in a reliable way and are very happy with this solution today. You can read more on the technical details including some code examples here.
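As a rough illustration of the Lambda side, the sketch below decodes a Kinesis-triggered event (Lambda delivers each record's payload base64-encoded under `Records[].kinesis.data`), merges the client blobs into batches, and hands each batch to an `enqueue` callable standing in for the SQS write. The blob shape `{"events": [...]}`, the batch size, and the reading of "pre-aggregate" as "merge into batches" are assumptions, not details confirmed by the text.

```python
import base64
import json
from typing import Any, Callable, Dict, List

BATCH_SIZE = 100  # hypothetical number of events per queue message


def decode_records(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode base64 Kinesis payloads and collect the contained events."""
    events: List[Dict[str, Any]] = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            blob = json.loads(raw)
        except ValueError:
            continue  # skip payloads that are not valid JSON
        if isinstance(blob, dict):
            events.extend(blob.get("events", []))
    return events


def handler(event: Dict[str, Any], context: Any,
            enqueue: Callable[[str], None] = print) -> Dict[str, int]:
    """Entry point: merge incoming events into batches and enqueue each one.

    `enqueue` stands in for sending a message to SQS; it defaults to print
    so the sketch runs without any AWS dependency.
    """
    events = decode_records(event)
    for start in range(0, len(events), BATCH_SIZE):
        enqueue(json.dumps({"events": events[start:start + BATCH_SIZE]}))
    return {"records": len(event.get("Records", [])), "events": len(events)}
```

Because Lambda tracks the read position per shard for the trigger, the function itself no longer has to persist shard iterators, which was the source of the dropped events in the worker-based setup.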

Continuous Integration & Deployment at Dubsmash

On the backend side we started using Docker almost 2 years ago. While in the beginning we used it mostly to ease local development, we soon started running our whole CI & CD pipeline on top of it as well. This not only enabled us to speed things up drastically locally by using docker-compose to spin up different services & dependencies and make sure they can talk to each other, but also made sure that we had reliable builds on our build infrastructure and could easily debug problems using the baked images in case anything should go wrong. Using Docker was a slight change in the beginning, but we ultimately found that it forces you to think through how your services are composed and structured and thus improves the way you structure your systems. Looking back, this was absolutely the right decision, as running things manually with so many services and so few engineers wouldn’t have been possible at all.

We use Buildkite to run our tests on top of elastic EC2 workers (which get spun up using CloudFormation as needed) and Quay.io to store our Docker images for local development and other later uses.

Our code is hosted on GitHub. Once a commit is pushed to one of our many repositories, it’s automatically picked up by our fleet of CI workers. Style checks are run on the code, along with an extensive test suite covering all areas of the service. Coverage data is collected while running the tests so we can keep track of whether a specific change is missing crucial tests.

Once a pull request is reviewed and approved, it’s automatically deployed to our staging environment where it’s tested for feature-completeness by both the developer and the PM. Once those manual last checks are done, the commit is automatically deployed to production without any further buttons needing to be pressed or scripts to be executed.

Our engineering team has always made ownership a crucial part of its culture: developers are responsible for their work from start to end. This includes not only developing but also QAing and deploying the changes. Hence we don’t need a dedicated QA team or infrastructure team. If anything should go wrong, we can easily roll back changes with just one click and revisit a breaking change at any time. Since we are still a small team that is serving millions of users across the world every single day, we've invested heavily in automation. For quite some time we have been running all our deployments completely automatically after they pass CI; we have tight alerting set up on all our systems and services and can detect performance regressions automatically.

Not only has the deployment pipeline on the backend side significantly improved, but the deployment process on mobile has also become a lot more advanced and powerful over the years. While we used Jenkins in the beginning in conjunction with custom-written build scripts, we have since moved over to fastlane, not only for building our applications but also for managing certificates locally & other useful tasks. We also switched over to running our builds on Buddybuild after using dedicated Mac Minis in our office for some time. The same principle of trying to minimize the amount of time spent on infrastructure-related tasks applies here as well. The one thing that has not changed over the years, though, is the use of Crashlytics (now Fabric): since day one we have used the platform for both crash reporting and shipping betas, and we still do so today. Every time a new beta build or even just a feature-branch-based build is finished, it automatically gets uploaded and distributed to either the Dubsmash team only or our external beta tester group.

Future challenges

Dubsmash is continuously investing in new technologies and ways of shipping our products to the users. We are currently looking into more use-cases for serverless architectures, deepening our knowledge of React and related technologies, and starting to analyze CloudFront logs to gain accurate viewership counts, as well as working on recommender systems to further enhance the ways our users can explore and enjoy the content they love on Dubsmash.

Since starting Dubsmash almost three years ago, we have been through a crazy journey of development on both the technical and the team side. Working on making the quotable internet accessible to everyone worldwide is an exciting challenge that has many interesting engineering problems left to solve, which keeps us excited to come to work every single day!

If you want to be part of our kick-ass team and join us on our next big step, come join us!
