How Authy Built A Fault-Tolerant Two-Factor Authentication Service

Editor's note: Daniel Palacio is Founder at Authy. He holds a BS in Computer Science and Management from Purdue University.


Authy is solving an increasingly important problem: making security simple and reliable for consumers and developers. Consumers are recognizing the value of strong authentication and security for web apps they're using. Developers traditionally have not bothered with the technology because of the added complexity. Authy was created to change that and make two-factor authentication available to everyone. They're well on their way with companies like DuckDuckGo and Coinbase using their service. As a response to recent security breaches, Authy even decided to open source their two-factor authentication VPN tool because they want more people using secure systems. On the heels of their newest product release, we sat down with Daniel Palacio, Founder of Authy, to talk about how they built their service.


Authy's Tech Stack

Languages, Frameworks, & Open Source Tools
Ruby, Sinatra, Padrino, Chef, OpenVPN
Cloud Services
Amazon EC2, Amazon S3, GitHub, CloudFlare, Mailgun, Splunk Storm, Stripe, Twilio, BulkSMS, Crittercism, Blossom, HipChat



LS: So how did the idea for Authy first come about?

D: I left Microsoft's security team around June 2011, and about 6-8 months later I was thinking about what to do next in the security space. My idea was to build something that combined a really good user experience with robust security technology. There are a lot of security companies for the enterprise out there, but consumers were really behind. So I wanted to build a company that solved some of my problems as a consumer. My biggest problem at the time was passwords. I had a bunch of passwords, and I kept trying to change them and coming up with schemes to make them very secure. But that was a real pain, and I thought two-factor authentication was really cool. The only problem at the time was that two-factor authentication was also a real pain. I used to travel a lot to and from the US, so I had to change cellphones a lot. Every time I did, I would be logged out of all my accounts. I just wanted to create something very simple for me to use.

LS: And so this was before two-factor authentication was popular.

D: Yeah, it wasn't popular at the time. Some people were using it; Gmail had it, and a few others. So I just thought, if I make a very simple product for consumers and open an API to developers, I bet a lot of people would want it.

LS: Right. So this is a problem that everyone has had: password overload. The issue is that it's been so difficult to take advantage of better security systems like two-factor auth that people don't even bother. So your consumer approach makes perfect sense.

D: There were a bunch of products for companies. Like if you were a company and wanted to buy two-factor authentication for your employees, there were a bunch of solutions. So I didn't think there was a pressing need for another enterprise solution. But there was a big need for consumers and end-users.

LS: Let's talk about the technology behind Authy, starting from the bottom of the stack.

D: The initial version was just a Sinatra API. I really like Ruby, and I wanted to create a really simple and fast API. We chose Sinatra because, at the time, the alternatives were Rails and I think something else called Grape. Grape was simple but it wasn't very robust, and Rails was anything but simple; I just didn't like Rails too much at the time. In fact, we use a lot of Ruby today, but there's no Rails in our stack. There's a demo done in Rails, but no actual Rails code in production. Sinatra, on the other hand, I loved. It was and is perfect: robust and lightweight. The initial version of our API was Sinatra plus only about 40 lines of Ruby code. It was very light.
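To give a sense of how small that kind of service can be, here is a minimal sketch of a Sinatra token-verification endpoint in that spirit. The route, parameters, and in-memory token table are made up for illustration; this is not Authy's actual code.

```ruby
# Minimal Sinatra sketch: one JSON endpoint that checks a user's current token.
require 'sinatra'
require 'json'

# Demo data only: user id => the token that is currently valid for that user.
TOKENS = { '42' => '0937456' }.freeze

get '/verify/:token/:user_id' do
  content_type :json
  valid = TOKENS[params[:user_id]] == params[:token]
  { success: valid, message: valid ? 'Token is valid.' : 'Token is invalid.' }.to_json
end
```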

So we had the Sinatra codebase for the API, and then on top of that I started building mobile apps. We decided to go native on every device: Java for Android and BlackBerry, Objective-C for iOS.

So then we wanted to build a dashboard tool to administer the whole thing. We decided to build it on something called Padrino. Padrino is like a lightweight version of Rails and a heavy version of Sinatra. Sinatra comes with almost no defaults; Padrino is like taking Sinatra, putting some gems on top of it, and making them work well together. So it sits in between Sinatra and Rails. We built authy.com on Sinatra and dashboard.authy.com on Padrino. The neat thing is that we built all of these apps using the API, so we're very focused on dogfooding it. We thought we should build the dashboard as if we were an Authy customer, using only the public APIs. That's how we function: we have one person building the API and another building the dashboard, but they don't really need to talk to each other; everything is done via public APIs. The dashboard is completely reliant on the API and has its own API keys, just like any other customer. Same with the website: it has its own API key. So if the website were to get hacked, the impact on the API would be almost negligible, because it would be as if anyone had an API key, which anyone can get for free anyway. So that's been good from a security perspective and has helped us dogfood our APIs.
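As a rough illustration of that dogfooding setup, a dashboard page would fetch everything it displays through the same public API any customer calls, authenticated with its own key. The host, path, and environment variable below are placeholders, not Authy's real endpoints.

```ruby
# Hypothetical sketch of "dogfooding": the dashboard is just another API client.
require 'net/http'
require 'json'
require 'uri'

# The dashboard's own API key, provisioned like any other customer's.
DASHBOARD_API_KEY = ENV.fetch('AUTHY_DASHBOARD_API_KEY')

def api_get(path)
  uri = URI("https://api.example.com#{path}?api_key=#{DASHBOARD_API_KEY}") # placeholder host/path
  JSON.parse(Net::HTTP.get(uri))
end

# The dashboard never touches the database directly; it only sees what the
# public API exposes, so compromising it is no worse than holding an API key.
users = api_get('/users')
puts "Dashboard sees #{users.size} users"
```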

So about 4-5 months after we launched, we were deployed on Amazon Web Services. The first version was just one machine on AWS running Unicorn with NGINX in front of it. So NGINX and Unicorn on the same machine, along with PostgreSQL and our app. It turns out having one machine is not a good idea, so we started designing an API that had no single point of failure. We got help from a friend of mine who used to work at Engine Yard. The idea was that there would be three layers. One layer was the HTTP/web service layer, which would run NGINX and HAProxy; this layer would terminate SSL and load balance the HTTP connections. The second layer was the application layer, which would run our Sinatra app using Unicorn. The third layer was the database layer. Lastly, for security reasons, both the application and database layers would sit behind the firewall inside a VPC, so they wouldn't be connected to the Internet and only the web layer machines could talk to them.

And we decided to have three machines per layer. So if one failed, the other two would take over; if two failed, one would take over. So overall, nine machines at minimum. The idea was that if one machine in the web service layer dies, DNS knows, takes it out, and it no longer appears in the DNS records. If one of the machines in the application layer dies, HAProxy detects it and takes it out of the pool so it doesn't get queried any more. And if a database goes down, the app knows and won't query it anymore. So automatic failover, all over the place.
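Since the whole setup is driven by Chef (more on that below), one web-layer machine could be described in a recipe roughly like this compressed, hypothetical sketch. The node attributes, template name, and layout are invented for illustration.

```ruby
# Hypothetical Chef recipe for one web-layer node: NGINX for SSL termination
# plus HAProxy health-checking the application machines behind it.
%w[nginx haproxy].each do |pkg|
  package pkg
end

template '/etc/haproxy/haproxy.cfg' do
  source 'haproxy.cfg.erb'                  # made-up template name
  variables(
    # The three app-layer machines; HAProxy marks a backend down when its
    # health check fails and stops routing traffic to it.
    app_servers: node['web_layer']['app_servers']
  )
  notifies :restart, 'service[haproxy]'
end

%w[nginx haproxy].each do |svc|
  service svc do
    action [:enable, :start]
  end
end
```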

So we started building that, and we failed. The reason we failed was that PostgreSQL was impossible to manage for automatic failover. We tried a bunch of products like pgpool, but they were slow and required a master-slave relationship, which meant that if the master failed we had to promote a slave to master, and that just never worked. We kept trying for about a month, doing a lot of tests. We would kill a slave database and see what happened, and most of the time it wouldn't work. Then we'd kill the master, and when failover did work it took about 20-30 minutes, but most of the time it simply didn't work. So we looked at other database technologies and we found MongoDB. The nice thing about Mongo is that it has this built-in mechanism called a replica set, where the databases talk to each other, so you can have multiple databases, like a pool, and they handle replication and everything automatically. There's no fixed master or slave; the data is replicated across all of them and it just "worked." We killed and added machines on the fly and it worked beautifully: very fast and very robust. We had to build a couple of things to make sure the data was consistent across the different databases, but other than that the failover worked great. And for an app like Authy, where uptime needs to be close to 100%, doing anything manually is impossible. If you calculate the time you can be offline and still have 99.9999% uptime, it's less than two minutes per year, so you need fast automatic failover. And so we moved all the databases to MongoDB and created this MongoDB replica set.
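For a sense of what the application side of a replica set looks like, here is a sketch with the current Ruby mongo driver (which postdates the setup described here): you point the client at every member, and the driver handles discovery and failover. Hostnames, the set name, and the database are placeholders.

```ruby
# Rough sketch with the modern `mongo` Ruby driver (2.x).
require 'mongo'

client = Mongo::Client.new(
  ['db1.internal:27017', 'db2.internal:27017', 'db3.internal:27017'],
  replica_set: 'authy-rs',            # placeholder replica set name
  database:    'accounts',            # placeholder database
  write:       { w: 'majority' }      # acknowledge writes on a majority of members
)

# If the member currently accepting writes dies, the driver fails over to
# another member automatically; the application code doesn't change.
client[:users].insert_one(email: 'user@example.com', country_code: '57')
```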

LS: So was everything living on Amazon?

D: Yeah, everything is hosted on Amazon. We decided on three machines per layer, and the idea was that in Amazon US East in Virginia, they have four availability zones, so each machine is in a different availability zone. We actually tried to go multi-region, and the problem with that is that the machines need public IPs to talk to each other. With Authy, the web layer has an Internet address, but the rest is in a private cloud behind a firewall, for security reasons. If you wanted to do multi-region, you'd have to take that private cloud and open it up. The other thing is, if you try to run a MongoDB cluster across different regions, things just start failing due to latency. I don't know why, it just fails. The latency is pretty big, so if a user creates a record in Virginia and you have to verify it against Japan, the record may not have arrived in Japan yet, and you get all these kinds of failures. They're really hard to diagnose and fix. So we decided that as long as we have three availability zones in EC2, the probability that we fail is very small.

At the same time, we were learning a lot about Chef. All of our infrastructure is programmed in Ruby using Chef, so you can deploy the whole infrastructure to a different region in about five minutes. To make things multi-region, we decided we would have a region in Japan ready to come up with about a five-minute delay. So let's say the whole EC2 Virginia region goes down: we could spin up Japan in one or two minutes, point the DNS to Japan, and work out of Japan. Of course, we would lose some records in that process, but it would be about five minutes of records, which wouldn't be that bad. Our goal is to never be offline for more than a minute or two. We've had about 99.99999% availability since we built this architecture. The only incident we had was our DNS service going down for about a minute. Having multiple DNS services at the same time is pretty hard, so we wrote a couple of scripts to keep the DNS records in sync between two DNS providers. We're using CloudFlare, which is the fastest one; it's really fast, really secure, and it has amazing DDoS protection. And as a backup we use DNSimple.
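Those DNS-sync scripts aren't public, but the core logic is simple enough to sketch. The provider classes below are in-memory stand-ins for real CloudFlare and DNSimple API clients; only the sync logic is the point.

```ruby
# In-memory stand-in for a real DNS provider client (CloudFlare, DNSimple, ...).
class StubDnsProvider
  def initialize(records = [])
    @records = records
  end

  def records(_zone)
    @records.dup
  end

  def create_record(_zone, record)
    @records << record
  end

  def delete_record(_zone, record)
    @records.delete(record)
  end
end

# Copy anything missing from the primary to the backup and drop stale records,
# so a failover to the backup provider serves exactly the same zone.
def sync_dns(primary, backup, zone)
  missing = primary.records(zone) - backup.records(zone)
  stale   = backup.records(zone) - primary.records(zone)
  missing.each { |r| backup.create_record(zone, r) }
  stale.each   { |r| backup.delete_record(zone, r) }
end

primary = StubDnsProvider.new([{ name: 'api', type: 'A', content: '203.0.113.10', ttl: 300 }])
backup  = StubDnsProvider.new
sync_dns(primary, backup, 'authy.com')
```

Run from cron every few minutes, a script like this keeps the backup provider ready to take over without anyone touching records by hand.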

LS: Wow. So you've built a super-fault tolerant architecture. This is pretty cool.

D: Yeah, the thing people don't realize is that not having a single point of failure is really, really hard. You have to look at your DNS, your email services, and pretty much every service you use. We send emails through Mailgun, but if there's an error from Mailgun, the email automatically gets sent through Google mail instead. So we have two different email providers. It's like that for pretty much everything. We send SMS through Twilio, but if that fails to send a message, we have three other providers that kick in automatically. So it's a very robust architecture. For every service we use, we have at least one backup at all times. So again, it's very hard not to have a single point of failure, but it's achievable.
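A hedged sketch of what that kind of email failover can look like with the Ruby mail gem, sending over SMTP: try the primary provider and fall back to the second on any error. The credentials and sender address are placeholders, and this is not Authy's actual code.

```ruby
# Email failover sketch: try each SMTP provider in order until one succeeds.
require 'mail'

SMTP_PROVIDERS = [
  { address: 'smtp.mailgun.org', port: 587,
    user_name: ENV['MAILGUN_SMTP_LOGIN'], password: ENV['MAILGUN_SMTP_PASSWORD'] },
  { address: 'smtp.gmail.com', port: 587,
    user_name: ENV['GMAIL_SMTP_LOGIN'], password: ENV['GMAIL_SMTP_PASSWORD'] }
].freeze

def send_with_failover(to:, subject:, body:)
  SMTP_PROVIDERS.each do |smtp|
    begin
      mail = Mail.new
      mail.from    = 'support@example.com' # placeholder sender
      mail.to      = to
      mail.subject = subject
      mail.body    = body
      mail.delivery_method :smtp, smtp.merge(authentication: :plain, enable_starttls_auto: true)
      return mail.deliver!                       # stop at the first provider that works
    rescue StandardError => e
      warn "#{smtp[:address]} failed: #{e.message}" # fall through to the next provider
    end
  end
  raise 'All email providers failed'
end
```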

LS: Have you needed to use the backup services very often?

D: Not really. The reason we have so many SMS providers is that sending an SMS to Colombia is different from sending one to Argentina. Maybe Twilio is great for Colombia but not for Argentina, for instance. So we try to send each message through the service that we think has the highest probability of successful delivery. BulkSMS is really good for sending SMS to Latin America.

LS: So you guys have had to write a lot of different scripts to have this automatic failover. But sometimes you know to go directly to another service, like BulkSMS.

D: Right now, if you were to look at our private GitHub, we have something like 48 different gems, and they handle everything. We have a gem that handles SMS, and it's super intelligent: it will kick off an SMS, and if it wasn't delivered after 10 seconds, it will send it through another service. Very robust.
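The skeleton of a gem like that might look something like the sketch below. The provider objects are hypothetical adapters (each wrapping a real gateway such as Twilio or BulkSMS) that expose the same two methods; the interface and timeout handling are assumptions, not Authy's code.

```ruby
# Hypothetical SMS failover: each provider adapter responds to #send_sms and
# #delivered?; move on to the next provider if delivery isn't confirmed in time.
class SmsSender
  DELIVERY_TIMEOUT = 10 # seconds to wait before trying the next provider

  def initialize(providers)
    @providers = providers # ordered best-first for the destination country
  end

  def deliver(phone, message)
    @providers.each do |provider|
      id = provider.send_sms(phone, message)
      return id if wait_for_delivery(provider, id) # confirmed: stop here
    end
    raise "SMS to #{phone} failed on every provider"
  end

  private

  def wait_for_delivery(provider, id)
    deadline = Time.now + DELIVERY_TIMEOUT
    until Time.now > deadline
      return true if provider.delivered?(id)
      sleep 1
    end
    false
  end
end
```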

LS: So how long did it take you to set up this architecture?

D: The first version of the hosting setup took us two months, coding it in Chef. From there it took us two to three weeks to build the SMS gem. Email was pretty simple since it's only two providers. And for the past year we've been optimizing things; there's still a lot we can optimize. We're doing a lot of data science on SMS: downloading all the SMS logs from every provider, looking at every error we received and how long each message took to arrive.

LS: So we have hosting, DNS, email, SMS. Are there any other pieces of your architecture that you've set up failover for?

D: The firewall is also very automated. If we receive attacks, it will detect them and automatically update our firewall rules on Amazon to block that person. That's very important for us because we used to get a lot of scans. So we started looking at which IPs are hitting the API and how many times, and it blocks people off automatically.
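One way this kind of automatic blocking can be wired up, not necessarily how Authy does it, is to add a deny rule to the VPC's network ACL through the AWS SDK for Ruby. The ACL ID, region, and rule-numbering scheme below are placeholders.

```ruby
# Sketch: push a DENY entry for an abusive IP to the VPC network ACL.
require 'aws-sdk-ec2'

def block_ip(cidr, rule_number:)
  ec2 = Aws::EC2::Client.new(region: 'us-east-1')
  ec2.create_network_acl_entry(
    network_acl_id: 'acl-0123456789abcdef0', # placeholder ACL protecting the web layer
    rule_number:    rule_number,             # lower numbers are evaluated first
    protocol:       '-1',                    # all protocols
    rule_action:    'deny',
    egress:         false,                   # inbound traffic
    cidr_block:     cidr
  )
end

block_ip('198.51.100.23/32', rule_number: 100)
```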

We send a lot of data over to Splunk; we've instrumented pretty much the whole API. We look at the logs, and you can write data parsers for Splunk, so we parse the data automatically and we've built dashboards that tell us which people are attacking the API, from what IPs, and what is failing and what isn't. We look at those dashboards constantly to try to improve things.

LS: What about CDN?

D: We use CloudFlare for that too. Some of the API responses always return something different, so we couldn't use CloudFlare for those. But for our website and blog, we use CloudFlare.

LS: So you're basically saying people should use Authy to secure their systems, and you back that up with a robust architecture that's not only secure but also super reliable and fault tolerant.

D: Absolutely. We put a lot of priority on security, so even if we get hacked, we have all these processes in place to minimize the damage. Our goal is to reduce the impact of any compromise to almost zero.

LS: Can you talk a little bit about how Authy actually works, the whole end-to-end process?

D: The whole system relies on two APIs: one to add users, the other to verify a token from a user. The first part is the web application sending us a cell phone number and an email address. That returns an ID, which is unique per user. It's a public ID that anyone should be able to see; there's nothing inherently secret about it, it's just a consecutive serial ID. With that ID, the application can ask whether a given token is valid at the time of the query: send us a token and an ID, and we'll tell you if it's valid or not. And that's the whole idea behind two-factor authentication.
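From an application's point of view, those two calls can be sketched with plain Net::HTTP as below. The host, paths, parameter names, and response fields are illustrative stand-ins rather than the exact Authy API, so consult the real documentation instead of copying them.

```ruby
# Sketch of the two calls a web application makes: register a user, verify a token.
require 'net/http'
require 'json'
require 'uri'

API_KEY = ENV.fetch('AUTHY_API_KEY')
BASE    = 'https://api.example.com' # placeholder for the API host

# 1. Register a user with an email and cell phone; the response carries the
#    public, non-secret user id that the application stores.
def register_user(email, country_code, phone)
  uri = URI("#{BASE}/users/new?api_key=#{API_KEY}")
  res = Net::HTTP.post_form(uri,
    'user[email]'        => email,
    'user[country_code]' => country_code,
    'user[cellphone]'    => phone)
  JSON.parse(res.body).dig('user', 'id')
end

# 2. At login time, ask whether the token the user typed is valid right now.
def token_valid?(user_id, token)
  uri  = URI("#{BASE}/verify/#{token}/#{user_id}?api_key=#{API_KEY}")
  body = JSON.parse(Net::HTTP.get(uri))
  body['success'].to_s == 'true'
end
```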

All of this data is encrypted and stored in MongoDB. We take backups twice a day. We take each backup, encrypt it, and store it in S3. For every backup, we generate a 256-bit AES symmetric key, encrypt the database with it, and then use a 2048-bit RSA public key to encrypt the symmetric key. We store both the encrypted database and the encrypted symmetric key in S3. We keep the private RSA key on a USB key that's never connected to the Internet, stored in a secret place. So if we ever need it, we just plug it into a secure computer, decrypt the symmetric key, decrypt the database, and upload it again.
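That envelope-encryption scheme maps directly onto Ruby's OpenSSL bindings. Here is a compact sketch of the encrypt side (the decrypt side mirrors it with the offline private key); the S3 upload and file handling are omitted, and none of this is Authy's actual code.

```ruby
# Envelope encryption sketch: a fresh AES-256 key encrypts the dump, and an
# RSA-2048 public key encrypts that AES key. Only the ciphertexts go to S3.
require 'openssl'

def encrypt_backup(dump_bytes, rsa_public_key_pem)
  # 1. Random symmetric key and IV used for this backup only.
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.encrypt
  aes_key = cipher.random_key   # generates the key and sets it on the cipher
  iv      = cipher.random_iv    # likewise for the IV
  encrypted_dump = cipher.update(dump_bytes) + cipher.final

  # 2. Wrap the symmetric key with the 2048-bit RSA public key. The matching
  #    private key lives offline, so nothing online can decrypt the backup.
  rsa = OpenSSL::PKey::RSA.new(rsa_public_key_pem)
  encrypted_key = rsa.public_encrypt(aes_key)

  # Both ciphertexts (plus the IV) would then be uploaded to S3.
  { encrypted_dump: encrypted_dump, encrypted_key: encrypted_key, iv: iv }
end

# Demo usage: in production only the public half would be on the backup machine.
demo_rsa = OpenSSL::PKey::RSA.new(2048)
result   = encrypt_backup('mongodump bytes...', demo_rsa.public_key.to_pem)
```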

LS: Let's talk briefly about your build process and how you ship code.

D: Our testing is local, RSpec for everything, but we also have a couple of scripts that test production after we deploy. On EC2 and S3 we've replicated production and created a staging environment. So the first deployment goes to staging; we run the tests and scripts, and if everything is fine, it goes to production, sometimes only to one production machine at first. For security reasons, you first have to log in to one of the application machines. You do a git pull from there, and that deploys the code. So it's not push, but pull. The way you get to that machine is a bit complicated, but that's also for security reasons. Only one person does it, so unfortunately if you're a developer you can't just deploy to production yourself, also for security reasons. Developers push their code to branches, and that's all pushed to our own GitHub instance; we're big users of GitHub Enterprise. Then one person does the code review, runs the tests, and deploys to production. At most companies you can just push to production using git push, but that wouldn't work for us; it would simply be too insecure.
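A pull-based deploy like that can be boiled down to a small script run on the application machine itself once you've logged in. The paths, branch name, and the Unicorn restart below are illustrative assumptions, not Authy's actual tooling.

```ruby
# Sketch of a pull-based deploy run on the app machine: fetch the reviewed
# branch, install gems, and hot-restart the app server.
APP_DIR = '/srv/api' # placeholder checkout path

def deploy!(branch = 'production')
  Dir.chdir(APP_DIR) do
    system('git fetch origin')            or abort 'fetch failed'
    system("git checkout #{branch}")      or abort 'checkout failed'
    system("git pull origin #{branch}")   or abort 'pull failed'
    system('bundle install --deployment') or abort 'bundle install failed'
    # Unicorn re-execs itself on SIGUSR2, picking up the new code without downtime.
    system('kill -USR2 $(cat tmp/pids/unicorn.pid)') or abort 'restart failed'
  end
end
```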

LS: Has this slowed you guys down at all?

D: No, we don't ship to production very often. We try to work in one-week sprints. If we want to deploy a lot of code to production one day, someone will do it. But we try to avoid that, because that's not the type of system we're running. We want a very well-tested, robust system that always works. If you're shipping constantly and breaking stuff, that's fine if you're doing a photo sharing site. But if people cannot log in to their VPNs because of us, that's pretty bad.

LS: Right. What do you guys use for monitoring?

D: For the API and dashboard, every time there's an error we get an email and we fix it right away. For mobile apps, we use Crittercism, which sends us crash data.

For performance, we use Splunk. We can do very complex queries, like which request took the most time, etc.

LS: Any other tools or services that are key to how you guys work?

D: Yeah, we've experimented a lot with collaboration tools. The way I did it is, if I found a service I liked, I would just buy it and give it to everyone. If they kept using it, I'd keep paying for it; if they didn't, I wouldn't. We tried a bunch of chat services, and the only one that stuck was HipChat. We connect it to everything: GitHub, and a pretty awesome service we use now called Blossom, which is for product management; we connect that to HipChat too. When someone adds a new feature in Blossom, it sends a message out on HipChat and everyone can see it. HipChat and Blossom are the two tools we use constantly.

We're also using Stripe for payments. But once people move onto a yearly plan, we bill them for the whole year through a wire or credit card and process that ourselves. If we weren't doing that, we'd use Stripe for it too; it would just be expensive. But one thing that has worked very well for us is that, in order to enable SMS sending and reduce fraud, we always ask for a valid credit card, and Stripe is really good for that. I think asking for a credit card is a very good fraud prevention method.
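That credit-card gate can be sketched with the Stripe Ruby gem: try to create a customer from a card token and treat a card error as a failed check. The flow and parameter names here are assumptions based on Stripe's classic token flow, not Authy's billing code.

```ruby
# Sketch: require a valid card before enabling SMS for an account.
require 'stripe'

Stripe.api_key = ENV.fetch('STRIPE_SECRET_KEY')

def card_on_file?(email, card_token)
  Stripe::Customer.create(email: email, source: card_token)
  true
rescue Stripe::CardError => e
  warn "Card rejected: #{e.message}" # invalid or declined card: block SMS-enabled signup
  false
end
```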

LS: Cool, I think that covers the tech side of things. Let's talk about the security space more broadly and how you guys fit in. There's this whole idea of being reactive as opposed to proactive; I'm sure that's one of the challenges you face as a company.

D: Definitely. People didn't use to use SSL certificates. Nobody cared about them, but nowadays everyone uses them. I think that's what's happening with two-factor authentication. So what we did was just wait. We knew it would be big in the next few years, to the point where it's the norm and not the exception: you're going to launch a site with two-factor auth.

LS: You recently open sourced your VPN two-factor auth tool. Why?

D: We kept our OpenVPN tool closed source for a while. But what we've noticed is that people just want a simple way to secure their services, and if they don't find it, they won't use it. And people are getting hacked because a password was compromised. That shouldn't happen; that's too easy an attack. So we decided to be as open as possible from now on. Every different component we build, we'll just open source it and charge for our API plans only.

It was the realization that the best thing for us is for more people to use two-factor authentication. We could monetize in the short term by being very exclusive and only dealing with companies that will pay a lot of money. But it's just a matter of time before our cost per user becomes very small and we won't need to charge a lot per user. At the end of the day, we want people to have more secure systems, and it will play out how it plays out.

LS: So what has the response been since you open sourced your VPN tool?

D: The response has been very positive. We definitely want to push it more and get more people to hear about it. But a lot of people still have very insecure systems and deployments. The number of people that have VPNs, if they have them at all, is still very small. Most people just log into their production machine, which sits directly out on the Internet; some of them just use a password. I think in the next year those people are going to start to realize that that's not a good long-term strategy. You have to protect yourself better. Getting compromised is something I don't wish on anyone: it's hard to recover from, people will remember it, and it's hard to gain that trust back. And nowadays it's not that expensive to protect yourself well enough.

LS: Right, so it's almost like, in a few years you won't be able to sign up for a service without your phone.

D: And people wouldn't want to. Like my mom: even she knows to look for SSL on a site before entering her credit card, and if she doesn't see it, she doesn't trust it. She doesn't really know what SSL is; she just heard that if you're going to type in your credit card, you look for the green bar in Internet Explorer. So the industry figured out a way to make SSL consumer-friendly, and that's what we're doing for two-factor auth.

The Bitcoin world is a big user of two-factor authentication; people that use Bitcoin really like it. So I think it's just a matter of time before it spreads to all services and people will want to use it everywhere. Having different passwords for all your services is just hard. People on Coinbase really like two-factor auth; they feel safer when they use it, and the safer they feel, the more bitcoins they're going to buy.

LS: Right, so when you really think about it, for consumers you guys are a user experience company. That's what you optimize for, and that's the problem you're solving. It's not a question of whether people should be using this; it's about making the experience so painless and easy that people will adopt it, because there's no reason for them not to, other than the hassle.

D: Right, and that's what we focus on pretty heavily: improving that experience. If you lose a phone, you can easily get all your accounts back, that sort of thing. That's the hardest part about two-factor authentication. No one says two-factor authentication isn't secure; they always say, "Yeah, it's secure, but it's a pain in the ass." So that's what we're focused on improving.