"The first time we launched our beta, there were so many people using it; it could only go up to about 250 concurrents, or fewer if they had slow connections, because there was a request timeout that caused our CPU utilization to reach 110% but didn't crash the Node.js process. We could restart it, but it would just sit there not doing anything, and then no one could play."
"We were having fun, though; it's really exciting to be in an area where the whole community is so vibrant. I think of it like a genie with infinite wishes: whenever I think, 'I wish I had a library that did something,' I search for it and there it is. I just have to add it to the package listing and run npm install..."
"We would have picked a service that does a lot of the scaling for you...I was pushing for the adoption of these PaaS solutions early"
Node.js ShareJS Brunch CreateJS MongoDB
Amazon EC2 Amazon EBS Amazon Glacier Amazon S3 Amazon SNS Firebase CloudFlare GitHub Travis CI Ink File Picker Sendwithus MailChimp Olark Algolia Errorception New Relic PagerDuty Papertrail Leftronic Mixpanel Google Analytics Heap Segment.io Zapier Pixelapse HipChat Google Apps Zen Payroll Trello SETT Discourse Highrise
LS: Let’s start at the beginning and talk about how the idea first came about and then go from there.
George: I was learning to code and I used Codecademy and Learn Python The Hard Way. I'm not even sure; I used a whole bunch, probably about half a dozen over six months, and it felt like I was going back to school. Learn Python The Hard Way was probably the best. It's the least directive and it just felt more realistic to me. It was good, but the problem was that every evening when I got home, it felt like I was back in class. And there was a reason I didn't get a master's degree: I don't want to go back to school, I want to build things.
After a while, I noticed a trend; actually, there were two trends. Our first startup, Skritter, taught students of Chinese or Japanese how to better learn and remember their characters. It's a vocabulary acquisition tool. The first trend was that customers from Skritter kept telling us that they played it because it was like a game, which was unusual, because everything about that product was built to be un-gamified.
We made every decision with that product to make it as hardcore and non-gamified and non-shareable as possible. We were focused on "How can we maximize memory retention?" Well, you're going to have to understand an exponential decay curve. "All right, that's what we're going to do." And people kept telling us that it was like a game, and we were like, "What the heck are you talking about?" There were two or three things we did on the study screen. One was that we added a timer, and then there was the number of characters studied and the retention rate.
So people were reading those as scores; it was a score you were keeping. Around the same time, I started gaming with my brother every week. We were playing Borderlands and all sorts of pretty big titles, and I realized that I would make time to play games with my brother, but I wouldn't make time for my learning-to-code time, because it felt like I was in a class. I was like, "Wow." All right: our previous customers are using our products because they think it's a game. I am making room in my life to play games when I could be learning. Games are motivating. What if we built a product that taught something and was also a game?
That's where it originally started. I pitched these guys on that and they were pretty into it. We knew each other because of our former startup; it was us three. We've been working together since '08, and I've known these guys much longer.
We actually ended up pitching one another. The funny thing was that we were all like, "I really want to do another startup together." We got a lot of ideas together and had a sort of startup-idea gladiator competition, and I said I wanted to do something unsexy and profitable. And then we made a game to teach people to code.
We started in December 2012; that's when we made the decision. I was like, "All right, I'm going to quit. I don't know exactly when, but I'm going to quit," and Nick had already started hacking together a prototype. He was testing and putting together our stack even before Scott and I got in there.
Scott: I still remember looking at the console logging, like, "Why did that show up in the browser and not the server?"
Nick: So it was this horrible time for about two weeks where nothing was working. And then, I finally figure it out and it all starts working and I’m thinking, “All right. This is the client, server, real time multiplayer. I got my spell editor. I can move that thing. I can simulate the world in a web worker and it goes to … " All right. It’s going to work. I go to show Scott and he asks, “Wait. So what’s on the client? What's on the server?" I'm like, "I think... this?”
Eventually, we got our Backbone-versus-Node split figured out. We had Brunch actually compiling our Sass and our CoffeeScript.
LS: So how did you decide on ShareJS?
Nick: I think I was just Googling for something. I was searching for operational transform because I knew that Google Wave had this. I was thinking, "there's got to be something where someone has written a library where I can just get the operational transform and then I can start going as opposed to having to write that, which is hard."
Scott: And ShareJS was created by an ex-Google Wave employee.
We gradually started piecing things together. I also had a learning curve getting past the whole system we had before, which was based on Google App Engine and Django and Python; that's what Skritter was built on. So that's what we were all coming from, and we each had to figure out how the structure worked, like where the model and the view and the controller were and so on.
Nick: We wrote most of the initial code for Skritter about five years ago. So everything that we knew about running servers is … well, we don't run them; App Engine runs them. And everything that we knew about architecting an application is from Django, from five years ago.
And App Engine wouldn't run Node, so we went with Nodejitsu, because that's supposed to be the Google App Engine for Node.js; we were on there for a while. We saw a bunch of problems, and then eventually Michael comes along and says, "Well, you know, I'm only a sophomore in college and I've spun up a bunch of production AWS clusters, so I could probably set you up with a load balancer, sharded application clusters..." And the short story was, we were like, "Yes. All of that, yes."
Scott: So at some point I set it up on a Linode server and that worked until we started getting massive traffic.
Nick: Yeah. The first time we launched our beta, there were so many people using it; it could only go up to about 250 concurrents, or fewer if they had slow connections, because there was a request timeout that caused our CPU utilization to reach 110% but didn't crash the Node.js process. We could restart it, but it would just sit there not doing anything, and then no one could play.
Scott: So Ctrl-C, Up, Enter, over and over again.
Nick: And then we went to Startup School. So George and I are pitching Paul Graham on stage and Scott's like, “I’m so tired."
LS: Right, so this is what you guys were referring to on stage (video clip here)
Nick: Every 30 seconds.
George: We wrote a Python script to just do that automatically, to restart the Node server every five minutes or so, I think.
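George's actual watchdog script isn't shown in the interview; a minimal sketch of that kind of blunt restarter, with the command, entry-point filename, and interval all assumed, might look like this:

```python
import subprocess
import time

def supervise(cmd, interval_seconds, max_restarts=None):
    """Run cmd, then kill and relaunch it every interval_seconds.

    Mirrors the manual Ctrl-C / Up / Enter cycle: the process is
    terminated on a timer whether or not it is still healthy.
    max_restarts=None loops forever, as their script would have.
    """
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        proc = subprocess.Popen(cmd)       # launch the server
        time.sleep(interval_seconds)       # let it run for a while
        proc.terminate()                   # blunt restart, healthy or not
        proc.wait()
        restarts += 1
    return restarts

# e.g. supervise(["node", "app.js"], interval_seconds=5 * 60)
# ("node app.js" and the five-minute interval are assumptions)
```

This is the crudest possible supervisor: a real one would restart only on failure, but restarting unconditionally sidesteps having to detect the hung-but-not-crashed state they describe.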
Nick: But now Michael's here and that's over with. And he can talk about the actual AWS stack.
Nick: But after we moved to AWS, basically we'd be like, "Hey, look, a web service. We can finally use those now." With App Engine, you can only easily use stuff that works on App Engine. But now it's like, "We can use this, we can use that..." Only half that stuff eventually made it in, but there was a time when we were definitely too greedy; it was like a kid in a candy store with all these third-party libraries.
Scott: We were having fun, though; it's really exciting to be in an area where the whole community is so vibrant. I think of it like a genie with infinite wishes: whenever I think, "I wish I had a library that did something," I search for it and there it is. I just have to add it to the package listing, run npm install, tell George to run npm install, and then it's done.
LS: So you guys had multiplayer even in your prototype right?
Nick: Yeah. We had the multiplayer about three weeks in.
Scott: That was our first chosen library, yeah. Really wanted to try that operational transform.
Nick: It wasn't quite working; people would start typing, and we'd notice they'd type one character and all of a sudden eight characters appear. It was all crazy.
Scott: It's a really complicated algorithm.
Nick: Yeah. Eventually, we could have waited for ShareJS to update to the newer version, or we could just go to Firebase and put in an adapter for that; it's been totally stable since we did that. So we started using Firebase a couple of months ago.
LS: So what was your process for launching the prototype?
Scott: I think we launched the prototype in June. So from January to June we were building the prototype and testing it.
Scott: Yeah. George was still in North Carolina, so we did a fair amount of it over there; Nick and I would travel out. We would spend a week and a half or two weeks just working on it and doing a whole bunch of UX testing.
Nick: Paid college students to come and play it.
Scott: Or friends.
Nick: Well, not pay our friends to come and play it.
Scott: I think we did some Craigslist stuff.
We didn't really show it around online until then, until June. We were lining up UX testing, and that worked fairly well, at least in the beginning, because we were just trying things out.
Nick: The first time we actually got a good amount of traffic, George had a link in a blog post that made Hacker News.
George: Yeah. It was number three or something for like five minutes.
Nick: It was this tiny link to CodeCombat, and it's the first time the Internet got a look at it, but we still had it so that whenever you joined, everyone was in the same multiplayer session. So basically about 60% of the people were trying to code in the Olark chat box; they were chat coding in a chat box. Then 30% of the people were trying to code in the real thing and totally writing over each other's code. And then 10% of the people were mashing the "Reload All Code" button. Everyone was swearing at each other, 100 players in level 1.
Scott: That wasn't that long ago though. That was like...
Nick: That was June.
George: No. That was in August or something. Anyway, it was after June. I remember it because it was really hot and really rainy, and it's neither of those here.
LS: Cool so let's talk about your current stack and when you guys moved to a new architecture.
Michael: I'd been talking to Nick about CodeCombat since last summer, but right after Startup School, when the server was going down, he said, "It would be really great if you could build us something better really fast."
Scott: It was random. Normally, we would expect, "Okay, we got a big spike," and then it's going to go down nice and evenly, but for two weeks it was just random spikes as apparently one country after another found us. During Startup School itself I only had to restart once; traffic went up, then it was fine, and I was like, "Perfect."
And then a few hours later, there was suddenly a ton of traffic, because apparently someone posted us on Facebook in Brazil and everyone in Brazil came by. The whole country invaded and took the site down, and we were constantly restarting it. Then Poland arrived a few days after that, and then I think France, and some people in Spain and Ukraine.
Scott: At some point Mexico came along, so it was one country after another. Generally speaking, we were wondering, "Which country is invading now?"
Nick: And Michael is like, “All right, guys. I can fix this.”
LS: So you knew you didn't want another PaaS.
Nick: Well, we did consider Heroku but then Michael was like, “I know AWS."
Scott: Also OpenShift. It seemed cool. I was like, “I want to try this guy,” and he’s like, “No.”
Michael: I had used AWS in previous projects and their API is really good. The setup is pretty simple. We use CloudFlare as the CDN, so a lot of the static stuff is served from there. As for dynamic content, requests come into an Elastic Load Balancer and that load balancer is in front of a number of application instances.
Neither our load balancer nor application layer is sticky, so we can just take off or put on application instances to scale with load. Then for our database setup, we have one large machine running MongoDB as the primary and have a few replicas; one is in the same region, but a different availability zone, and then another one is across the country. We thought about sharding, but we haven’t had performance issues that would benefit from us doing so; we might do it in the future if some of our planned database-heavy features start to take off.
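Because neither the load balancer nor the application layer is sticky, capacity planning reduces to simple headcount arithmetic. As an illustration only (the 250-concurrents figure echoes the single-server ceiling mentioned earlier, and the minimum of two instances is an assumption, not their actual policy), the scaling decision might be sketched as:

```python
import math

def desired_instance_count(concurrent_players,
                           players_per_instance=250,
                           minimum=2):
    """How many stateless app instances to keep behind the ELB.

    players_per_instance assumes the ~250-concurrents ceiling one box
    handled before the move; minimum keeps spare capacity for failover.
    Both numbers are illustrative, not CodeCombat's real settings.
    """
    needed = math.ceil(concurrent_players / players_per_instance)
    return max(minimum, needed)

# e.g. the ~600 simultaneous players seen during Hour of Code
print(desired_instance_count(600))  # -> 3
```

The point of the non-sticky design is exactly that this number can change at any time: instances can be attached to or detached from the load balancer without migrating any session state.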
We also have one small machine that coordinates everything. Initially, when I was looking at infrastructure management solutions, I had heard of Chef and Salt, but I didn't know either well enough at the time to quickly implement something. I felt the problem was simple enough that I just wrote a solution in Python (essentially a layer over boto and paramiko). Those scripts will be open source once I finish refactoring them to be more general and easier for people to benefit from.
George: You left out the cool part, which was that right before Hour of Code (hosted by Code.org), we were told to prepare for anywhere between 2 and 4 million uniques. We weren't sure how many, and they weren't sure how many, and we had just gone through the Startup School thing, so we were like, "Let's not have our server get pegged." We said, "Michael, we need a lot of steel." He said, "Whatever, man."
We had all this prepaid credit with Amazon that we had arranged, and we were like, "We need something big. Put it together." I don't know what the details were, but I remember we had like...
Scott: Six hundred people at the same time playing, tops.
Scott: We got about 180,000 unique visitors...
Nick: In a week.
George: So we had a big database server and six application servers.
Scott: And none of them were over 3% CPU usage.
Nick: Yeah, in the middle of it we checked our CPU utilization and it was at about 1%, under full load, tons of people on the site, and we were like, "Great." Then we get this message from Michael: "Wait, wait, guys, guys, these are operating in single-core mode at the moment." We're like, "Well, it's under 1% utilization." He says, "Nah, I'm enabling quad-core." Uh... I guess you can do that if you want. "I've got all cores working on this now." "All right. Good job." Now we're at 0.01% utilization. Anyway, it was a big scale-up for that.
Michael: I think we're at something like a 13ms mean response time. We should be good for a while.
LS: So you guys are happy with your infrastructure now. What would you guys have done really differently, if you had to do it again?
Scott: We would have picked a service that does a lot of the scaling for you. Maybe Heroku or one of the other PaaS solutions.
Nick: The problem was, when we were working with ShareJS, we were thinking, "Oh man, which of these PaaS providers is going to let us use WebSockets?" With some of them, the ones we would have looked at, I thought we couldn't run ShareJS, and actually I think I was wrong; I just misread it or whatever. So we ruled out a lot of PaaS options early on because of that one issue.
Scott: What we're trying to say is ShareJS has had a large effect on us, and is now gone.
George: I was pushing for the adoption of these PaaS solutions early because of our experience with Skritter. Early on, we were like, "Wow, we really don't know what we're doing with servers. We really have no idea about load balancing, any of this." App Engine is very expensive, but it's cheaper than hiring someone to manage it for us. So we just paid it, and after a while I really enjoyed that, because, for instance, when we were doing all that switching to Linode and AWS, I was like, "Guys, the site is unstable. What is up? We didn't have this problem when we were on App Engine."
Nick: Oh we totally did ...
George: Yeah, I don't remember that in '08, but there was some pain in there that I wasn't used to, because I was like, "Why don't we pay someone? Heroku is here ready to take our money; give it to them and..."
LS: So you would have preferred a managed platform?
Nick: Yeah. I mean, Michael is probably the only guy here who has actual server admin experience. If we'd done that previously, I don't know. Running this close to the steel is great for cost; you get a big cost advantage. If we ran our own servers locally, we'd save money, but then we'd need to manage failovers, regions...
Scott: And there are all these open source libraries. We don't want to have to write CreateJS (that's another thing we use, by the way). We use it for all the game stuff interfacing with the canvas: text, sound, preloading.
Scott: There's a list of other libraries and things on our GitHub wiki.
Nick: We use a ton of services too: