Fog Creek Software, founded in 2000 by Joel Spolsky and Michael Pryor, are the makers of FogBugz and Trello, and co-creators of Stack Overflow. We recently launched the open beta of HyperDev, a developer playground for quickly building full-stack web apps. It removes all setup, so you only have to worry about writing code to build a web app. Or, as one user quipped, ‘let's crowdsource a million containers’, which is something we’ve done within just a few weeks of launch.
In HyperDev, the apps you create are instantly live and hosted by us. They’re always up to date with your latest changes because changes are deployed as you type them. You can invite teammates to your project so that you can collaborate on code together and see changes as they’re made. To get started quickly you’re able to remix existing community projects, and every project gets a URL for editing and viewing, so you can share your code or your creations.
But perhaps the most interesting thing about HyperDev is what you don’t have to do to get a real, fully-functional web app running:
It takes care of all that for you, so you can just focus on writing code. This makes it a great option for those just learning to code, who can avoid some of the complexity whilst getting up to speed. But it’s useful for more experienced developers too, who want to bang out some code and get a product in front of people quickly for feedback.
At Fog Creek, Engineering is split into two inter-disciplinary product teams: FogBugz and HyperDev. Members work across the full stack, QA and testing. The majority of staff work on FogBugz, but more are being added to HyperDev as it develops. Currently, a team of 8 contributes to HyperDev, of which 5 are full-time (3 working on the back-end and 2 on the front-end), with additional support for project management, marketing and system administration.
There are 3 main parts to HyperDev: the collaborative text editor, the hosted environment your app runs in, and the quick deploy of code from the editor to that environment. We've found that we need to get code changes updated in the app in under 1.5 seconds. Otherwise, the experience doesn't feel fluid, and you end up longing for your local dev setup. What's more, we need to do this at scale – we think millions of developers, and would-be developers, can benefit from HyperDev, and already we've seen hundreds of thousands of developers try things out, so it needs to be able to do all of it for thousands of projects at once.
The choice of CoffeeScript was mostly due to familiarity and a preference within the team for its minimal code aesthetic. You might not be familiar with Hamlet.coffee. It's the creation of one of our team members, Daniel X Moore. It nicely solves the problem of using CoffeeScript with a Jade-like syntax for reactive templating, without having to resort to hacks, like we had to when using Knockout and Backbone.
Within the app we also have npm embedded, so you can search and select packages to include in your package.json file directly. The search and metadata for that are provided by an API service called Libraries.io.
From the outset, we decided not to write our own editor. In general, online editors are difficult to write and although there are complexities in depending on someone else’s design for an editor, we knew that the editor was not our core value. So we chose to use Ace and hook in our own minor modifications to interface into our editor model.
For collaborative editing we use Operational Transformation (OT), which allows edits made in 2 or more instances of the same document to be applied across all the clients and the back-end, irrespective of the order in which they are generated and processed. To help jumpstart our implementation of OT on the frontend we used a fork of Firepad under the Ace editor, which uses the OT.js lib internally. This was then interfaced into our own model of the documents and our app's WebSocket implementation.
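To make the idea behind OT concrete, here is a minimal sketch in Go. It is not how Firepad or OT.js are implemented: the `Insert` type, the `Site` tie-breaker and the `Transform` function are hypothetical simplifications that only handle concurrent inserts, whereas real OT libraries also cover deletes and retains. The point it illustrates is that each client applies its own edit first, then the other client's edit after transforming it, and both end up with the same document.

```go
package main

import "fmt"

// Insert is a hypothetical, simplified operation: insert Text at rune offset Pos.
// Site identifies the client and breaks ties when two inserts share a position.
type Insert struct {
	Pos  int
	Text string
	Site int
}

// Apply returns doc with op applied.
func Apply(doc string, op Insert) string {
	r := []rune(doc)
	out := make([]rune, 0, len(r)+len([]rune(op.Text)))
	out = append(out, r[:op.Pos]...)
	out = append(out, []rune(op.Text)...)
	out = append(out, r[op.Pos:]...)
	return string(out)
}

// Transform rewrites op so that it can be applied after the concurrent
// insert `other` has already been applied to the same base document.
func Transform(op, other Insert) Insert {
	if other.Pos < op.Pos || (other.Pos == op.Pos && other.Site < op.Site) {
		// other landed before op, so op's target offset shifts right.
		op.Pos += len([]rune(other.Text))
	}
	return op
}

func main() {
	base := "helloworld"
	a := Insert{Pos: 5, Text: ", ", Site: 1} // edit made on client A
	b := Insert{Pos: 10, Text: "!", Site: 2} // concurrent edit on client B

	// Client A applies its own edit, then B's edit transformed against A's.
	docA := Apply(Apply(base, a), Transform(b, a))
	// Client B applies its own edit, then A's edit transformed against B's.
	docB := Apply(Apply(base, b), Transform(a, b))

	fmt.Println(docA)
	fmt.Println(docB)
}
```

Both clients print the same text even though they applied the two edits in opposite orders, which is the convergence property OT is there to provide.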
Using Firepad and Ace was a real boon whilst pushing towards our MVP, as it meant we could direct our dev resources elsewhere and we could leverage the established themes and plugins built upon Ace.
We use AWS for our backend, which meant we didn’t have to commit too early to any given part of the stack, from hardware through to OS and infrastructure services.
We knew we wanted to provide multi-language support in HyperDev, even though at beta release we'd just be offering Node.js initially. So when it comes to handling users' code, the proxies that accept requests from the frontend client, process them and orchestrate the client’s running code are all written in Go.
We chose Go because it is strong in concurrent architectures, has powerful primitives and robust HTTP handling. In addition, several of our stack components were written natively in Go, which gave us confidence in the client APIs we would need. Go also had the benefit of producing standalone binaries, so our dependencies would be minimal once we had the binary compiled for the appropriate architecture.
On the frontend, each of our proxies exposes a health endpoint, and a proxy that fails its health check is pulled out of Route 53 DNS. The proxies are distributed across our AWS availability zones. It's the responsibility of the proxies, written in Go, to route traffic either to an existing available instance of the user’s project, or to place the project on a backend node and route to that.
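As a rough illustration of the health endpoint idea, here is a minimal Go sketch. The `Health` type and the `/healthz` path are hypothetical names, not taken from the HyperDev codebase. It assumes the health checker treats a 2xx response as healthy and anything else as a failure, so a proxy can take itself out of rotation by answering 503.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Health tracks whether this proxy should receive traffic.
// It is a hypothetical type used only for this sketch.
type Health struct {
	ok atomic.Bool
}

// SetHealthy flips the state, e.g. when a dependency becomes unreachable.
func (h *Health) SetHealthy(v bool) { h.ok.Store(v) }

// Status returns the HTTP status code a health checker would see.
func (h *Health) Status() int {
	if h.ok.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// ServeHTTP lets Health be mounted directly as an http.Handler.
func (h *Health) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	code := h.Status()
	w.WriteHeader(code)
	fmt.Fprintln(w, http.StatusText(code))
}

func main() {
	h := &Health{}
	h.SetHealthy(true)

	// Exercise the handler in-process instead of binding a port.
	// In a real proxy it would be mounted with http.Handle("/healthz", h).
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	fmt.Println(rec.Code)

	h.SetHealthy(false)
	rec = httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	fmt.Println(rec.Code)
}
```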
Since all the frontend proxies needed to know the state of project placement, which was fluid over time, we decided to experiment with etcd. Each of the proxies is a node in the etcd cluster so that it has a local copy of state. We were then able to use atomic compare-and-swap changes to consistently route to the right backend instance. However, as we ramped up in the early beta we noticed that there would be periodic hangs in servicing requests. It turned out that because etcd uses a log-appending algorithm, after a few thousand changes it needs to “flatten” its view of the data through snapshots. Our increasingly busy set of user projects would trigger this regular flattening of the database, which led to the hangs. So for now, we’ve moved over to PostgreSQL for state handling.
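The compare-and-swap pattern for project placement can be sketched without any particular datastore. The Go example below uses a hypothetical in-memory `Placements` type as a stand-in for the shared state store (it is not the etcd or PostgreSQL client API). Several proxies race to place the same project, only one write wins, and every caller learns which backend actually holds the project.

```go
package main

import (
	"fmt"
	"sync"
)

// Placements is a hypothetical in-memory stand-in for the shared state store.
type Placements struct {
	mu sync.Mutex
	m  map[string]string // project ID -> backend node
}

func NewPlacements() *Placements {
	return &Placements{m: make(map[string]string)}
}

// CompareAndSwap sets the project's backend to next only if the stored value
// still equals prev (an empty prev means "not placed yet"). It returns the
// value stored after the call and whether this caller's write won.
func (p *Placements) CompareAndSwap(project, prev, next string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	cur := p.m[project]
	if cur != prev {
		return cur, false // someone else placed it first; route to their choice
	}
	p.m[project] = next
	return next, true
}

func main() {
	store := NewPlacements()
	results := make([]string, 3)

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each simulated proxy proposes a different backend node.
			candidate := fmt.Sprintf("node-%d", i)
			results[i], _ = store.CompareAndSwap("project-42", "", candidate)
		}(i)
	}
	wg.Wait()

	// Whichever proxy won, all three now agree on where to route.
	fmt.Println(results)
}
```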
A user’s application is sandboxed in a Docker container running on AWS EC2 instances. We chose Docker due to its strong API and documentation. An orchestration service then needs to coordinate the content on the disk, content changes with the editor, the Docker containers used for installation and running the user’s code, and the returning of all the necessary logs back to the user’s editor.
The challenge here is that some parts of the architecture needed to be fast, with low-latency exchanging of messages between the components, and others needed to handle long-running, blocking events such as starting a user’s application. To get around this, we used a messaging hub and spoke model. The hubs were non-blocking event loops that would listen and send on Go channels. The spokes would reflect the single instances of a project’s content with OT support or container environment via the Docker APIs. This architecture has worked well and enabled us in the early days to split the proxies off from the container servers without too much effort, and a messaging approach lends itself to decoupling components as needs arise.
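The hub-and-spoke shape described above might look roughly like the following Go sketch. All names here (`Hub`, `Message`, `runSpoke`, the buffer size of 16) are hypothetical and the slow work is reduced to a string concatenation. What it tries to show is the division of labour: the hub loop only routes messages and never blocks on a slow spoke, while each spoke owns one project's long-running work in its own goroutine.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical event addressed to one project.
type Message struct {
	Project string
	Body    string
}

// Hub routes messages to per-project spokes. Only the hub goroutine
// touches the spokes map, so the map itself needs no lock.
type Hub struct {
	In      chan Message
	Results chan string
	Dropped int // messages discarded because a spoke's inbox was full
	spokes  map[string]chan Message
	wg      sync.WaitGroup
}

func NewHub() *Hub {
	return &Hub{
		In:      make(chan Message),
		Results: make(chan string),
		spokes:  make(map[string]chan Message),
	}
}

// Run is the hub's event loop. It never performs slow work itself.
func (h *Hub) Run() {
	for msg := range h.In {
		inbox, ok := h.spokes[msg.Project]
		if !ok {
			inbox = make(chan Message, 16)
			h.spokes[msg.Project] = inbox
			h.wg.Add(1)
			go h.runSpoke(inbox)
		}
		select {
		case inbox <- msg: // hand off and move straight on
		default:
			h.Dropped++ // spoke is backed up; do not stall the hub
		}
	}
	// Input closed: shut the spokes down and wait for them to drain.
	for _, inbox := range h.spokes {
		close(inbox)
	}
	h.wg.Wait()
	close(h.Results)
}

// runSpoke handles one project's messages in order. Blocking work, such as
// starting a container, would live here rather than in the hub.
func (h *Hub) runSpoke(inbox <-chan Message) {
	defer h.wg.Done()
	for msg := range inbox {
		h.Results <- msg.Project + ": " + msg.Body
	}
}

func main() {
	h := NewHub()
	go h.Run()
	go func() {
		h.In <- Message{Project: "a", Body: "start"}
		h.In <- Message{Project: "b", Body: "start"}
		h.In <- Message{Project: "a", Body: "edit"}
		close(h.In)
	}()
	for r := range h.Results {
		fmt.Println(r)
	}
}
```

Messages for the same project are handled in order, while the relative order between projects is not guaranteed.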
Post-launch, as we scaled up, we ran into a number of kernel bugs. So we tried out several OS and Docker version combinations, and in the end settled on Ubuntu Xenial with Docker. This works well for stability under load.
Overall, this part of the system has proven quite difficult to maintain. There’s opportunity to simplify things and leverage Docker Swarm, so we’re in the process of moving over to that. We’ll also be re-evaluating whether Amazon’s ECS can help too, though it may well prove to be an unnecessary layer of complexity over Docker Swarm.
We also use a number of HyperDev hosted elements as part of our backend services. That way we're always dogfooding our own product. This is important for us: if we want people to trust and rely on HyperDev for their projects, then we should be happy to do the same for our own. This includes our authentication and authorization services, which are the first services used once the frontend is running. So any problems on the backend are customer-facing and immediately impact a user’s experience of the product. This has caused some growing pains compared with using more mature, battle-hardened options. But it has meant we’ve been focused on reliability from the outset.
For tracking event flows in the front-end we use Google Analytics, which gives us both reporting on specific events users take within the app and a 10,000ft view of the overall activity trends in the app and on the website. We also use New Relic to get an overview of the performance of our systems and application.
For the backend, as with any system that crosses multiple system boundaries in a single transaction, it is important to maintain visibility into what is going on. In the early days, system logs worked OK. But as the number of systems grew beyond a few, combined with the random placement of projects, it became important to stream the logs off the server. We chose Loggly for this. The wins we got were that we weren’t filling up disks with debug logs, we could filter logs that crossed multiple systems, and with well-formatted logs, we could generate charts and reports.
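One common way to get logs that can be filtered across systems is to emit one JSON object per line with a shared request identifier. The Go sketch below illustrates that general approach. The field names, the `Entry` type and `FormatEntry` are assumptions made for this example, not HyperDev's actual log format or any Loggly requirement.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Entry is a hypothetical structured log record. A shared RequestID lets
// lines from different services be correlated for a single transaction.
type Entry struct {
	Time      string `json:"time"`
	Level     string `json:"level"`
	Service   string `json:"service"`
	RequestID string `json:"request_id"`
	Message   string `json:"message"`
}

// FormatEntry renders one log line as JSON. The timestamp is passed in as
// Unix seconds so the output is deterministic and easy to test.
func FormatEntry(unixSeconds int64, level, service, requestID, msg string) (string, error) {
	b, err := json.Marshal(Entry{
		Time:      time.Unix(unixSeconds, 0).UTC().Format(time.RFC3339),
		Level:     level,
		Service:   service,
		RequestID: requestID,
		Message:   msg,
	})
	return string(b), err
}

func main() {
	// The same request ID appears in lines emitted by two different services.
	for _, svc := range []string{"proxy", "container-host"} {
		line, err := FormatEntry(time.Now().Unix(), "info", svc, "req-1", "handled")
		if err != nil {
			panic(err)
		}
		fmt.Println(line)
	}
}
```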
To keep us organized, and to plan and prioritize upcoming work, we use FogBugz. We previously used Trello for this, but as the number of items grew it became easier to manage this with FogBugz. We also use Google Docs for getting feedback on new feature ideas and marketing plans etc.
One of the core development principles of the team is deploy often, and expose failures fast. We could not achieve that without a continuous integration and deployment pipeline. Once the code is checked into the GitHub repository, Travis CI then kicks off an integration flow for all branches and Pull Requests. Any problems are rapidly identified and injected into our #Dev channel in Slack.
Travis CI enables us to run unit tests with Mocha, enforce coverage using Istanbul, run any compile and packaging steps and, if everything passes in our deployment channel, push to staging or production as appropriate.
Travis also allows us to run long-running tests while we continue to work. These tests include coverage tests and race condition checking. The latter exposed some issues in our code structure that we were grateful to know about upfront, because there is nothing worse than trying to debug an unexpectedly failing process across an unknown number of servers when it hits a race condition.
And lastly, Travis allows us to compile the binaries and upload them to a repository, tagged with the Git commit string. This means that downstream in staging and production we are using the same binary, pulled from the repository, that passed the tests in Travis.
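For Go binaries, one way to tie a build to its commit is to stamp a package-level variable at link time with the linker's `-X` flag. The sketch below assumes that approach. The `commit` variable, the `ArtifactName` helper and its naming scheme are hypothetical and only illustrate how a binary and its repository key could carry the same Git commit string.

```go
package main

import (
	"fmt"
	"runtime"
)

// commit is meant to be overwritten at build time, for example:
//
//	go build -ldflags "-X main.commit=$(git rev-parse --short HEAD)"
//
// When built without that flag it keeps the placeholder value below.
var commit = "unknown"

// ArtifactName builds a hypothetical repository key for a compiled binary,
// so staging and production can pull exactly the build that passed CI.
func ArtifactName(service, goos, goarch, commit string) string {
	return fmt.Sprintf("%s-%s-%s-%s", service, goos, goarch, commit)
}

func main() {
	fmt.Println(ArtifactName("proxy", runtime.GOOS, runtime.GOARCH, commit))
}
```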
This approach has been valuable for us as it allows us to focus on developing code instead of deploying. It has forced us to maintain test coverage levels from day 1, and we know that if something does go wrong in production then we have a nimble and predictable deployment pipeline.
With the OAPI needing to interact with so many components, and our need to create repeatable and reliable stacks in development, staging, and production, we’ve spent more time on repeatability than on speed of deployment. So from day 1 we started codifying the stack in Ansible. This was great because our only dependency was SSH, and we could just as easily run it against our development environment in Vagrant as against staging and production in AWS. The downside is that it feels like we’re behind the curve on deploy speed: we have something that works, albeit more slowly than we might want.
Overall, we’re happy with our stack. We’ve had to learn a number of lessons quickly, as our launch brought more than 3 times the number of users we had anticipated (but that’s a nice problem to have!). However, no early-stage stack is perfect, and we’re continuing to refine and try different options as we scale up, improve the speed and performance of the service, and deliver the rock-solid reliability our users deserve.
So next time you want to write a quick script or prototype a new product, remember to give HyperDev a try.
Check out the HyperDev Stack.