Keen IO is an alternative to your in-house analytics database. You send them events from your mobile app, website, billing system, or backend servers: all the data that's too big to store in your application database. They let you run counts, sums, funnels, averages, and segmentation, and visualize this data using their analytics APIs. Keen IO faced a problem this past March: customers were showing up with tens to hundreds of millions of events that they wanted to query. They quickly realized they needed to re-architect their backend system, and what resulted was a distributed system with Cassandra and Storm at its core. We sat down with Josh Dzielak, VP Engineering, to get some insight into their stack and what their engineering workflow looks like.
LS: So tell us a bit about what your team looks like and your build process.
J: Our total team size is nine people, and potentially all of us are touching code on any given day. Three of us are full-time engineering: our CTO, myself, and one of our engineers. Our process kicks off when we push to GitHub; we all use GitHub from the command line. So we push up, and our Jenkins server, which we host on SoftLayer, listens for those pushes and then runs tests. We have a variety of different test suites, both unit and integration, for each one of our projects, and Jenkins will run those any time it sees pushes to any branch. We use topic branches for major new features. Because there's only a few of us we don't do any sophisticated branching. If we're doing a production hotfix, we'll probably just push that to master. The way that we get feedback on whether the build failed or succeeded is through a specific devops chatroom that we have in HipChat. We don't get emails or anything, we just have HipChat open. It's also hooked into GitHub. We have Hubot integrated with HipChat. We use Hubot to contact Jenkins, and then Jenkins does the actual deployments. We have a no-downtime rolling re-deploy for our API servers, the ones that run Tornado. That all happens within SoftLayer, and Jenkins handles it, communicating with our load balancer and each individual app server.
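The rolling re-deploy Josh describes can be sketched roughly like this. The helper names (`remove_from_lb`, `deploy`, `health_check`, `add_to_lb`) are hypothetical stand-ins for the Jenkins and load-balancer calls, not Keen's actual tooling:

```python
# Minimal sketch of a no-downtime rolling re-deploy: take one app server
# out of the load balancer at a time, deploy to it, verify health, then
# put it back. Helper callables are hypothetical stand-ins.

def rolling_redeploy(servers, remove_from_lb, deploy, health_check, add_to_lb):
    """Redeploy each server in turn; abort the rollout on a failed health check."""
    for server in servers:
        remove_from_lb(server)      # drain traffic away from this server
        deploy(server)              # push the new build to it
        if not health_check(server):
            raise RuntimeError(f"{server} failed health check; aborting rollout")
        add_to_lb(server)           # healthy again, return it to rotation
```

Because only one server is out of rotation at a time, the API stays up for the whole deploy, and a bad build stops the rollout before it reaches the rest of the fleet.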
LS: Any particular reason you went with SoftLayer?
J: We initially got involved with SoftLayer through their Catalyst program, which they make available to any TechStars company. So that was the initial hook. But we found that their dedicated servers are really good, especially on the networking front; they're known for having an excellent network. So even though we hadn't built this big Java-based, network-heavy backend yet, it ended up that those choices really aligned with having to build that system down the road. With the kind of workload we have now, with Storm and Cassandra, we would saturate the network really quickly in a place like Amazon or Rackspace (or have to pay for the expensive instances). So it's nice to have this predictable profile.
LS: For errors that pop up in production what do you use and how do you deal with them?
J: We send ourselves exception emails through SendGrid; we aren't currently using anything like Airbrake, shame on us. We probably should be, because right now the whole team usually gets exception emails. So sometimes we end up spamming the entire team, and we apologize for that when it happens (with another email, of course). We have some throttling that we built in. But better exception reporting is definitely on our "cloud service roadmap." We're pretty informal; we don't have a full-on ticketing system. We do put those things in Asana. We use Asana for product roadmap stuff but also for bugs and exceptions. So basically: read an email, fix it right away or drop it in Asana.
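The built-in throttling Josh mentions might look something like the simple rate limiter below. The class name and the five-minute window are assumptions for illustration, not Keen's actual code:

```python
import time

class ExceptionThrottle:
    """Suppress repeat exception emails for the same error within a time window.

    Hypothetical sketch: signatures and window size are illustrative only.
    """

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}  # error signature -> timestamp of the last email

    def should_send(self, signature, now=None):
        """Return True if no email for this error went out inside the window."""
        now = time.time() if now is None else now
        last = self.last_sent.get(signature)
        if last is not None and now - last < self.window:
            return False  # already emailed the team about this one recently
        self.last_sent[signature] = now
        return True
```

The check happens before the SendGrid call, so a crash loop produces one email per window instead of one per exception.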
LS: Do you use GitHub issues?
J: Only for our public client libraries and repositories, because that's how developers give us feedback: they submit pull requests and issues. We have anywhere between 3-20 pull requests or issues for most of our libraries, sometimes more. Sometimes someone just implements an entire feature and leaves it at our door. So it's really awesome to see the community participate.
LS: Very cool. So how do you triage and prioritize features?
J: Very informal and very emergent. We don't do any daily or weekly standups. We'll have one on ones sometimes just to chat. But a lot of our philosophy is "the squeaky wheel gets the grease." And so if something comes in and it's small, and it's not bigger than the thing you're actively working on then it goes into a queue. And if it doesn't come up again, then it probably means something else bigger is likely coming up more often and we treat it accordingly. But we still keep an eye on everything. What features to build, that's really all determined by customers. We keep as little of a roadmap as possible and have customers dictate what we're building on a day-to-day basis. If something's a loud problem, we won't forget about it. It helps us focus on the most important thing at the time without deciding in advance what that's going to be. And then we just have to balance that with making sure that we do all the support we need to for other things. There's a lot of things as engineers that we just want to go and build because it would be cool. But doing it this way really helps us keep our eye on the ball. We're bearish on roadmap and bullish on actual customer needs that are coming to us. The founders and myself and a lot of the rest of the team are former engineers or current engineers so we knew that we would have this bias towards building really cool features and sometimes not asking customers what they actually want. So we built it very strongly into our culture, not to do that and to make sure we're out there doing enough support and asking people all of the time for feedback. And that's valuable because with a lot of the companies that we help mentor, we don't always see that. We see a little bit more focus on the product itself and less on customer feedback. We take our marketing, being out in the community, very seriously for a company that's primarily all engineers. And so that's helped us a ton. 
We took that example from some of the companies that we respect, like SendGrid and Twilio, and tried to make that work inside our own organization. So customer-driven roadmap and also "fanatical customer support," which comes from one of the founders of Rackspace (one of our investors). We've found that that makes them our evangelists.
LS: Speaking of support. How do you guys handle that?
J: Support is a big part of our culture, both on the engineering side and throughout the whole company. So we do all of our own support, and a lot of our time is spent working through support tickets. We don't have a formal assignment process; everyone gets the email and somehow it just works, someone just responds. Sometimes we'll double-respond, but customers actually like that. A lot of support is inbound from Olark and from our user chatroom in HipChat. We've seen a lot of success from just having a publicly open chatroom that our users can come and sit in and ask their questions. The chatroom is the exact same model that many companies use IRC for, but HipChat is nice: you just give people a link and they land right in it. In the beginning HipChat was just less friction.
LS: So let's talk more about monitoring. Aside from exceptions what are some of the other things you keep track of in terms of performance and uptime? Because you guys are an analytics company so this has to be a big part of what you do.
J: Absolutely. We actually use Keen IO to monitor Keen IO. We don't dogfood everything, because you can't monitor a system entirely with the same system; you need to know when it's down, etc. But Keen IO is what we use to monitor everything that works while Keen is running. That's things like API response times, and requests per second at the API level and at the backend query level. We use a product called DataStax OpsCenter to monitor Cassandra at a query-detail level. We use Storm UI, a little dashboard that comes with Storm that gives you a view of everything that's coming through the system. Our customers have built dashboards on Keen IO to monitor their metrics, and we do the exact same thing. So we've built a bunch of dashboards using our stuff that run analysis against our collections to produce graphs, monitoring, and insights from real events that go through our system. So a lot of dogfooding, and it helps us see the good things and bad things. We'll be open-sourcing one of our frameworks that does HTTP response monitoring soon. We use New Relic and Pingdom, even though some of that is overlapping; we do that because we need some sort of external monitoring. We use New Relic primarily for server-style metrics. We use our own stuff less for things like CPU, more for things like API requests per second, things that New Relic doesn't really let you dive into. But they're excellent at giving you things like CPU and memory for the last hour. And Pingdom just tells us about any outages with the load balancer or the website, plus light API monitoring as well. We use StatusPage.io to communicate any downtime.
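Recording a monitoring event like the API response times Josh mentions is a single HTTP POST to Keen's event API. The sketch below only builds the request rather than sending it; the project ID and write key are placeholders, and the exact auth details are based on Keen IO's public HTTP API at the time, so treat them as an assumption:

```python
import json

API_BASE = "https://api.keen.io/3.0"  # Keen IO's public API root

def build_event_request(project_id, collection, event, write_key):
    """Build the URL, headers, and JSON body for recording one event in Keen IO.

    Sketch only: constructs the request instead of performing it, and assumes
    the key is passed in the Authorization header.
    """
    url = f"{API_BASE}/projects/{project_id}/events/{collection}"
    headers = {
        "Authorization": write_key,
        "Content-Type": "application/json",
    }
    body = json.dumps(event)
    return url, headers, body

# e.g. an API-response-time event like the ones Keen records about itself
url, headers, body = build_event_request(
    "PROJECT_ID", "api_responses",
    {"endpoint": "/3.0/events", "duration_ms": 42, "status": 200},
    "WRITE_KEY",
)
```

The same collection can then be queried for averages or percentiles of `duration_ms`, which is how a response-time dashboard gets its numbers.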
LS: Would you talk about some of the testing you guys do a bit more?
J: Yeah, sure. Most of our tests are in Python, and they cover everything in the API. We use JUnit extensively to test everything that we write in Java, both unit tests and integration tests that run full queries back and forth. So we have a way to test every permutation of query our API gets; we have that scripted against the backend database to make sure that it gets everything. And that's just custom homegrown stuff.
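Scripting "every permutation of query" usually comes down to taking the cross product of the query dimensions and running each combination against the backend. A sketch of that idea, with hypothetical analysis types, filters, and group-bys standing in for Keen's real ones:

```python
import itertools

# Hypothetical query dimensions; Keen's real test matrix would be larger.
ANALYSES = ["count", "sum", "average"]
FILTERS = [None, {"property_name": "status", "operator": "eq", "property_value": 200}]
GROUP_BYS = [None, "endpoint"]

def query_permutations():
    """Yield one query spec per combination of analysis, filter, and group_by."""
    for analysis, filt, group_by in itertools.product(ANALYSES, FILTERS, GROUP_BYS):
        query = {"analysis": analysis}
        if filt is not None:
            query["filters"] = [filt]
        if group_by is not None:
            query["group_by"] = group_by
        yield query
```

An integration suite would then run each yielded spec against the API and the backend database and compare results, so no combination of features goes untested.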
LS: In terms of collaboration, anything interesting you're doing there? Aside from GitHub and HipChat.
J: One is email. It's definitely not sexy but we're all really good at it. We've got an email rhythm that really works for us. Most emails go out to everyone and we all have our own way of filtering them. When we do remote pair programming, we're big fans of using VIM and Tmux together. Basically just session sharing right within the command line. That works really well for us.
LS: You guys have been growing pretty quickly, have you experienced any growing pains from a tech perspective?
J: The big obvious one is our move off of MongoDB onto Cassandra and Storm. We were working on that all year. We started to feel the pains of scaling in February or March. Once we started to get customers with tens to hundreds of millions of events in a single collection, it was hard to give them good query performance over that dataset when they were using filters and grouping and things like that. To be fair, Mongo isn't really designed to run queries like that. And that was a big thing for us. So we knew we had to provide a new system for better performance. Once you go from Mongo to a homegrown distributed system that involves Cassandra and Storm, devops and ops become a much bigger part of your life as a developer. My background is mostly in web development; Dan, our CTO, has done API development; neither of us had really done any hardcore ops before. The new system is designed to be massively scalable, but once we put it into production it had a ton of moving parts and a lot of growing pains. The biggest thing for us was that it felt like we were on call 24/7, constantly having to do ops, and just having to think about ops when we designed the software. If you're on the web or you're making an API, ops isn't really a big part of your design. But when you move to building really scalable backend systems, ops actually informs the way you write code. That was pretty new; both of us had to wrap our heads around it. At first you fight it, then you just learn to love the bomb. Ops is really annoying at first, and then you automate a lot of things. Our Chef repository probably grew by hundreds of lines. We use Chef to automate our server stuff on SoftLayer. We open-sourced our fork of the Cassandra cookbook that allows extra performance options to be set. Another big thing was the decision to build out more sophisticated monitoring. A couple of tools were really helpful.
Our favorite one is called stormkafkamon, a community-contributed library that just tells you how many messages are sitting on your Kafka queues waiting to be processed. That tool is exceptional because it tells us any time we get behind, and it's kind of a tricky calculation to make. So once we found it we were like, "holy crap, this exists? This is amazing." It helped us troubleshoot a few things this morning, actually. So that was one of the aha moments. The Storm community is great, the Kafka community is great, really helping us address some of these challenges. We also started to use a Java profiler called VisualVM; that tool's amazing. We had out-of-memory issues all the time because the JVM wasn't properly tuned, and VisualVM was the tool we used to deal with it. You can run it locally on your Mac and have it connect to the servers in the cloud (SoftLayer for us), and it does full inspection of the Java virtual machine so that you can see everything that's going on. That was another big moment that stuck out; it really helped us start writing more performant stuff.
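The calculation stormkafkamon automates is essentially Kafka consumer lag: for each partition, the newest offset on the broker minus the offset the consumer (here, the Storm spout) has committed. A minimal sketch of that arithmetic, with made-up offset numbers:

```python
def consumer_lag(log_end_offsets, consumer_offsets):
    """Messages waiting per partition: latest broker offset minus consumed offset.

    Both arguments map partition id -> offset; an unseen partition counts
    as fully unconsumed (offset 0).
    """
    return {
        partition: log_end_offsets[partition] - consumer_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

def total_backlog(log_end_offsets, consumer_offsets):
    """Total messages sitting on the queue, summed across all partitions."""
    return sum(consumer_lag(log_end_offsets, consumer_offsets).values())
```

When the total backlog grows over time instead of hovering near zero, the topology is falling behind its event stream, which is exactly the "are we behind?" signal Josh describes.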
LS: Any other tools homegrown or otherwise that you use that are key to your workflow?
J: We use Stripe for payments. And the cool thing is that both they and SendGrid have integrations with us that any of our customers can use, and that we actually use ourselves. So you can use Keen IO to do analytics on your Stripe data. We were actually just using Stripe and wanted to do more analytics on those transactions, so we reached out and asked if they could make their webhook compatible with publishing events to our system, and they were able to do that. So now any Stripe user can put their Keen URL into Stripe, Stripe will send all of the payment events to Keen, and the user can do the analytics in Keen at a much deeper level. SendGrid is exactly the same: paste your Keen IO URL into SendGrid and all the analytics for your emails flow into Keen for deeper analysis.
LS: Lastly, how do you guys think about security?
J: We've got firewalls set up of course. But we actually provide some level of data security for our customers. An example is, if we see a delete request for a really large collection we actually send ourselves an email about it and return a response saying "we'll be in touch." Just so no one accidentally deletes their gigantic production collection. If your collection is smaller than 50,000 events we let you delete it. Because we assume you're just developing and testing. But once a collection gets bigger, we protect against it. It might be a little frustrating when you actually do want to delete the collection, but we think it's a good tradeoff. One security feature we provide for customers is something we call 'scoped keys'. Fetching data to make charts requires some form of authentication, but on the web that credential is available to everyone. So it shouldn't be your master key, and if some of your data is sensitive it probably shouldn't be your full read key either. By using a scoped key, which our security docs show you how to create, you can scope data access down to just the data you want the client to have access to. This is usually how our customers give analytics to their customers.
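Keen's scoped keys embed a set of options (allowed operations, filters) that the master key holder bakes into a derived credential, so a browser can fetch only the data it is meant to see. The sketch below illustrates the concept with a simpler HMAC-signed token rather than Keen's actual scheme, which encrypts the options with the master key; all names here are illustrative:

```python
import hashlib
import hmac
import json

def create_scoped_key(master_key, options):
    """Illustrative scoped key: serialized options plus an HMAC that binds
    them to the master key, so the server can detect tampering.
    (Keen's real scoped keys encrypt the options with the master key instead.)"""
    payload = json.dumps(options, sort_keys=True)
    signature = hmac.new(master_key.encode(), payload.encode(),
                         hashlib.sha256).hexdigest()
    return payload, signature

def verify_scoped_key(master_key, payload, signature):
    """Recompute the HMAC and compare in constant time."""
    expected = hmac.new(master_key.encode(), payload.encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

# Scope a key to read-only access over one customer's events
payload, sig = create_scoped_key("MASTER_KEY", {
    "allowed_operations": ["read"],
    "filters": [{"property_name": "customer_id",
                 "operator": "eq",
                 "property_value": "abc"}],
})
```

Because the scope travels with the key, a customer can hand the derived credential to their own users' browsers: anyone can see the scope, but nobody can widen it without the master key.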