This is the first episode of Stack Stories, our new podcast, where we highlight the world's best software engineering teams and how they're building software. Hosted by Siraj (Stack Stories Head Honcho, Sirajology) and Yonas (CEO at StackShare). Follow us on SoundCloud.
With their 2017 IPO quickly approaching, we sat down with Jatinder Singh, Director of Platform Engineering at Hotel Tonight, to talk about the stack that got them there, from the early days of Appcelerator to their current Rails/Node stack. Hear all about their early scaling challenges and what's next for the soon-to-be unicorn. Listen to the interview in full or check out the transcript below (edited for brevity).
- Background
- Minimum Viable Product (MVP)
- Current Stack
- Scaling APIs
- Database Challenges
- Devops Stack
- What's Next
Siraj: How are you doing today? How's life?
Jatinder: Doing great. Couldn't be better. Just as the quarter is beginning, a lot of things at Hotel Tonight that we are looking to get into, so couldn't be better.
S: Awesome. Do you have a routine you do every day in terms of waking up and just getting in the mood?
J: Usually I'm in early, so the energy levels are up. The first thing I typically do is look at the performance in New Relic: how was the day yesterday, what was the average response time, what was the 95th percentile response time of our key APIs. There's also a single e-mail that goes out to everyone in the company that's like, "Hey, how did the company do? How did we do as a business?" Looking at that every day is kind of a reminder that we are on a mission here.
S: Who sends that e-mail?
J: It's an automated e-mail. We use a tool called Looker. Looker is an analytics tool that we can use on top of databases like Redshift, and you can send scheduled reports. One of those is scheduled every day.
S: From what I know about you and what you've said so far, it seems like you have a pretty good gig here. We've got the Equinox, got the Hotel Tonight in San Francisco, a short commute, all good things. What got you into computer science in general? What made you decide you wanted to be in programming?
J: The programming world definitely excited me back in the day. In the 90s, hooking into dial-up to connect to the internet and waiting 20 minutes, or even more than that, to hear a song. Those were interesting times. It really fascinated me right from the beginning, and pretty much after that, on the side, I started looking at C and Unix; those were my introduction.
S: It's always interesting to hear programmers talk about one of the first languages they worked in. Usually there's some amount of reverence, like it's still the best, but I feel like you must feel differently now; you don't feel as though C is the go-to language these days.
J: Absolutely. I think there's no one-size-fits-all. It really depends on the situation of the company and where they are. If they are just getting started, C probably is not the right tool, not the right technology. Even at Hotel Tonight and some of the startups that I've worked for before, Ruby on Rails, for example, is a perfect fit for when you're getting started. Eventually you may run into very specific challenges. That's where something like C, or these days Go, could be really powerful.
Depending on the job at hand, you pick the technology.
S: That makes sense. Speaking of job, Hotel Tonight. Why Hotel Tonight? You have a lot of experience, you could pretty much pick where you want to be and you chose this place.
J: Yeah, one of the things that I really like about Hotel Tonight is the open environment. It's the influence and impact that engineers can make at Hotel Tonight on the entire business. Looking at that email every day, like I said, and then doing something about it, not just seeing it. That is something that I saw right from the get go. I've been in the consumer space for about ten years now. Worked for a couple of startups before-
S: As a programmer.
J: Before Hotel Tonight, I was one of the founding engineers at a company called MUBI, one of the first engineers there. I worked there for about eight years. I realized that there's so much more to engineering than just writing code, and I saw that. When I looked at Hotel Tonight back then, I saw this as another startup, really. Not a startup in the sense of the numbers and the scale it was running at, but in terms of how it was running inside, it was still pretty much a startup. People were really open, everyone had an opportunity to make an impact. That's really why, back then, about two years ago, I joined Hotel Tonight.
S: The openness is a major thing for you, interesting. When it comes to engineering culture, specifically, let's say an engineer on your team has an idea for a feature, what does that process look like if he/she wants to implement it?
J: Yeah, sure. Let's say I have an idea. I've had a number of ideas in the past here, and the number one thing is how quickly you can demonstrate that the idea has legs. I think there are two types of innovation. There's tech innovation, where you're improving the tools and technologies. The other type of innovation is features, where we also need to figure out how to make money out of them, whether that's a feature and how that feature fits in. You don't want to end up with a patchwork where you have a thousand different ideas from different brains.
Coming back to the idea: proving out really quickly how it is going to help our customers. Customers on the mobile app side would be people who want to book hotel rooms. On the supply side it's the hotels. How does that work within the existing framework that we have? We're a mobile app, we do mobile-only. We're on iOS and Android and mobile web. If you look at our product, it's very, very simple. How we keep that simplicity is one of the things that, when you have an idea, you have to figure out.
S: Got you. Awesome. It seems like there's a DIY, doer culture where if you have an idea, you just make it, and if it works out and it's profitable, it can happen.
J: Yeah.
S: Cool. In terms of engineering teams, what is the structure here? Do you guys have small teams, big teams? What does that engineering structure look like?
J: We are strong believers in small teams. Whenever we're working on a project, we strongly believe that there should not be more than five or six people on that project. Those are feature teams that we form whenever we embark on a project. In terms of structure, the team that I lead at Hotel Tonight is the platform team. The platform team handles the backend infrastructure, the APIs, and any web-based products that we have. There's a mobile team, which includes iOS and Android, and there are QA and product teams. This is within engineering and of course, outside engineering you have supply, marketing, finance and so on.
S: Got you. How many engineers is it?
J: We have roughly about forty engineers including mobile, QA, platform.
Yonas: What's the biggest team out of that?
J: Platform is one of the biggest teams. Platform has roughly fourteen or fifteen folks now.
Y: Nice. Even split between Android and iOS?
J: Even split between Android and iOS, yeah.
Y: Interesting.
S: What platform would you say gets the most attention dedicated to it these days? iOS, Android, web, out of those three.
J: For us, definitely iOS. That's the platform that we actually started Hotel Tonight on back at the end of 2010, but we're always looking for opportunities or partnerships on Android as well as on mobile web.
To give you a little bit of background, last September or last August we launched a mobile web product to book hotels. Before that, you could only book hotel rooms on our iOS or Android app. Now you can book on any mobile device, but it's still mobile-only.
S: You were here for the launch of iOS initially or was that before you?
J: That dates before me but I can give you a little bit of background there from the stories that I've heard.
S: Yeah, what was the MVP?
J: It worked, and it was the best decision back then. Within two months, the team launched an app. I think Titanium is what they used, and a lot of CoffeeScript.
S: Oh my God.
J: Within those two months they also built the Rails backend and an interface for hotels to manage their inventory and their staff. They were able to launch it, and the go-to-market was really fast. That was the iOS app; for Android, I think back then they used HTML5 and jQuery. Over the next few years, we formed iOS and Android teams with really, really good experts and converted those apps to native.
S: Awesome. That seems to be the way to go these days. Something that really interests me is data storage. I know it can be different amongst your platforms but how do you guys store your data?
J: It has evolved over the years. Back when we started it was a single MySQL database sitting on an EC2 instance. Over the years it has evolved. We still use MySQL; that's one of our go-to storage platforms, but we've definitely evolved it into a more partitioned, more sharded approach. Other things that we use for storage are Elasticsearch, Redis, and Memcached. Different types of storage: some of them are persistent stores, some of them are caches. We use a lot of queuing systems; IronMQ is one of our go-to tools for interservice communication. For ETL and data warehousing we use Redshift, and there are other databases that we use, like Postgres for geo support.
MySQL is where most of the stuff started and, like I said, storing state is one thing that is challenging when you're scaling. You can horizontally scale your app servers or web servers, but storage is what ultimately gets you. That's where we've been evolving, removing single points of failure from the stack, and one of them has been MySQL.
S: Now you guys just have multiple points of failure.
J: (Laughs).
S: That's awesome though that you guys have specialized data storage for different things. Usually it's one or the other. What backend languages do you use to deal with the storage and retrieval?
J: Ruby on Rails is one of the platforms that we are heavily invested in. That's how we wrote our first app back in 2010.
Y: The first app for the Appcelerator?
J: Yep, the backend for the APIs and the hotel interface that we built. But over time we've evolved that into more Rails apps, and recently we've started looking at Go for tiny glue services. You have to get information from one place to another, so instead of putting that code in the monolith that we've been trying to break apart, why don't we try something like Go. Node.js is another platform that we started investing in, for the mobile web products I mentioned a while ago.
Y: What percentage of the code base would you say is Ruby?
J: I would say it's at least 80%.
Y: All the APIs, all Ruby?
J: Oh yeah.
Y: Objective C, Java?
J: Objective-C, and React for the mobile web app. Also, the interface that we have for hotels, that app uses React.
Y: You have a different app for the hotels.
J: Oh yeah, we have an app. When I say app, it's really a desktop and mobile web app, not a native app. That, and the mobile web app for consumers, uses React and a combination of Node and Rails.
Y: React, Node and Rails, okay.
S: What do you think of Go? I really like Go, are you a fan?
J: Again, going back to the point of right tool for the right job, it could excel in a lot of things but if we're trying to write a CRUD app, no, that's not where I would go.
S: You just want to stick to JavaScript.
J: Stick to something like Rails for a CRUD app, for a simple app. Go would be good when you have a specific need, like you want to get data from one place to another and you want something really, really fast. You don't want things like garbage collection or multiple layers of framework to get in the way; that's when you want to get down to something like Go.
S: I like how detailed you are when you think about partitioning different problems and using technology for that specific problem. Is there a really hard engineering problem that you've solved recently?
J: There are a couple of engineering problems that I can talk about. Let me set the context for the first one first. Back in 2014 we used to be a same-day service for booking hotels. As a customer you could only book hotels the same day; you could not book in advance. What Hotel Tonight would do is send a push notification at 9am to all our customers. It's a self-inflicted DDoS. Suddenly at 9am you see a huge spike of traffic, customers trying to come to the app and see what's available. That was really interesting because it put a lot of pressure on the database, and we did very limited caching back then.
S: Because of the actual push notification or because they were starting to use the app?
J: All of them were trying to come to the app at the same time because of that push notification, and it turns out that we don't have one specific algorithm for all the customers. We have a number of factors to figure out which hotels we show to a user. We don't show hundreds of choices to the customer; we show them only 15 hotels. Location is one of the very important aspects of the context for the customer. If a user is opening the app from L.A. versus San Francisco, they're going to see different results. What that means is the API powering that has limited cacheability.
That was an interesting challenge. There were big days like July 4th and Memorial Day weekend where way more people would be opening the app at the same time. So what we did back then was introduce a service called Fastly. We started using them. There are some APIs that cannot be cached for very long, but what we did was start caching, for example, the API which powers the hotel list for a very limited time, and the caching was also based on where the user was.
Y: You could do geolocation.
J: Yes.
Y: Geo-based caching with Fastly.
J: Yes. We are huge, huge fans of Fastly. I've used Varnish in the past. The amount of horsepower it gives you with edge caching... That's where you should stop: if you can solve a problem by introducing something like Varnish, you should absolutely use it. Recounting our experiences from back then, there were other APIs which could be cached way more aggressively, and I remember when we introduced Fastly, we cut down our servers by about 3x or 4x, just by introducing Fastly.
Y: Response time.
J: No, the number of servers. Response time was cut as well. Fastly responds, depending on the payload size of course, and we've seen responses in 30 or 40 milliseconds, depending on where the user is too. They have a network of nodes which pretty much coincides with where our customers are.
On the other hand, for an uncached API call, the request has to go through the entire stack: through all the load balancers, Rails, all the databases. Fastly cuts all of that out.
Y: Where is the core app hosted?
J: We recently moved to AWS. We used to use a platform as a service called Engine Yard. We had been using AWS for a long time, and earlier this year we moved the web containers and the background containers over and started using ECS, the Amazon EC2 Container Service.
We dockerized our app and run it using ECS, which is like a scheduling service for containers. That transition has been pretty good too.
S: Yeah we'll have to circle back to that. That sounds interesting. Coming back to Fastly, you were reducing the number of servers on Engine Yard, right?
J: Right.
S: When you introduced it?
J: Yes.
S: How did you find out about Fastly, and how did that decision come about? What was that process like?
J: It was not a hard process. We had a huge number of Varnish fans on the engineering team; we had seen the power of Varnish. On the other hand, we had been trying to get away from self-hosting tools. For example, MySQL: we used to host it on EC2 instances, and we moved away from that and started using RDS. Similarly, we were looking for a service for Varnish. Is there anything out there which provides Varnish as a service? Fastly was a name that came up when we looked around.
Y: Started Googling. Started StackSharing. I love it. Were there already folks internally who had heard of Fastly and were saying "we should go with Fastly," or did you consider other options?
J: I think we stopped at Fastly. There was some discussion about CloudFlare, but I think we also had some common friends at Fastly. Typically, when we're evaluating tools, it just goes back to the question about, if you have an idea, how do you go about it? Let's prove it out. Let's sign up for a free trial and see what works. With Fastly it was so simple: all we had to do was change our Route 53 DNS to point to Fastly and specify a TTL or cache headers on our responses. That's it, and it was up and running within a few days.
Y: You started routing some of the requests, you didn't do an automatic... or overnight.
J: That's right. We started rolling out some of the less critical traffic first to make sure that everything was working fine but, over time, it has become one of our go-to tools.
There's only so much you can cache, so we solved the 9am problem by making sure we're caching aggressively around that 9am spike. You cannot cache for a long time, especially something like the hotel list, because the business that we're in is last-minute. People come to us to look for hotels at the last minute, and at the last minute hotels are changing their prices and allotment all the time. They're changing every minute, so that means we can only cache so much.
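To make that concrete, here's a minimal Rails controller sketch of short-TTL, location-aware caching. The header values, TTL, and the `HotelList.for_location` helper are illustrative assumptions, not Hotel Tonight's actual configuration.

```ruby
# Hypothetical sketch: cache the hotel list at the edge for a short TTL,
# keyed by a coarse location bucket so nearby users share a cached response.
class HotelsController < ApplicationController
  HOTEL_LIST_TTL = 60 # seconds; last-minute inventory goes stale quickly

  def index
    hotels = HotelList.for_location(params[:lat], params[:lng]) # assumed helper

    # Surrogate-Control is honored by CDNs like Fastly/Varnish;
    # Cache-Control here keeps browsers from caching the personalized list.
    response.headers["Surrogate-Control"] = "max-age=#{HOTEL_LIST_TTL}"
    response.headers["Cache-Control"]     = "private, no-store"

    # Vary on a coarse geo bucket so the edge keeps one object per area,
    # not one per user. Assumes an upstream header carrying a rounded location.
    response.headers["Vary"] = "X-Geo-Bucket"

    render json: hotels
  end
end
```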
Then we started looking at what the bottleneck in the stack was, the reason we weren't able to scale. One of the bottlenecks was MySQL. When a customer would come in, we would basically... Let's say there are millions of inventory records in our database. We'll run a geo query on top of them and try to pick the 15 best hotels out of millions of records.
S: Why the 15?
J: Why 15? Because we believe that choice should be limited for the customer, and just by showing them the best 15 hotels, it makes it easier for them. We also make sure that there's a variety of hotels across different neighborhoods; a lot goes into the algorithm. We used to do some of that in MySQL and a lot of it in Ruby, and that was a bottleneck. MySQL was becoming a bottleneck.
We started looking around at different technologies. Elasticsearch came to our attention. A lot of people had been fans of similar technologies like Solr, and using Lucene for ranking hotels seemed like an interesting idea. We explored that and now we have Elasticsearch in our stack.
Y: Where are you hosting it? Yourself?
J: We use Found for hosting Elasticsearch. One of the things I want to mention there is that with the combination of Ruby and MySQL, picking the 15 best hotels used to take, depending on where the customer is and depending on the supply level, anywhere from 500 milliseconds to 1 second.
Y: It's going through Active Record, right?
J: It was going through the ORM for sure, and then MySQL, and then a lot of post-processing in Ruby to mix and match the hotels. The same thing that we did in Ruby and MySQL now takes 15 milliseconds with Elasticsearch. That was one of the interesting challenges.
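As a rough sketch of what that kind of geo-filtered lookup can look like with the elasticsearch-ruby client (the index name, fields, and distance are assumptions; the real ranking mixes in many more signals):

```ruby
require "elasticsearch"

# Illustrative only: fetch the 15 closest available hotels for a location.
client = Elasticsearch::Client.new(url: ENV.fetch("ELASTICSEARCH_URL", "http://localhost:9200"))

def nearby_hotels(client, lat:, lng:, size: 15)
  client.search(
    index: "hotel_inventory",
    body: {
      size: size,
      query: {
        bool: {
          filter: [
            { term: { available: true } },
            { geo_distance: { distance: "30km", location: { lat: lat, lon: lng } } }
          ]
        }
      },
      # Closest first; a production ranker would score on more than distance.
      sort: [{ _geo_distance: { location: { lat: lat, lon: lng }, order: "asc", unit: "km" } }]
    }
  )["hits"]["hits"]
end

hotels = nearby_hotels(client, lat: 37.7749, lng: -122.4194)
```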
I could go on and on about Elasticsearch, because then we discovered another problem with it. It was like, "okay, this is great. Are we done here?" No. It turns out that was the consumer side: a lot of consumers coming in and trying to see a list of hotels. On the hotel side, we work with a lot of hotels that are changing their inventory, availability, and rates very often. What would happen is that data has to propagate from MySQL to Elasticsearch.
Y: By the way, they're inputting that data right? You're not scraping or anything like that?
J: No, they're inputting that data. We also connect directly with them, so there are also APIs that we work with. That means the data is changing a lot of the time, and it has to propagate to all of the storage systems, including Elasticsearch. Wow. That's a really write-heavy load for Elasticsearch. Typical Elasticsearch usage is around logs and analysis; this was unlike that. The number of documents may not be that large, but the fact that they're changing all the time introduced a number of behaviors that we were not aware of when we started using Elasticsearch.
Elasticsearch would drop some of the writes silently. If you use one of their APIs, the bulk update API, you can watch for those errors. Back then we were like "oh, it's going to scale, Elasticsearch scales, right?" Then we discovered those errors, and now we make sure that we throttle the writes. We have a queue sitting in between Elasticsearch and our systems.
Y: What's the queue?
J: IronMQ.
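A minimal sketch of that pattern, assuming a queue client that can pop and requeue batches of messages: drain a batch, index it with the Elasticsearch bulk API, and requeue anything the response flags as failed instead of letting writes drop silently.

```ruby
require "elasticsearch"
require "json"

# Hypothetical indexer: pull a batch of inventory updates off a queue,
# write them with the Elasticsearch bulk API, and retry failures.
class InventoryIndexer
  BATCH_SIZE = 500 # throttle: never push more than this per bulk call

  def initialize(queue:, es: Elasticsearch::Client.new)
    @queue = queue # assumed to respond to #pop(n) and #requeue(messages)
    @es = es
  end

  def run_once
    messages = @queue.pop(BATCH_SIZE)
    return if messages.empty?

    body = messages.map do |msg|
      doc = JSON.parse(msg.payload) # message shape is an assumption
      { index: { _index: "hotel_inventory", _id: doc.fetch("id"), data: doc } }
    end

    response = @es.bulk(body: body)

    # The bulk response carries per-item statuses; "errors" is true if any failed.
    return unless response["errors"]

    failed = response["items"].each_index.select { |i| response["items"][i]["index"]["error"] }
    @queue.requeue(failed.map { |i| messages[i] })
  end
end
```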
S: Will it just know when it needs to be used, or is there a human behind it? Like an engineer who's watching and waiting.
J: We also have a lot of monitoring built into these queuing systems, pretty much all the systems. We use Datadog for a lot of metrics across all these services. If the number of messages in that queue goes beyond a certain threshold, then we alert the person on call.
Y: Datadog is like your dashboard right?
J: More than a dashboard. I think dashboards are fun to look at, but a lot of times you don't get meaningful or actionable insights out of them. Action is the key. Alerting is where Datadog really excels for us. We've set alerts for our systems, starting from the load balancers to all the databases to all the caches. We have checks and balances to make sure that the on-call person is alerted if, for example, the CPU of the RDS database goes beyond a certain limit.
Y: It sounds like a lot of the tooling is enabling you to have that approach right? You can say we don't need someone that's just DevOps and can work with Docker. A lot of the tooling sounds like it's helping with that right?
J: Absolutely. ECS makes it super simple, and we use Terraform for templatizing our infrastructure as code. All of us can read what the infrastructure looks like. Back in the day there were a bunch of systems in place with very little documentation about them, and now the infrastructure is code, with Terraform.
There's a lot of innovation and a lot of cool things happening in that space.
Y: Do you guys use CloudFormation at all?
J: No, we don't use CloudFormation. Terraform provisions all the dependencies that we have, all the ElastiCache and RDS and so on. It's great. We also have multiple staging environments; it's pretty much the same recipe, so we can just spin these new environments up and down.
Y: What does your build, test, deploy flow look like? Is everyone using Docker locally?
J: Not at this point. Some of them are but that's one of the things that we are looking to evaluate next.
Y: Are you using Docker, personally?
J: Oh yeah. Yeah. Again, there are certain dependencies. What we're trying to figure out, as we grow the number of services we have, is how to extract out interfaces to external services. That's one of the things that we are looking at.
Our typical workflow is this: we use Solano. As soon as you check something in, as soon as you push a branch or open a PR, it goes through Solano CI, which makes sure that everything is cool.
All the tests pass, and we make sure they run within a few minutes rather than 30 or 40 minutes. That's where the power of Solano CI comes in.
You create a PR, tests run on Solano, other folks on the team review your PR, then you merge that PR to master. Then Jenkins, in combination with Solano, kicks off a deploy to the staging cluster, and that's when, depending on the feature, we do some manual QA testing and then get it out to production.
Y: How does Docker fit into the flow of it? Once you push it out to GitHub then it's being containerized then put into Solano?
J: Yes, it's being containerized. We use Jenkins to create those Docker builds and the images that we need, and then ECS runs the task definitions that we have defined. So that's our build pipeline.
Y: How long do the tests take to run? I'm just curious.
J: If you run the tests locally, it's going to take a while. That was one of the things we started exploring, I think a couple of years ago now: CircleCI and Solano. I remember on Solano, I think it takes five and a half minutes.
Y: Wow, that's pretty good.
J: Again, we still have a monolith that we're trying to get away from as far as the platform is concerned, and that's going really well. We're just looking at different things that we can extract out and encapsulate in a different service.
Y: Does mobile have a different process there in terms of build and deploy?
J: Knowing the platforms, iOS specifically, there's the approval cycle once you submit an app. Every two weeks we try to get something out, a new release, and hot fixes of course as soon as we discover something. We like to have a cadence for getting a release out on those platforms. Not feature-driven releases; it's more that we have a release timeframe where we have to get something out.
One of the things that I didn't touch upon, going back to challenges: we solved the customer side by introducing caching and moving to Elasticsearch, and also the writes that were happening to Elasticsearch. But remember, when all those hotels are changing their rates and availability, all those writes were still going to a single database. There was a hot spot there. That was one of the places where we looked at solutions out there and how different companies are doing it. We built a schema-less, append-only store on top of MySQL, pretty much along the lines of what FriendFeed did. There's a pretty good blog post by one of the FriendFeed guys, from around 2009. Then Uber recently also moved to the same model, a schema-less append-only store.
That allows us to horizontally have more partitions and shards and also just have it schema-less. Schema-less and append-only is the key. We store different versions of the same room or inventory, so you're not overwriting a single row; you're creating different versions of it. And schema-less because there's stuff changing all the time, so you don't have to go and change the schema. You don't have to run online schema changes.
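A very rough sketch of the idea, with an assumed table layout (the FriendFeed post and Uber's schemaless write-ups describe the real designs): every write appends a new version of an entity as an opaque JSON blob, and reads pick the highest version.

```ruby
require "active_record"
require "json"

# Assumes a MySQL connection has been established, e.g.:
#   ActiveRecord::Base.establish_connection(adapter: "mysql2", database: "inventory", ...)
#
# Hypothetical table:
#   CREATE TABLE inventory_versions (
#     id BIGINT AUTO_INCREMENT PRIMARY KEY,
#     entity_id VARCHAR(64) NOT NULL,
#     version BIGINT NOT NULL,
#     body JSON NOT NULL,            -- schema-less payload
#     created_at DATETIME NOT NULL,
#     UNIQUE KEY (entity_id, version)
#   );
class InventoryVersion < ActiveRecord::Base
  # Writes never UPDATE; they INSERT a new, higher version of the entity.
  # The unique key on (entity_id, version) guards against concurrent appends.
  def self.append(entity_id, attributes)
    latest = where(entity_id: entity_id).maximum(:version) || 0
    create!(entity_id: entity_id,
            version: latest + 1,
            body: attributes.to_json)
  end

  # Reads pick the highest version; older versions remain as history.
  def self.current(entity_id)
    record = where(entity_id: entity_id).order(version: :desc).first
    record && JSON.parse(record.body)
  end
end
```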
Y: Wow. That's awesome.
We still have a few minutes. Any other tooling and things that have been helpful? We don't have to touch too much on the mobile side because you're on platform but there's a lot of talk about some issues with Ruby specifically. I know you mentioned Go, but are you guys looking at any other new technologies or have you guys had success with Ruby and you're able to have it perform at a high level?
J: Ruby is performing pretty well for us. There are some use cases where people have started looking at Go. There are a lot of Ruby enthusiasts in the house, including me; I've been in the Ruby community since 2006. Elixir is one of the things a lot of the Ruby community is going into, and there are a lot of people on the team who are exploring it.
Y: Are you a fan of it? Have you played around with it?
J: I haven't played around with it.
Y: Okay, so you're happy with Ruby.
J: I am happy with Ruby, but at the same time my schedule is not even 50-50 between coding and managing; it's even more on the managing side now, making sure the team culture is healthy and projects are getting done. Infrastructure really excites me. It excites me more on, like I said, scaling, especially the storage systems. You can scale Ruby, you can add more web servers, right? But scaling state is really challenging.
Y: Yeah, it sounds like a lot of the issues that you have are because of the fact that there's a lot of writes on the supplier side but I guess on the customer side you don't have those sorts of issues right?
J: We do. Not a lot of them. We moved away from sort of the "callback hell" that people would run into, in the Rails models.
Y: after_save.
J: after_save, and it was a long list of those callbacks in some of the Rails models that we had. We moved away from that and started backgrounding them as much as we can.
Y: Bunch of rake tasks.
J: Not rake tasks, but using something like Sidekiq, just queuing it offline. The next step we've started looking into is event streaming. Let's say a customer changed one of their attributes and now the rest of the systems need to know about that change. Previously, the first naive implementation of that is you have 5 or 10 different callbacks which do the same thing. The other way to look at it is producer and consumer. The producer is going to produce that event and put it into a queue, like Kinesis or Kafka, and then all the consumers in the world are going to consume it and take an action on that trigger. That's the model that we are moving into. Conceptually, it's just easier to understand what's going on in the system.
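For readers unfamiliar with the backgrounding pattern, a minimal Sidekiq sketch might look like the following; the model, job, and service names are made up for illustration and are not Hotel Tonight's actual code.

```ruby
require "sidekiq"
require "active_record"

# Hypothetical sketch of moving work out of after_save callbacks.
# Before: several callbacks doing slow work inline in the request cycle.
# After: one lightweight callback that enqueues a background job.
class SyncCustomerChangesJob
  include Sidekiq::Worker
  sidekiq_options queue: "default", retry: 5

  def perform(customer_id)
    customer = Customer.find(customer_id)
    # Each downstream effect (search index, caches, e-mails, and so on)
    # happens here, off the web request path.
    SearchIndex.update_customer(customer) # assumed downstream service
  end
end

class Customer < ActiveRecord::Base
  # Runs after the transaction commits, so the job never sees uncommitted data.
  after_commit :enqueue_sync, on: :update

  def enqueue_sync
    SyncCustomerChangesJob.perform_async(id)
  end
end
```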
Y: Yeah, that means you have one central place where you can see everything that's happening.
J: Right.
Y: You're now introducing Kafka or you've had that already?
J: We are evaluating Kafka and Kinesis right now. Kinesis is a service by AWS, so you don't have to worry about some of the hosting challenges that you have to take on with Kafka, but at the same time we're trying to compare what edge cases we will run into with both Kinesis and Kafka. We're making that call, deciding whether we should use Kinesis or self-host Kafka, and going after that.
Heroku recently launched their Kafka as a Service so that's one of the things that we're also looking into.
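On the producer/consumer model itself, here is a hedged sketch of what the producer side could look like with Kinesis via the AWS SDK; the stream name and event shape are assumptions, and the team was still deciding between Kinesis and Kafka at the time.

```ruby
require "aws-sdk-kinesis"
require "json"
require "time"

# Illustrative producer: publish a "customer changed" event to a stream
# instead of fanning out through model callbacks. Consumers subscribe to
# the stream independently and react to the event on their own schedule.
KINESIS = Aws::Kinesis::Client.new(region: "us-east-1")

def publish_customer_changed(customer_id, changed_attributes)
  event = {
    type: "customer.changed",
    customer_id: customer_id,
    changes: changed_attributes,
    occurred_at: Time.now.utc.iso8601
  }

  KINESIS.put_record(
    stream_name: "customer-events",   # hypothetical stream name
    partition_key: customer_id.to_s,  # keeps one customer's events in order
    data: JSON.generate(event)
  )
end
```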
Y: Interesting. Nice. I think we've gotten through everything now, right? Has Looker been helpful for you all?
J: Looker has been very helpful, for sure. It's about how we democratize data. We have so much data; how do we get access to it? Not just me, because I can go and run SQL queries, but the rest of the team, the rest of the company, needs easy access. Looker is one of those tools that is designed to do that.
Y: One thing that popped in my mind as you described the push notifications at 9am. Can you briefly describe what happens if a server goes down in the middle of the night? What's the flow? Are you guys using PagerDuty to get a text message?
J: We have the platform team, and each member of the platform team goes on call for a week. There's a primary and a secondary: the primary gets paged, whatever time it is, and if they don't answer, it goes to the secondary, and then I am the third level. I'm in the rotation every 14th or 15th week, but I'm also the third level, so if no one on call is answering, then I get on.
Depending on the notification settings, we definitely want people to get a call rather than a text, depending on the nature of the problem. If there's an alert from the database, you'd better look at it; that's a call. We've been trying our best to reduce the noise there, through the alerts.
All the noise. Making sure that high-priority, critical alerts are getting paged. I think the next step that we're looking at is how we can take alerting to the next level: not just alerting, but actually figuring out a way to fix that problem right then and there. For example, let's say our queuing system is backed up; can we spin up more workers on the fly, rather than needing an on-call person?
With no intervention. That one needs to be done carefully, because you don't want to spin up a bunch more workers which end up bringing down your database.
S: It sounds like this could be machine learning.
J: Yeah, anomaly detection is one of the things machine learning is used for... a lot of companies are using it. I think Sumo Logic was one of the companies that was showing it. They came on site a few weeks ago, and some of it seemed really interesting.
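Picking up the guarded auto-remediation idea from a moment ago, one way it could be sketched (service names, thresholds, and caps are invented for illustration): check the queue depth and raise the ECS worker service's desired count, with a hard ceiling so scaling out can't stampede the database.

```ruby
require "aws-sdk-ecs"

# Hypothetical auto-remediation sketch: if the indexing queue is backed up,
# add workers, but never beyond a hard cap the database can tolerate.
ECS = Aws::ECS::Client.new(region: "us-east-1")

QUEUE_DEPTH_THRESHOLD = 10_000
MAX_WORKERS = 20 # ceiling so scaling out can't overwhelm the database

def scale_workers_if_backed_up(queue_depth)
  return if queue_depth < QUEUE_DEPTH_THRESHOLD

  service = ECS.describe_services(
    cluster: "production",             # assumed cluster/service names
    services: ["inventory-indexer"]
  ).services.first

  desired = [service.desired_count + 2, MAX_WORKERS].min
  return if desired == service.desired_count

  ECS.update_service(
    cluster: "production",
    service: "inventory-indexer",
    desired_count: desired
  )
end
```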
S: There's so much more to talk about. There are so many questions I have for you, but I know your time is limited and we want to wrap this up. For me, from what you've said about Hotel Tonight, it seems like a very cool culture and some place that, if I were applying for jobs, I would apply to. But one thought I had was about the industry in general. You know there's a sharing economy with Airbnb and all these sharing services. What do you see as the future of the accommodations industry, of hotels? Do you think people will continue to want hotels, or do you think it's going to move this way or that way? What's your take on it?
J: I think there are different use cases, depending on where the customer is in their life cycle. If you're going on a vacation with your family and planning it a year in advance, then I think some of the accommodation services like Airbnb excel. For a lot of other use cases, if you want to be spontaneous, if you're on the go like me: I'm traveling a lot of the time from the South Bay to San Francisco, and at 6pm I'm just like, "hey, should I just call it a night here, or should I extend my night, book a hotel room here, and go back tomorrow?" There are a lot of those serendipitous experiences that Hotel Tonight unlocks. That's a use case for the millennial generation with the advent of mobile phones. We're just getting started.
S: Just a parting question. What's something that excites you, something that you're looking forward to? It could be technical or it could just be life in general, like I have a dog that I'm gonna buy or something. Something in the next day or next week or maybe even next month.
J: All right. Personally I like scaling. I get a high when I'm working on a problem that's around scaling, whether it's a storage system or just scaling to the number of users. Growth. I've asked myself that question, what excites me, and scaling a company is what really excites me underneath: growing the customer base, how to retain users. You could be doing 100,000 RPM, but if your users are not sticking... Retention is another thing. A combination of both business and technical is what excites me.
S: Awesome. Well thanks so much.
J: Yeah thanks a lot. Thank you guys.