Algolia started out as a mobile SDK for apps that wanted better offline search. After customers starting asking for a hosted search product, they quickly realized that it could be an opportunity for them to move beyond just mobile apps. They now power search for a host of web apps including Pebble, WeFunder, CodeCombat and HackerNews. We sat down with them to learn more about their search product and the technology behind it.
The interview is divided into two parts:
C++ Ruby Rails MySQL Redis Google SparseHash
Amazon EC2 Amazon S3 Google Analytics Google Apps KissMetrics Amazon CloudFront Amazon EBS Amazon Route 53 Twilio Amazon RDS Google App Engine Hipchat Serverdensity Pingdom Github Intercom Stripe OVH Dropbox Codeclimate Travis CI Coveralls Mailjet Mailchimp
Leanstack: Let’s just start with how Algolia got started. Where did the idea come from?
Julien: It all started with consulting missions for startups in the Bay Area. One of them was using Google App Engine and they were developing their search feature using App Engine indexing. The problem is that App Engine only provides a classical document search engine and is not adapted to perform search “as you type” on small records, stored in databases, such as names of people or song titles. As I had developed three search engines in the past, they asked me to develop a very small search data structure that would fit their needs.
Nicolas: Julien has been working on information retrieval and natural language processing for more than 10 years, and me, for more than 15 years.
LS: So you had been working in search for a long time and then you thought there needed to be a better way to do search?
N: What Julien built for this consulting mission was a very lightweight search engine, and we realized that it could be a very good fit for mobile applications. This intuition was confirmed when we saw on Stack Overflow many unanswered questions about embedding a search engine directly in a mobile app. People wanted to embed Lucene and didn’t have any success with that. That was our “go” to start developing this search engine SDK embeddable in mobile apps. So that was actually our first product: an offline search engine.
J: The original idea was that developers could use this SDK in their mobile app. They had to be able to integrate it in their applications in only a few minutes without any specific search background.
N: So we started the company with this product in 2012 with iOS, Android and Windows Phone availability. It actually was a huge tech success, people loved it, but there was just no market for it.
LS: What kind of companies were using the SDK?
N: It was basically consumer apps, mainly apps with offline data. For example, a travel guide. Let’s say you are traveling to Paris and once there, you want to be able to look for local information but you don’t want to pay for the roaming fees. You simply download the guide on your mobile device and once you are in Paris you can search it offline. In that case it brings a lot of value.
The problem we had was that developers, especially on Android, were not ready to pay for an SDK. Few apps rely on offline data and they prefer to have a very basic search using SQLite.
N: At the same time, we got a lot of feedback from people who loved the experience but they wanted an online version of the engine so they could offer the same search experience on all their data.
We had built the search engine with mobile in mind, search needing to be instantaneous and tolerate typos which are so frequent on smartphone keyboards.
But this search experience is something that is incredibly difficult to build for any developer, not only those working on mobile apps. So they really saw the value of an easy to use REST API on their online data.
That was basically the time when we decided to build this SaaS version. It was early 2013. We started to build it in March and got a lot of very good feedback. We discovered that we had a solution for a problem that was very common. So we started to dedicate our work to this differentiation, targeting databases rather than documents.
LS: So it was really because you started on mobile that you made the decision to focus on the databases as opposed to documents, right?
N: Correct. What had led us to mobile was the nature of the data we were searching for. On a mobile device you search for contacts, for places, ... You don’t search for big documents because that is not the way you use a mobile device. This is what we optimized the engine for and we used the same core technology when we moved to the SaaS version. The fit was really good to perform full-text search in a database. Our core value, the value of the engine, is that it was designed and optimized for database records. Everything we have built ever since was always done keeping in mind that our target was databases.
LS: Can you explain the differences in terms of experience? Focusing on databases results in faster queries, can you talk a little more about the difference between the two and how you guys view the two?
N: Yes, there are not so many people talking about the two kinds of search, so it’s important to really explain the difference. Most engines have been designed to search in documents.
What that means is that when you want to rank results, you are using ranking rules that have been designed for documents. This ranking counts the occurrences of the terms of the queries inside each document and uses statistics based on TF-IDF. You want documents that contain many times the query terms you typed to be ranked higher. And this is a statistical formula that will generate a score for each document. This score is a bit cryptic, you can tune the formula, adapt it a little bit. But in the end you get a score that is very difficult to understand. It’s very difficult to tune a working formula for a document search engine. And that works well with full words but it doesn’t work well with prefixes, like when you are not typing the word completely.
So you need to do some hacks to get something approximately correct.
J: It’s because it was designed to search in big chunks of text, such as web pages like Wikipedia where you have a huge amount of text, for which these statistics make sense. In our case, on databases, it's very different. If you are searching for a simple product name, you do not have such constraints.
N: Because the data is different, we needed to come up with a totally different way of ranking the results, closer to the way people think. For example, if you search for an iPhone on an e-commerce website, you just want to find the word iPhone in the title and not at the end of the description. So the position of the word in the results is much more important than the number of occurrences.
You also don’t want the iPhone 3G, you want the latest iPhone 5S, the most popular one. So the advantage of having this database focus is that we can consider this kind of information, which is very difficult to do with a document search engine.
You also want to take into account other factors such as did I misspell my query? If you are an e-commerce company you may want to get the best sales first in your set of results. So you want to combine all these together and this mix is too complex with a document search engine. So we have developed a different approach that takes all these criteria into account in a very explicit way, everyone can easily understand why you get a certain set of results and why they were ranked that way.
LS: And a lot of this has to be customized, right? A lot of this is very specific to your app. How do you deal with that?
J: We have a customization that is very easy compared to other engines. We have a few settings compared to several thousands that you can customize in other solutions. So this is easier to customize and we also explain to our users how to get a good search experience on their data.
N: We consider that our default ranking is a good fit for 90% of the use cases. You basically have two settings to configure to have a good search:
1) Specify the importance of the attributes you want to index. For example, the names of products being more important the descriptions or the comments.
2) Define the popularity of the objects. For example the number of sales or the number of likes.
With these two settings you basically get an engine that is already extremely good in terms of relevancy.
J: For example one of the use cases that we show on our demo page is a search engine for TV series. The first setting specifies that the name of the TV show is more important than the the name of the actor and the second setting specifies that the popularity is defined by the number of followers. With these two settings, you get relevant results at the first keystroke.
N: And if they want to go further, such as showing the matching categories as you type, of course you can get advanced settings for the API. We also have an administration interface where you can configure every setting online.
LS: Very cool. So let's talk about the tech behind it.
J: So basically we have two big parts: an API that is used by our customers. And we have the website with the administration interface. These two parts are very different in terms of architecture.
So if we look first on the website and the administration interface, we are using Rails with bootstrap for the UI. We are using AWS for the hosting with RDS (MySQL) and EC2 in two different availability zones with an ELB for load balancing. Our AWS instances are located in U.S-East and we are using CloudFront to store all assets in order to provide a good experience anywhere in the world.
On the API servers, we store some real-time data using Redis and keep logs of the last API calls for each user. We also store the API counts in real-time, so each time you perform an action we increment a counter in Redis, to be able to give you this real-time information.
LS: Can you talk a bit about how the data is stored?
J: When you send your data to our API, we replicate the data on three different hosts which are the API servers. So indexing is done on three different hosts. They use a consensus algorithm to be sure to be synchronized, which is RAFT. It’s a kind of Paxos algorithm.
It’s the same kind of algorithm Amazon S3 or Google App Engine use. You need to be sure that your three different clusters are fully synchronized. Even if one is down for some time, even if you have a reboot or anything happens, you have to be sure they are fully synchronized.
When you send a write action to the API, we will send it to the three different servers and we will affect an ID of transaction to this write. This ID is an increasing integer across the cluster of three hosts decided via a consensus. The API call returns a taskID when the job is written on at least two hosts. You will be able to check when the job is indexed using the taskID.
N: You’ve got to ensure not only that you have replicated the data, but that it is in the same exact order, so these three hosts must always be synchronized and in the same state.
J: We are doing that for high availability. And even if one host is down, even if our provider has an availability zone which is down, the service needs to be up and running. So exactly like S3, we replicate three times the information to be sure you always have something running.
We also use this redundancy for performance because we share the queries across the three different hosts. For example if we receive one thousand queries per second, we send 33% of these queries to each host.
Our search engine is a C++ module which is directly embedded inside Nginx. So when the query enters Nginx, we directly run it through the search engine and send it back to the client.
There is no layer that slows down the process. It’s extremely optimized for the queries.
LS: So would you explain how the data flows end-to-end?
J: We have different clusters in different data centers, each of them is composed of three different hosts. Most of our clusters are multi-tenancy and host several clients.
These are very high-end machines. It’s basically a server with 256 Gigabytes of RAM, about 1T of SSD in Raid-0, 16 physical cores >= 3.5Ghz. We are not relying on EC2, we are using dedicated servers and several providers.
We also have very good bandwidth associated to these clusters: we usually have 4.5 Gbps of dedicated bandwidth per cluster because we really want a high quality of service. We did not make any compromise on the hardware and the bandwidth in order to offer the best service.
Each index is a binary file in our own format. We put the information in a specific order so that it is very fast to perform queries on it. This is the algorithm that we have developed for mobile.
Our Nginx C++ module will directly open the index file in memory-mapped mode in order to share memory between the different Nginx processes and will apply the query on the memory-mapped data structure.
N: It’s not rocket science, but it’s a nice way to avoid a SPOF in a hardware load-balancer. All our API clients are released on GitHub under the MIT license, so everything is accessible on GitHub.
LS: Is there some sort of queue for the index jobs?
J: It’s not exactly a queue, but it is a consensus. Each server has a builder process that receives indexing jobs from nginx. When you add an indexing job, the nginx that receives the job sends it to the three different builder processes and launches a consensus. It’s a vote between the three different builders to have an unique ID. The consensus is based on an election algorithm, at one point in time, we need to have one master builder and two slave builders. If the master is down the two remaining slaves will automatically elect a new master between them. The master performs the unique ID allocation and moves the temporary job in the indexing queue on all the builders. When two out of the three hosts have finished this operation, we acknowledge the user. There is an open source software that you can use for this kind of ID allocation, the most popular being Zookeeper. Zookeeper is a small layer that you can use to have distributed information in your cluster. So you can, for example, share the information on who the master is in your cluster.
The problem with Zookeeper is that a change in the topology can take a lot of time to be detected. For example, if a host is down, it can take up to several seconds to be detected. This is way too long for us, as in one second will have potentially thousands of indexing jobs to process on the cluster and we need to have a master that attributes an ID to be able to handle them. So we have built our own election algorithm based on RAFT. With this algorithm we ensure that we have one consistent ID allocator across the three different hosts. And if this leader is down, you elect another one within milliseconds.
The deployment process is very easy. We kill the build process and do a hot reload on Nginx process via signals. By killing the build process, we automatically test our election algorithm all the time and do not wait to have a host down to test it. This approach is similar to the Chaos Monkey approach of Netflix that stops production process in a random mode.
LS: That makes sense. So that takes care of how data is indexed and then also queried.
J: We have a third part in the architecture which is not important for production, but is important for testing.
LS: Yeah, I was going to say we should probably get into your build, deploy, and test process.
J: We provide a status page of the API with realtime metrics about the service. To do that, we have developed probes running on Google App Engine. These probes are performing indexing and queries all the time on all our clusters and they automatically detect any kind of problems.
We also offer a dedicated status page for specific users. For example we have a specific status page for Hacker News. That’s also the case for large enterprise customers who want their own infrastructure. If they want dedicated servers, we provide them with a dedicated status page as well.
And there is deployment: there are three of us pushing code to GitHub. We have continuous integration mainly for the API clients with a set of unit tests and non-regression tests. All the tests are applied at commit time using Travis CI, we are also using Code Climate to test code quality and Coveralls for code coverage.
For production, we have an automatic deployment script that applies a big set of unit tests and non-regression tests before deployment. We also have an automatic script for rollback operation if a problem is found with the new version.
LS: And then do you have a staging environment?
J: Yes, we have a cluster which is just for testing. And we have several kinds of production. We have critical production clusters and others for staging.
N: Depending on the importance of the updates, we will not start with the most critical clusters.
J: We haven’t had any search downtime for the moment. We have only had one indexing incident that lasted eight minutes since our launch in September 2013. Since our launch we’ve served close to 4 billion API calls.
N: This downtime was just for indexing and not for searching and only on one cluster, so not all customers were impacted. Maybe 15 to 20 customers were impacted during those eight minutes. So we quickly reached out to these customers and only two out of the 20 had actually noticed the problem. And these two were really happy with the way we handled the situation.
J: We believe in transparency. We wanted to explain what happened and how we solved it.
N: So we wrote a post-mortem blog post.
LS: Do you gusy provide any logging?
J: We use Redis to keep the last logs of users, the last error logs, and the counters. In fact, on the API server we only keep the last activities of all our users. We then have a dedicated cluster that processes all the logs to provide statistics and analytics on the users’ dashboards.
N: There are two kinds of analytics. There are the raw analytics that give you the response times you can see in your dashboard today. You can get response times as well as the 99% percentile of response time, ...
And we are working on more high-level analytics on the content of the queries. The idea is to know what are the most frequent queries, the most frequent queries without results, ... We’re working on it and it should be ready soon.
LS: In terms of monitoring, what sort of server monitoring do you guys use?
LS: Any other services that you guys are using?
N: In each cluster, there are three hosts but we can also do search across many data centers (Europe, US, Asia). You can distribute this search on these different data centers. So the end-users would query the clusters closest to them. And we’re doing that using Route 53.
J: Yes, it really depends on the location of the end-user. Route 53 detects the closest server and directs the user to it. This is what we are using for the demos on our website, you always get a good response time.
We use HipChat internally and we have a specific room for the website that dumps all sorts of data such as exceptions, new signups, payments, ... All of that is pretty common, but we are also using HipChat for support. If you go to the website, you can chat with us. In the chat room many people can join at the same time in the same room. We can chat with the customers and can even past some code with highlighting. It’s pretty nice when doing support for developers.
LS: Are there any open source tools or libraries that are important and really helpful for you guys that you haven’t already mentioned?
J: We are also using quite a lot Google SparseHash library. It’s a very low-level and very high performance hash table that we are using in all the low level C++ code. It’s very efficient.
We're using tons of open source libraries, it would be quite long to list them all! :)
And then we are doing a few things to help the developer community. The last thing we have done is Depstack. It’s a tool to find and vote for popular libraries. What we are doing behind the scenes is we index all the different libraries we can find for different languages (Ruby, Python, Go, ...). Then, as a user, you can sign up with GitHub and vote for libraries. We also detect which GitHub projects are using other libraries and count a vote for them.
By doing that we are trying to help people identify the best and most used libraries.
LS: Very cool, sounds familiar :)
J: So we will continue to support the community with initiatives like Depstack. Oh and we also provide Algolia for free for community projects.