How Sqreen handles 50,000 requests every minute in a write-heavy environment


Sqreen is the industry’s first provider of Application Security Management (ASM), a unified platform for security, operations, and engineering teams to help them succeed in building and running secure applications together, without code changes, reduced engineering velocity, or added operational burden. To handle the needs of modern application security, Sqreen uses an agent-based solution that deals with a huge volume of writes.

Backend-related challenges can generally be split into two large categories with distinct kinds of solutions: read-heavy and write-heavy. Read-heavy workloads are much more common and generally easier to deal with. Write-heavy workloads are much less common, and are more resource-intensive.

At Sqreen, our customers add an agent into their application’s code, using a package/gem/jar, which runs across multiple servers. For each process, an agent network session is started with Sqreen’s backend servers. This connection enables the agent to fetch security rules and keep them updated (a read problem), and to send security events back for further analysis after they have been detected (a write problem). Usually our agent connects once, reads the specific rules, and then settles into a long life of sending us events (occasionally) and security usage metrics (a few times per minute). This means that the Sqreen backend architecture mostly has to deal with a write problem. We currently handle over 50,000 requests per minute, 95% of which are writes.
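To make that traffic pattern concrete, here is a minimal sketch of what such an agent session could look like. The endpoint names, payloads, and timing are illustrative assumptions, not Sqreen’s actual agent protocol:

```python
import time
import requests

BACKEND = "https://backend.example.com"  # hypothetical endpoint


def run_agent_session(api_token):
    session = requests.Session()
    session.headers["Authorization"] = api_token

    # One read at startup: fetch the security rules for this application.
    rules = session.get(f"{BACKEND}/rules").json()

    events, metrics = [], {"requests_seen": 0}
    while True:
        # ...the instrumented application runs; detections append to `events`
        # and counters accumulate in `metrics`...

        # Writes dominate from here on: metrics are flushed a few times per
        # minute, events whenever an attack was detected.
        if events:
            session.post(f"{BACKEND}/events", json=events)
            events.clear()
        session.post(f"{BACKEND}/metrics", json=metrics)
        metrics = {"requests_seen": 0}
        time.sleep(20)
```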

So how did we get to a stage where we could handle this sort of write-heavy load?

The early days: switching from MongoDB to DynamoDB

The very first PoC of Sqreen was created using Meteor. Meteor’s promise is to make the backend/frontend boundaries disappear. To achieve this, Meteor relies heavily on MongoDB.

The Meteor architecture makes it extremely easy to create early prototypes. Unfortunately, this system, while able to synchronize plenty of data between the database and browser, didn’t scale very well for us. We quickly outgrew Meteor, and created a dedicated React dashboard that got data from a Flask–powered REST API.

On the agent side, saving data in Mongo worked very well. In production, Mongo was deployed in a replica set spread across our AWS availability zones. Mongo also has native support for expiring data, which we used for many parts of our data model. This enabled us to keep plenty of fresh data live and easy to access, while automatically removing unneeded data as it got stale. In doing so, we keep the overall data size of Mongo under control.
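Mongo’s expiry mechanism is the TTL index. A minimal sketch with pymongo, where the database, collection, and retention period are illustrative rather than our actual schema:

```python
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client.example_db.request_records  # hypothetical collection

# TTL index: MongoDB's background task deletes documents roughly
# `expireAfterSeconds` after the date stored in `created_at`.
collection.create_index("created_at", expireAfterSeconds=7 * 24 * 3600)

collection.insert_one({
    "created_at": datetime.now(timezone.utc),  # must be a BSON date for TTL to apply
    "payload": {"path": "/login", "verdict": "ok"},
})
```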

Some data, though, cannot be made to expire. Every reported security event and user monitoring event retains its value for a very long time. While security events are only generated upon attacks, and so have a relatively small volume, user monitoring events (for Sqreen’s Account Takeover protection) tend to accumulate quickly.

As people started to really use the user monitoring feature, we predicted that our simple first solution (i.e., simply writing these events to MongoDB) would not be viable for long. We needed to split this data off to another system. The biggest criterion we had for this other system was that it should scale easily. The new solution also had to be quick to adopt and stable, as we couldn’t afford to dedicate full-time personnel to it just yet.

We took a survey of available databases and became interested in Cassandra. Cassandra scales up very well by pushing the developer to think directly about how data will be sharded. However, while the database engine itself is quite stable, it mainly scales by adding more servers to the cluster, which places a big burden on operations. We would need a full–time person to handle the complexities of operating our own Cassandra cluster, which was a no-go at the time.

We then set out to find alternatives. We could, for example, deploy Cassandra as a managed service through some other provider or, it turns out, use AWS directly. Cassandra was actually directly inspired by a paper published by Amazon that tried to solve the same problem some years before. The latest descendant of that research is available as a managed service on AWS, called DynamoDB.

DynamoDB is a fully managed service that stores objects in a manner very similar to Cassandra, but with much less complexity. DynamoDB fit our needs much better, so we made the change.

The actual code change from using MongoDB exclusively to using DynamoDB for our user monitoring data was not trivial. First, as the two databases are very different, the data model had to be reshaped. When working with Mongo, people generally work like they would with a SQL database, identifying entities to work on and persisting them. In DynamoDB, one needs to think in terms of the actual answers needed. From these, the queries can be derived, and thus the shape of the model as well.
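As an illustration of that query-first mindset, here is a hedged sketch with boto3. The table layout (partition key per account, sort key on the event timestamp) is an assumption chosen so that “latest events for one account” is a single query; it is not our actual schema:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_monitoring_events")  # hypothetical table name

# Write path: one item per user monitoring event.
table.put_item(Item={
    "account_id": "acct-42",                    # partition key
    "event_ts": "2018-06-01T12:00:00Z#3f9a",    # sort key: timestamp plus suffix for uniqueness
    "event_type": "login_failure",
    "ip": "203.0.113.7",
})

# Read path, designed up front: "latest events for one account" is a single Query.
response = table.query(
    KeyConditionExpression=Key("account_id").eq("acct-42"),
    ScanIndexForward=False,  # newest first
    Limit=50,
)
events = response["Items"]
```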

Also, DynamoDB auto-scales based on actual throughput usage. This scaling has some delay, and any requests that happen to exceed capacity will result in an exception being thrown in the application. Read requests can and will generally be retried -- the worst case scenario being that a customer refreshes the offending page on our dashboard -- but writes have to be handled a bit more carefully.

To ensure that all writes are always performed, we had to add a recovery system for failed writes. This system reuses our digestion system; it pushes the offending object into an SQS queue to be digested (i.e., written) at a later date. We simply retry every few minutes until DynamoDB has had enough time to scale up. For these reasons and others, we progressively rolled out DynamoDB as a principal storage solution across our customer base, making heavy use of feature flagging and experiment systems.
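A minimal sketch of that fallback, assuming boto3 and a hypothetical queue URL: when DynamoDB throttles a write, the item is reposted to SQS so the digestion workers retry it once capacity has caught up.

```python
import json

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

TABLE = dynamodb.Table("user_monitoring_events")  # hypothetical table
RETRY_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/digestion-retry"  # hypothetical


def write_event(item):
    try:
        TABLE.put_item(Item=item)
    except ClientError as exc:
        if exc.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
            raise
        # Over capacity while auto-scaling catches up: hand the item back to the
        # digestion pipeline, which will retry it a few minutes later.
        sqs.send_message(
            QueueUrl=RETRY_QUEUE_URL,
            MessageBody=json.dumps(item),
            DelaySeconds=300,  # SQS supports up to 900 seconds of delivery delay
        )
```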

Blocking attacks in real-time with stream processing

Up to this point, most of what was required of the backend was handling reports coming from different agents. The agents detect and block attacks at runtime directly inside the application without back and forth with our backend, and then send reports to the backend afterward. These reports are normally consumed by humans (our customers) that want to be aware of attacks and proactively fix the vulnerabilities we identify in their production code. If we were to somehow have a small digestion lag, this wouldn't be much of a problem. However, by analyzing security metadata across our installed agent base, we found that we could go one step further to help our customers secure their application. In addition to blocking requests triggering vulnerabilities, we could also detect additional attacks based on the data collected inside the app and block attackers altogether.

Agents themselves can’t detect these attacks. Each agent only sees a segment of activity (generally, one single application process). A standard production environment is composed of dozens to thousands of these processes. To correctly detect such large-scale attacks, we had to merge the reports from all of the agents running for an application. Fortunately, we were already receiving these reports, pushing them to a queue, and digesting them independently. In order to detect large-scale attack activity, we "simply" had to work on all the associated reports together. As it turns out, there is an easy way to handle this: stream processing.

There were two main solutions for stream processing available at the time: Kafka or AWS Kinesis. Kafka is the industry standard for stream processing. It's very robust and very feature-complete. It's also known to be difficult to configure and operate, and there were no managed Kafka options available at the time. By comparison, AWS Kinesis is a managed system. It had all the features we needed, although the consumer library (the Kinesis Client Library, or KCL) is only directly available in Java. To consume a stream from other languages, one can either write their own consumer library or use a small Java bridge called the MultilangDaemon. The MultilangDaemon is less efficient than consuming Kinesis directly but is much faster to set up, so we started experimenting with stream processing using a mix of Kinesis, the KCL through the MultilangDaemon, and Python for our digestion logic.
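For reference, a Python record processor driven by the MultilangDaemon has roughly the shape below. This is a pared-down sketch modeled on the amazon_kclpy sample application (v2 interface); treat the exact method and attribute names as assumptions, and note that checkpointing cadence, error handling, and our actual detection logic are omitted.

```python
import json

from amazon_kclpy import kcl
from amazon_kclpy.v2 import processor


class AttackReportProcessor(processor.RecordProcessorBase):
    """Consumes one shard of the attack-report stream (detection logic elided)."""

    def initialize(self, initialize_input):
        pass

    def process_records(self, process_records_input):
        for record in process_records_input.records:
            report = json.loads(record.binary_data)
            # ...accumulate the report per application / per actor here...
        # Tell the daemon how far this worker has read.
        process_records_input.checkpointer.checkpoint()

    def lease_lost(self, lease_lost_input):
        pass  # another worker took over this shard

    def shard_ended(self, shard_ended_input):
        shard_ended_input.checkpointer.checkpoint()

    def shutdown_requested(self, shutdown_requested_input):
        shutdown_requested_input.checkpointer.checkpoint()


if __name__ == "__main__":
    # The MultilangDaemon launches this script and talks to it over stdin/stdout.
    kcl.KCLProcess(AttackReportProcessor()).run()
```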

The final piece of software works on all the attack reports sent for each application and detects large-scale attacks using simple accumulation algorithms. Once a large-scale attack is detected, we extract the major actors and emit security responses back to our agents. These security responses ban or redirect the offending actor for some period of time. The system was actually so easy to use that we made it available directly to our customers via a new part of the product that we called Playbooks. Playbooks explore a series of related events that satisfy a set of user-defined preconditions and respond as the user dictates. For example, our customers can easily block users that abuse critical features, which would indicate a business-logic attack.
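The accumulation itself can stay very simple. A toy version of the idea, where the window, threshold, and response format are made up for illustration:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # illustrative sliding window
THRESHOLD = 100       # illustrative: reports from one actor, across all agents, in the window


class LargeScaleAttackDetector:
    """Counts security reports per source IP over a sliding window."""

    def __init__(self):
        self.events_by_ip = defaultdict(deque)  # ip -> timestamps of recent reports

    def ingest(self, report):
        now = time.time()
        bucket = self.events_by_ip[report["ip"]]
        bucket.append(now)
        # Drop timestamps that fell out of the window.
        while bucket and bucket[0] < now - WINDOW_SECONDS:
            bucket.popleft()
        if len(bucket) >= THRESHOLD:
            return self.security_response(report["ip"])
        return None

    def security_response(self, ip):
        # In production this is emitted back to the agents, which ban or
        # redirect the offending actor for some period of time.
        return {"action": "block_ip", "ip": ip, "duration_seconds": 3600}
```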

The guiding principles of our current scalable architecture

Today, we have an architecture designed for scale, and we follow a few guiding principles to make sure that we continue to build scalable architecture in the future.

Cache when possible

We leverage caching where possible to handle large volumes of requests on our backend. Our security rules do not change very often in comparison to the number of requests sent to our servers, so we’re largely able to cache them. We’re able to slim this down further since most customers don’t need the full set of rules and don’t hot (re)load packages.

We use memcached for our caching needs. Memcached is an in-memory server that uses a fixed pool of memory to store any data you send it in a key–value manner. It is extremely fast — capable of performing ~200k reads per second¹ when well tuned, so it is much faster than calculating the ruleset to send each time.
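A hedged sketch of the get-or-compute pattern this enables, using pymemcache; the key scheme and the ruleset-computation function are illustrative, not our actual code:

```python
import json

from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))


def rules_for(app_id, agent_version):
    """Return the (possibly trimmed) ruleset for one agent, caching the result."""
    key = f"rules:{app_id}:{agent_version}"  # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    rules = compute_ruleset(app_id, agent_version)  # hypothetical expensive path
    cache.set(key, json.dumps(rules), expire=300)   # rules change rarely; a short TTL is plenty
    return rules
```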

Do as few writes as possible by using queues

For most of our data, we don’t need to perform a write immediately. We only need to ensure that it will be written eventually. We used the classic solution for this by implementing queues. When the part of the backend that talks with the agent receives a piece of data, it checks that the data looks OK and then reposts it to a queue. Another part of the backend, which we call digestion, gets the data from the queue, reads it, and does the actual database insertion. We can then independently scale the part of the backend that responds to the agents and the part that is responsible for writing the data to the database.
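In boto3 terms, the two halves look roughly like this; the queue URL, the validation check, and the database writer are placeholders for illustration:

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/agent-reports"  # hypothetical


# Front: the endpoint that talks to agents only validates and re-posts.
def handle_agent_report(payload):
    if not looks_valid(payload):  # hypothetical sanity check
        return
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))


# Back: a digestion worker drains the queue and does the real database writes.
def digestion_loop():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for message in resp.get("Messages", []):
            write_to_database(json.loads(message["Body"]))  # hypothetical writer
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```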

When we chose our queue system, we explored a few options and narrowed them down to Redis or an AWS core service called SQS. Redis is in-memory, which is risky for production, while SQS is cloud-only and can’t be run locally. So we decided to split the difference: we use Redis for local development and SQS for production. This dual setup means we get a simple queue in local mode and a much more scalable queue system when deploying to staging and production.
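Conceptually the switch is just an environment-driven factory. A minimal, producer-side sketch (the environment flag, queue names, and packages assumed are ours for illustration); the consumer side mirrors it:

```python
import json
import os


def make_report_push():
    """Hypothetical factory: Redis-backed queue locally, SQS-backed in production."""
    if os.environ.get("APP_ENV") == "production":  # illustrative env flag
        import boto3
        sqs = boto3.client("sqs")
        url = os.environ["REPORT_QUEUE_URL"]
        return lambda item: sqs.send_message(QueueUrl=url, MessageBody=json.dumps(item))
    # Local development: a plain Redis list is enough.
    import redis
    r = redis.Redis()
    return lambda item: r.rpush("agent-reports", json.dumps(item))
```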

Be super nimble to deploy

Since day one at Sqreen, we’ve focused on making deployments easy, so we can do them multiple times a day, without any perceivable impact to our customers. We use automated testing for this. We leverage Jenkins for some testing, and every commit pushed to GitHub in a PR has to pass through our test suite. Our test runners all publish coverage results that we collect and analyze using Codecov.

To facilitate deploying to production, we decided early on that what we are actually interested in is deploying services, not servers. As such, we chose to encapsulate our application code in Docker containers, with Amazon ECS as the orchestration layer. Our containers are built by Jenkins. Once all the tests pass, Jenkins builds the new Docker image, uploads it to our private registry, and updates our staging environment, directly deploying this new image. At deploy time, the exact same image gets deployed to production, in true 12-factor app fashion -- only the environment differs in production. If we ever need to scale up or down, ECS just needs to launch more or fewer of these images.
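The deploy step itself boils down to pointing the ECS service at the freshly built image. A hedged boto3 sketch, where the task family, container settings, and cluster naming are placeholders rather than our actual configuration:

```python
import boto3

ecs = boto3.client("ecs")


def deploy(image_uri, environment):
    """Register a new task definition revision for `image_uri` and roll the service to it."""
    task_def = ecs.register_task_definition(
        family="backend-api",  # hypothetical task family
        containerDefinitions=[{
            "name": "backend-api",
            "image": image_uri,  # the same image for staging and production
            "memory": 512,
            "essential": True,
            # 12-factor: only the environment differs between deployments.
            "environment": [{"name": "APP_ENV", "value": environment}],
        }],
    )
    ecs.update_service(
        cluster=f"backend-{environment}",  # hypothetical cluster naming
        service="backend-api",
        taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
    )
```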

Sqreen’s scaling continues

At Sqreen, we’re scaling quickly, both as a company and with our infrastructure. Does creating a transparent security solution that security teams and developers both love sound interesting to you? For more of these challenges and much more, come join us in building the future of security at Sqreen!

Notes

¹ https://github.com/memcached/memcached/wiki/Performance#handles-extremely-high-load
