How Sqreen handles 50,000 requests every minute in a write-heavy environment

3,292
Sqreen
Sqreen is a developer-friendly security platform for web applications, API and micro-services

Sqreen is the industry’s first provider of Application Security Management (ASM), a unified platform for security, operations, and engineering teams to help them succeed in building and running secure applications together, without code changes, reduced engineering velocity, or added operational burden. To handle the needs of modern application security, Sqreen uses an agent-based solution that deals with a huge volume of writes.

Backend-related challenges can generally be split into two large categories with distinct kinds of solutions: Read–heavy and write–heavy. Read-heavy productions are much more common and generally easy to deal with. Write-heavy productions are much less common, and are more resource-intensive.

At Sqreen, our customers add an agent into their application’s code, using a package/gem/jar, which is started in multiple servers. For each process, an agent network session is started with Sqreen’s backend servers. This connection enables the agent to fetch security rules and keep them updated (a read problem), and send security events back for further analysis after they have been detected (a write problem). Usually our agent connects once, reads the specific rules, and then begins a long life in which they send us events (occasionally) and security usage metrics (a few times per minute). This means that the Sqreen backend architecture mostly has to deal with a write problem. We currently handle over 50,000 requests per minute, 95% of which are writes.

So how did we get to a stage where we could handle this sort of write-heavy load?

The early days: switching from MongoDB to DynamoDB

The very first PoC of Sqreen was created using Meteor. Meteor’s promise is to make the backend/frontend boundaries disappear. To achieve this, Meteor relies heavily on MongoDB.

The Meteor architecture makes it extremely easy to create early prototypes. Unfortunately, this system, while able to synchronize plenty of data between the database and browser, didn’t scale very well for us. We quickly outgrew Meteor, and created a dedicated React dashboard that got data from a Flask–powered REST API.

On the agent side, saving data in Mongo worked very well. In production, Mongo was deployed in a replica set that was spread across our AWS availability zone. Mongo also embeds a native way to expire data, which we used for many parts of our data model. This enabled us to keep plenty of fresh data live and easy to access, while also removing all unneeded data that got stale. In doing so, we keep the overall data size of Mongo under control.

Some data, though, cannot be made to expire. Every security event reported and user monitoring event keeps a value for a very long time. While security events are only generated upon attacks, and so have a relatively small volume, user monitoring events (for Sqreen’s Account Takeover protection) tend to accumulate quickly.

As people started to really use the user monitoring feature, we predicted that our simple first solution (i.e., simply writing them in MongoDB) would not be viable for long. We needed to split this data off to another system. The biggest criterion we had for this other system was that it should scale easily. Our new solution had to be ready to implement and stable, as we couldn’t afford to dedicate full-time personnel to it just yet.

We took a survey of available databases and became interested in Cassandra. Cassandra scales up very well by pushing the developer to think directly about how data will be sharded. However, while the database engine itself is quite stable, it mainly scales by adding more servers to the cluster, which places a big burden on operations. We would need a full–time person to handle the complexities of operating our own Cassandra cluster, which was a no-go at the time.

We then set out to find alternatives. We could, for example, deploy Cassandra as a managed service by some other provider or, it turns out, use AWS directly. Cassandra was actually directly inspired by a paper published by AWS that tried to solve the same problem some years before. The last descendant from that research is available as a managed service on AWS, called DynamoDB.

DynamoDB is a fully managed service that stores objects in a very similar manner to Cassandra, but in a less complex manner. DynamoDB fit our needs much better, so we made the change.

The actual code change from using MongoDB exclusively to using DynamoDB for our user monitoring data was not trivial. First, as the two databases are very different, the data model had to be reshaped. When working with Mongo, people generally work like they would with a SQL database, identifying entities to work on and persisting them. In DynamoDB, one needs to think in terms of the actual answers needed. From these, the queries can be induced and thus the shape of the model as well.

Also, DynamoDB auto–scales based on actual uses of the throughput. This scaling has some delays and any requests that happen to be over capacity will result in an exception being thrown into the application. Read requests can and will generally be retried -- the worst case scenario being that a customer refreshes the offending page on our dashboard -- but writes have to be handled a bit more carefully.

To ensure that all writes are always performed, we had to add a recovery system on failed writes. This system reuses our digestion system; it will push the offending object into an SQS queue to be digested (i.e., written) at a later date. We simply retry every few minutes until DynamoDB has had enough time to scale up. For these reasons and others, we progressively rolled DynamoDB usage as a principal storage solution across our customers. We made heavy use of feature flagging and experiment systems.

Blocking attacks in real-time with stream processing

Up to this point, most of what was required of the backend was handling reports coming from different agents. The agents detect and block attacks at runtime directly inside the application without back and forth with our backend, and then send reports to the backend afterward. These reports are normally consumed by humans (our customers) that want to be aware of attacks and proactively fix the vulnerabilities we identify in their production code. If we were to somehow have a small digestion lag, this wouldn't be much of a problem. However, by analyzing security metadata across our installed agent base, we found that we could go one step further to help our customers secure their application. In addition to blocking requests triggering vulnerabilities, we could also detect additional attacks based on the data collected inside the app and block attackers altogether.

Agents themselves can’t detect these attacks. Each agent only sees a segment of activity (generally, one single application process). A standard production environment is composed of dozens to thousands of these processes. To correctly detect such large-scale attacks, we had to merge the reports from all of the agents running for an application. Fortunately, we were already receiving these reports, pushing them to a queue, and digesting them independently. In order to detect large-scale attack activity, we "simply" had to work on all the associated reports together. As it turns out, there is an easy way to handle this: stream processing.

There were two main solutions for stream processing available at the time: Kafka or AWS Kinesis. Kafka is the industry standard for stream processing. It's very robust and very feature–complete. It's also known to be difficult to configure and operate, and there were no managed Kafka options available at the time. By comparison, AWS Kinesis is a managed system. It had all the necessary features we needed, although the consumer libraries (called KCL) are only directly available in Java. To use the consuming capability from other languages, one can either write their own consuming libraries or use a small Java bridge called the MultilangDaemon. The MultilangDaemon is less efficient than consuming Kinesis directly but is much faster to set up, so we started experimenting with stream processing using a mix of Kinesis, KCL through the MultilangDaemon, and Python for our digestion logic.

The final piece of software works on all attacks sent by each application and can then detect attacks by using simple accumulation algorithms. Once a large-scale attack is detected, we extract the major actors and emit security responses back to our agent. Said security responses will ban or redirect the offending actor for some period of time. The system was actually so easy to use that we made it available directly to our customers via a new part of the product that we called Playbooks. Playbooks explore a series of related events that satisfy a set of user-defined preconditions and respond as the user dictates. For example, our customers could easily block users that abuse critical features, which would indicate a business-logic attack.

The guiding principles of our current scalable architecture

Today, we have an architecture designed for scale, and we follow a few guiding principles to make sure that we continue to build scalable architecture in the future.

Cache when possible

We leverage caching where possible to handle large volumes of requests on our backend. Our security rules do not change very often in comparison to the number of requests sent to our servers, so we’re largely able to cache them. We’re able to slim this down further since most customers don’t need the full set of rules and don’t hot (re)load packages.

We use memcached for our caching needs. Memcached is an in-memory server that uses a fixed pool of memory to store any data you send it in a key–value manner. It is extremely fast — capable of performing ~200k reads per second¹ when well tuned, so it is much faster than calculating the ruleset to send each time.

Do as few writes as possible by using queues

For most of our data, we don’t need to perform a write immediately. We only need to ensure that it will be written eventually. We used the classic solution for this by implementing queues. When the part of the backend that talks with the agent receives a piece of data, we check that it looks OK and then repost it to a queue. Another part of the backend, that we are calling digestion, gets the data from the queue, reads it, and does the actual database insertion. We can then independently scale the part of the backend that responds to the agent and the part that is responsible for writing the data to the database.

When we chose our queue system, we explored a few options, and came down to Redis or an AWS core service called SQS. Redis is in-memory, which is risky for production, but SQS is limited to cloud-only. So we decided to split the difference. For local development, we use Redis, and we use SQS for production. Our dual setup here means that we can use a simple DB in local mode and a much more scalable queue system when deploying to staging and production.

Be super nimble to deploy

Since day one at Sqreen, we’ve focused on making deployments easy, so we can do them multiple times a day, without any perceivable impact to our customers. We use automated testing for this. We leverage Jenkins for some testing, and every commit pushed to GitHub in a PR has to pass through our test suite. Our test runners all publish coverage results that we collect and analyze using Codecov.

To facilitate deploying to production, we decided early on that what we are actually interested in is deploying services, not servers. As such, we chose to encapsulate our application code in Docker containers, with Amazon ECS as the orchestration layer. Our containers are actually built by Jenkins. Once the tests are all passed, Jenkins builds the new Docker image, uploads it to our private repository in ECS, and updates our staging, directly deploying this new image. At deploy time, the exact same image gets deployed to production, in true 12-factor app fashion -- only the environment will differ in production. If we ever need to scale up or down, ECS just needs to launch more or fewer of these images.

Sqreen’s scaling continues

At Sqreen, we’re scaling quickly, both as a company and with our infrastructure. Does creating a transparent security solution that security teams and developers both love sounds interesting to you? For more of these challenges and so much more, come join us in building the future of security at Sqreen!

Notes

Âą https://github.com/memcached/memcached/wiki/Performance#handles-extremely-high-load

unsplash-logoBanner Background Image by Taylor Vick

Sqreen
Sqreen is a developer-friendly security platform for web applications, API and micro-services
Tools mentioned in article
Open jobs at Sqreen
Back End Engineer
Paris
Who we are Sqreen is the industry’s first provider of Application Security Management (ASM), unifying application security needs into one single platform, giving over 600 companies unprecedented visibility and protection in production. Sqreen enables developers, operations and security teams to scale their security without impacting engineering velocity. The company was founded by security veterans who previously led the offensive security team at Apple. Sqreen is backed by Greylock Partners, Y Combinator, Alven and Point Nine. For more information, please visit www.sqreen.com. What we need your help with As a key part of our product engineering team, you'll be building the back end of Sqreen. Today, our platform analyses billions of security events and metrics sent by our agents, detecting threats and allowing our customers' applications to react in near real time. Day-to-day you'll: > Work with security, back-end and agent teams to architect, build, and launch new features > Maintain and improve our production infrastructure > Stay close to our users and our developer community via regular conferences and meetups > Contribute to and improve our engineering standards, tooling, and processes > Ensure the availability, latency, performance and efficiency of our platform Who you are We'd love to hear from you if you have: > Excellent user empathy > Great communication skills > High level of emotional intelligence (EQ) > A conversational level of English or above (it's our working language) > Some experience developing, maintaining and running back end applications - whilst not an entry level role, it would suit someone looking for a change from their initial software engineering position > The desire to work on a Python codebase - most of Sqreen's backend team started working with Python at Sqreen, switching from Ruby, C#, Java and other languages > A product mindset and enjoy working with people outside your area of expertise - every engineer at Sqreen collaborates daily with Sales, Support and other teams Working under our Director of Engineering, Beth, this role also comes with the opportunity for 1:1 technical mentoring with Sqreen's Principal Engineer, Benoit. Sqreeners come from all corners of the globe and all walks of life. Diversity is important to us, so if you're unsure, please apply!
C Agent Engineer
Paris
Who we are Sqreen is the industry’s first provider of Application Security Management (ASM), unifying application security needs into one single platform, giving over 600 companies unprecedented visibility and protection in production. Sqreen enables developers, operations and security teams to scale their security without impacting engineering velocity. The company was founded by security veterans who previously led the offensive security team at Apple. Sqreen is backed by Greylock Partners, Y Combinator, Alven and Point Nine. For more information, please visit www.sqreen.com. What we need your help with As part of the Paris-based product engineering team you'll be in charge of designing and implementing one of the Sqreen agents. Our technology protects web applications at run-time through dynamic instrumentation. Work on server modules performing dynamic instrumentation, application security protection, and run-time monitoring to protect our customers' applications. Your challenge? Make our customers' applications as safe as possible without compromising on performance. Day to day you'll: > Own, build, test and release a Sqreen agent > Design and implement key features, protocols and protections running within our customers' applications > Use modern development methods and tools to put your code in production quickly > Work with backend teams to enhance data flow, performance and reliability > Produce technical blog posts and document the software you ship > Attend conferences and meetups to share your work > Forecast the language and runtime evolutions to ensure the Sqreen agent has a first in class support > Follow the language ecosystem evolutions to stay close to new frameworks, popular deployment mechanisms etc. Who you are We'd love to hear from you if you have: > A strong interest in leading implementation of a mission critical piece of software > The ability to design custom protocols, services and frameworks > Strong experience with C programming > Experience with C++ and Python is a bonus! > A high consideration of reliability, low-latency, redundancy, and security > Excellent communication skills (we work in English) > A product mindset and enjoy working with people outside your area of expertise - every engineer at Sqreen collaborates daily with Sales, Support and other teams Sqreeners come from all corners of the globe and all walks of life. Diversity is important to us, so if you're unsure, please apply!
Engineering Manager
Paris
Who we are Sqreen is the industry’s first provider of Application Security Management (ASM), unifying application security needs into one single platform, giving over 600 companies unprecedented visibility and protection in production. Sqreen enables developers, operations and security teams to scale their security without impacting engineering velocity. The company was founded by security veterans who previously led the offensive security team at Apple. Sqreen is backed by Greylock Partners, Y Combinator, Alven and Point Nine. For more information, please visit www.sqreen.com. What we need your help with As a key part of our product engineering team, you'll be in charge of the direct management of 6-8 software engineers. Reporting to Beth, our talented Director of Engineering, you'll work closely with our CTO and Head of Product to deliver our ambitious 2019 roadmap and ensure our product engineers have everything they need to succeed and grow. This role will suit someone who defines success as enabling others to succeed, and who sees management as being at the service of a team, not the other way around. Day to day you'll: > Run 1:1 meetings with each member of your team > Set goals aligned with company objectives > Give feedback and act as a coach and mentor to your team > Ensure team health, knowledge sharing and alignment across the broader Sqreen team > Add value to technical discussions without acting as the decision-maker > Enable your team to deliver timely product engineering solutions to meet user needs > Actively contribute to building a healthy, collaborative modern engineering and organisational culture > Stay close to our developer community and seek and contribute best practises via regular conferences and meetups Who you are At Sqreen, we understand that management is not engineering, so we'd love to hear from you if you have: > Low ego and high emotional intelligence (EQ) > Excellent communication skills, capable of adapting your style to different people and situations > A love for growing and developing people and a track record of doing it > Knowledge and passion for software development > Fluent spoken and written English (our working language) > Experience working in a high-growth early-stage startup (bonus) > A product mindset and enjoy working with people outside your area of expertise - every Sqreener collaborates daily with Engineering, Sales, Support and other teams Sqreeners come from all corners of the globe and all walks of life. Diversity is important to us, so if you're unsure, please apply!
Front End Engineer
Paris
Who we are Sqreen is the industry’s first provider of Application Security Management (ASM), unifying application security needs into one single platform, giving over 600 companies unprecedented visibility and protection in production. Sqreen enables developers, operations and security teams to scale their security without impacting engineering velocity. The company was founded by security veterans who previously led the offensive security team at Apple. Sqreen is backed by Greylock Partners, Y Combinator, Alven and Point Nine. For more information, please visit www.sqreen.com. What we need your help with As part of the product engineering team you'll be responsible for developing the Sqreen dashboard. What is the dashboard, you ask? It's the essence of Sqreen itself - where complex, previously hidden-away security information is made clear and actionable for our customers. Put another way, it's a critical product for Sqreen, and the proverbial tip of the iceberg. Customers tell us 'before Sqreen, I was blind', and you'll make that happen. Check it out for yourself: Sqreen dashboard Day-to-day you'll: > Work with our world-class product, back-end and marketing teams to build features Sqreen users love > Build and test robust, well structured, fast and reusable components > Identify and solve performance issues > Participate in the project's design and code reviews (Github) Who you are We'd love to hear from you if you have: > Excellent user empathy > Great communication skills  > High level of emotional intelligence (EQ) > A conversational level of English or above (it's our working language) > Some experience building modern web applications - whilst not an entry level role, it would suit someone looking for a change from their initial software engineering position > Experience with JavaScript, ES6 and React > Experience working with HTTP REST APIs > Experience with Redux, Redux-saga, Recompose, Immutable, Reselect or other react libraries a plus > A product mindset and enjoy working with people outside your area of expertise - every engineer at Sqreen collaborates daily with Sales, Support and other teams. Sqreeners come from all corners of the globe and all walks of life. Diversity is important to us, so if you're unsure, please apply!
Verified by
Principal Engineer
Co-founder & CEO
You may also like
Rethinking Front-end Error Reporting
WebSockets vs Long Polling
Server-side rendering: how to serve authenticated content
Building Realtime Apps in 2019 with PubNub