How Codacy Analyzes 30 Billion Lines of Code Per Day

Editor's Note: Jaime Jorge is co-founder and CEO at Codacy.

Codacy helps dev teams of all sizes automate code quality checks by identifying issues through static code analysis, both in the cloud and on-premises. The product notifies users about security issues, code coverage, code duplication, and code complexity in every commit and pull request, directly from their current workflow. We sat down with Jaime to learn more about the technology behind Codacy's automated code review platform.

StackShare: Why did you and your other co-founder create Codacy?

Jaime Jorge: Since we are both developers, we started the company because we wanted to help developers focus on software development instead of just fixing code. I was researching this topic for my master's thesis (working with telcos in Europe) to understand technical debt (in terms of code duplication), and Joao (my co-founder) was leading tech teams in the financial industry in the UK. What brought us together was the mission of helping as many developers and companies as we could to ship better code and increase their productivity.

Founded in 2012, Codacy now employs 40 people (more than half of whom are technical) between our offices in Lisbon and NYC.

StackShare: Out of the 28 supported languages, which one do you see used the most on your platform?

JJ: The usage distribution of our supported programming languages follows what you’d expect to see looking at indexes and rankings like the one from TIOBE. The most used language in Codacy is JavaScript. This is a result of a strong clustering of web development use cases. We then see Java, Python, Ruby and a few others close behind.

StackShare: It’s amazing how small your team is yet you support so many different languages.

JJ: When we started Codacy, we only supported Scala (on which our product is built). Following requests from new users over time, we started adding support for additional languages. We understood that modern development does not rely on one programming language alone, and modern tech stacks most often combine many different languages. This forced us to create a platform that would make it easy for us to add new programming languages and also to update their support. We also let our users bring their own language support by exposing our integration mechanism.

StackShare: How do you use Codacy to build Codacy?

JJ: Our team uses Codacy every day, primarily to maintain the same criteria of development (formatting, coverage, best practices) across the different dev squads. There are features we use more often than others, which mirrors what we see from our customers.

StackShare: Which features does your team use most often?

JJ: Some team members like to use the dashboards to keep track of the main quality metrics, while others like the build status we provide to make sure we’re within the defined criteria. The whole team uses the auto-comment feature, which helps our teams stay in touch.

StackShare: What platforms do you integrate with?

JJ: Our most popular integrations are with GitHub, GitLab, Bitbucket, CircleCI, Jenkins, and Slack, although we support many others.

StackShare: How does Codacy provide notifications for security issues?

JJ: As part of our code analysis, we provide security notifications via the tools we integrate with.

StackShare: Tell us about your secure development practices.

JJ: We develop following security best practices and frameworks (OWASP Top 10, SANS Top 25). Our developers participate in regular security training to learn about common vulnerabilities and threats, and we review our code for security vulnerabilities. We also regularly update our dependencies and make sure none of them has known vulnerabilities.

Our teams use Static Application Security Testing (SAST) to detect basic security vulnerabilities in our codebase, and Dynamic Application Security Testing (DAST) to scan our applications.

StackShare: What’s the biggest mistake new developers make when setting up an automated code review system?

JJ: Incorrect or incomplete configuration.

StackShare: How many automated code reviews do you process daily?

JJ: We pull about 8TB per day. Assuming 1 byte per character and 256 characters per line, that comes to roughly 3*10^10 lines (about 30 billion). Interestingly, this is about 40% of the text content in the Library of Congress (according to Wolfram Alpha).
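
The estimate in this answer can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: it reads "8TB" as decimal terabytes (an assumption, since the interview does not say), and the constant and function names are chosen here for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the interview.
BYTES_PER_DAY = 8 * 10**12   # 8 TB, read as decimal terabytes (assumption)
BYTES_PER_CHAR = 1           # assumption stated in the interview
CHARS_PER_LINE = 256         # assumption stated in the interview


def estimated_lines_per_day(bytes_per_day=BYTES_PER_DAY,
                            bytes_per_char=BYTES_PER_CHAR,
                            chars_per_line=CHARS_PER_LINE):
    """Return the approximate number of lines pulled per day."""
    return bytes_per_day // (bytes_per_char * chars_per_line)


lines = estimated_lines_per_day()
print(f"{lines:.3e} lines per day")  # 3.125e+10 lines per day
```

With these inputs the result is 31.25 billion lines, which matches the "about 30 billion" figure. Reading 8TB as binary terabytes instead would give a somewhat higher number of the same order of magnitude.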

StackShare: How do you store all of that data?

JJ: All of our services run in the cloud on AWS. We don’t host or run our own routers, load balancers, DNS servers, or physical servers.

StackShare: What AWS services do you use specifically for getting that data processed, indexed and stored?

JJ: Data is processed using EC2 instances. We currently run our applications using Docker on Elastic Beanstalk, but we are transitioning to EKS. The data is stored on RDS, where we use both Aurora and Postgres. Although the volume of data we pull to analyze is 8TB per day, the analysis results (which are what we actually store) are significantly smaller. You don’t need the code verbatim for every source file; you just store the issues and where in the file you found them. We then leverage AWS to scale elastically (e.g. the number of active analysis servers) with the current load.
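
To illustrate why stored results are so much smaller than the analyzed code, here is a minimal sketch of an issue record that keeps only a location and a description. It is not Codacy's actual schema: the `Issue` class, its field names, the sample file path, and the pattern identifier are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical record: where a problem was found, not the code itself."""
    path: str        # file in which the issue was found
    line: int        # line number within that file
    pattern_id: str  # identifier of the rule that matched (made up here)
    message: str     # human-readable description


def serialize_issues(issues):
    """Serialize issue records to JSON, the only thing that would be stored."""
    return json.dumps([asdict(issue) for issue in issues])


# A stand-in for a large source file with a single finding.
source = "def f(x):\n    eval(x)\n" * 1000
issues = [Issue("app/handlers.py", 2, "security.eval",
                "Avoid eval on untrusted input")]
stored = serialize_issues(issues)

print(f"source: {len(source)} characters, stored result: {len(stored)} characters")
```

The stored payload scales with the number of findings rather than with the size of the repository, which is the point made in the answer above.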

StackShare: Does this process still involve Scala or another language?

JJ: Our applications are all implemented in Scala. They do all the heavy lifting regarding data processing/indexing.

StackShare: How long do you retain that data?

JJ: The repositories are cloned, analyzed and then deleted.
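
The clone, analyze, delete lifecycle described here can be sketched with a temporary directory that is removed as soon as the analysis finishes. This is only an illustration of the pattern, not Codacy's implementation: `analyze_and_discard`, `fake_fetch`, and `count_lines` are hypothetical names, and the fetch step is a stand-in for a real repository clone.

```python
import tempfile
from pathlib import Path


def analyze_and_discard(fetch, analyze):
    """Fetch code into a temporary directory, analyze it, then delete it.

    `fetch` populates the directory (a stand-in for cloning a repository);
    `analyze` returns the results that are kept after the code is gone.
    """
    with tempfile.TemporaryDirectory() as workdir:
        root = Path(workdir)
        fetch(root)
        results = analyze(root)
    # The directory and everything in it has been removed at this point.
    return results


seen = []  # remembers the working directory so its removal can be checked


def fake_fetch(root):
    """Stand-in for a clone: writes a single small source file."""
    seen.append(root)
    (root / "main.py").write_text("print('hi')\n")


def count_lines(root):
    """Toy analysis: count the lines across all Python files."""
    return sum(len(p.read_text().splitlines()) for p in root.rglob("*.py"))


result = analyze_and_discard(fake_fetch, count_lines)
print(result, seen[0].exists())
```

Only the value returned by the analysis step outlives the call; the fetched files do not.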
