Airflow logo
A platform to programmatically author, schedule, and monitor data pipelines, by Airbnb

What is Airflow?

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
Airflow is a tool in the Workflow Manager category of a tech stack.
Airflow is an open source tool with 13.5K GitHub stars and 5K GitHub forks; its source is hosted on GitHub.
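
To make the "workflows as DAGs" idea concrete, here is a minimal, hypothetical pipeline definition; the DAG id, task ids, schedule, and 1.x-era import paths are illustrative assumptions, not taken from this page:

  from datetime import datetime, timedelta

  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator

  default_args = {
      "owner": "airflow",
      "retries": 1,
      "retry_delay": timedelta(minutes=5),
  }

  # The DAG object ties tasks together and carries the schedule.
  dag = DAG(
      "example_pipeline",
      default_args=default_args,
      start_date=datetime(2019, 1, 1),
      schedule_interval="@daily",
  )

  extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
  load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

  # The >> operator declares the dependency edge: load runs only after extract succeeds.
  extract >> load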

Who uses Airflow?

Companies
98 companies reportedly use Airflow in their tech stacks, including Airbnb, Slack, and 9GAG.

Developers
161 developers on StackShare have stated that they use Airflow.

Why do developers like Airflow?

Here’s a list of reasons why companies and developers use Airflow.
Airflow Reviews

Here are some stack decisions, common use cases and reviews by companies and developers who chose Airflow in their tech stack.

StackShare Editors
Tools mentioned: Flask, AWS EC2, Celery, Datadog, PagerDuty, Airflow, StatsD, Grafana

Data science and engineering teams at Lyft maintain several big data pipelines that serve as the foundation for various types of analysis throughout the business.

Apache Airflow sits at the center of this big data infrastructure, allowing users to “programmatically author, schedule, and monitor data pipelines.” Airflow is an open source tool, and “Lyft is the very first Airflow adopter in production since the project was open sourced around three years ago.”

There are several key components of the architecture. A web UI allows users to view the status of their queries, along with an audit trail of any modifications to the query. A metadata database stores things like job status and task instance status. A multi-process scheduler handles job requests and triggers the executor to execute those tasks.

Airflow supports several executors, though Lyft uses the CeleryExecutor to scale task execution in production. Airflow is deployed to three Amazon Auto Scaling Groups, each associated with a Celery queue.
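
Lyft's exact settings aren't published here, but switching to the CeleryExecutor is a configuration change; a minimal, hypothetical airflow.cfg sketch (the broker and backend URLs are placeholders):

  [core]
  executor = CeleryExecutor

  [celery]
  # Message broker the scheduler and workers communicate through.
  broker_url = redis://redis-host:6379/0
  # Where Celery stores task results; typically the metadata database.
  result_backend = db+postgresql://user:pass@db-host/airflow

Workers can then subscribe to a specific queue (airflow worker -q <queue_name> with the 1.x CLI), which is how an Auto Scaling Group maps to a Celery queue.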

Audit logs surfaced in the web UI are powered by the existing Airflow audit logs as well as Flask signals.

Datadog, StatsD, Grafana, and PagerDuty are all used to monitor the Airflow system.
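
Airflow can emit its metrics to a StatsD daemon, which is the usual hand-off point to tools like Datadog and Grafana; a hedged sketch using the Airflow 1.x config keys (the host, port, and prefix are placeholders, and these keys moved to a [metrics] section in Airflow 2.x):

  [scheduler]
  statsd_on = True
  statsd_host = localhost
  statsd_port = 8125
  statsd_prefix = airflow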

StackShare Editors
Tools mentioned: Apache Thrift, Kotlin, Presto, HHVM (HipHop Virtual Machine), gRPC, Kubernetes, Apache Spark, Airflow, Terraform, Hadoop, Swift, Hack, Memcached, Consul, Chef, Prometheus

Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

Apps
  • Web: a mix of JavaScript/ES6 and React.
  • Desktop: Electron, used to ship the web app as a desktop application.
  • Android: a mix of Java and Kotlin.
  • iOS: written in a mix of Objective-C and Swift.
Backend
  • The core application and the API are written in PHP/Hack and run on HHVM.
  • The data is stored in MySQL using Vitess.
  • Caching is done using Memcached and MCRouter.
  • The search service is built on SolrCloud, with various Java services.
  • The messaging system uses WebSockets with many services in Java and Go.
  • Load balancing is done using HAproxy with Consul for configuration.
  • Most services talk to each other over gRPC; some use Thrift or JSON-over-HTTP.
  • The voice and video calling service was built in Elixir.
Data warehouse
  • Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.

I use Airflow because it's the gold standard for scheduling batch data jobs. It comes with a bit of a learning curve, given the extensive UI and the work of setting up different connectors. However, it has a lot of great retry features, and the visual DAGs help a lot with troubleshooting.
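
The retry features the reviewer mentions are per-task settings; a small, hypothetical sketch (the DAG, task id, callable, and values are illustrative):

  from datetime import datetime, timedelta

  from airflow import DAG
  from airflow.operators.python_operator import PythonOperator

  dag = DAG("retry_demo", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

  flaky_task = PythonOperator(
      task_id="flaky_task",
      python_callable=lambda: None,       # placeholder for real work
      retries=3,                          # re-run up to 3 times on failure
      retry_delay=timedelta(minutes=10),  # wait between attempts
      retry_exponential_backoff=True,     # stretch the wait on each retry
      dag=dag,
  )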


Airflow's Features

  • Dynamic: Airflow pipelines are configuration as code (Python), so you can write code that instantiates pipelines dynamically (see the sketch after this list).
  • Extensible: Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
  • Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
  • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
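
A hedged sketch of the "Dynamic" and "Elegant" points together: a plain Python loop instantiates one task per table, and the command is a Jinja template that Airflow renders per run (the DAG id, table names, and load_table.sh script are illustrative assumptions):

  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator

  dag = DAG("dynamic_demo", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

  # One load task per table, generated in a loop instead of written by hand.
  for table in ["users", "orders", "events"]:
      BashOperator(
          task_id="load_" + table,
          # {{ ds }} is a built-in template variable: the execution date (YYYY-MM-DD).
          bash_command="load_table.sh " + table + " {{ ds }}",
          dag=dag,
      )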

Airflow Alternatives & Comparisons

What are some alternatives to Airflow?
Luigi
It is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, and more. It also comes with Hadoop support built in.
Apache NiFi
An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
Apache Beam
It implements batch and streaming data processing jobs that run on any execution engine. It executes pipelines on multiple execution environments.
Apache Oozie
It is a server-based workflow scheduling system to manage Hadoop jobs. Workflows in it are defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow as well as a mechanism to control the workflow execution path.
Camunda
It is an open source platform for workflow and decision automation that brings business users and software developers together.

Airflow's Stats

Airflow's Followers
215 developers follow Airflow to keep up with related blogs and decisions.