
Airflow

A platform to programmatically author, schedule and monitor data pipelines, by Airbnb

What is Airflow?

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
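As a rough illustration of what "following the specified dependencies" means, the sketch below orders a small task graph so that each task comes after everything it depends on. It uses Python's standard-library graphlib rather than Airflow's own API, and the task names are hypothetical.

```python
import graphlib

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies.

    Raises graphlib.CycleError if the graph contains a cycle, which is why
    the workflow has to be acyclic.
    """
    return list(graphlib.TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    # "extract" comes first and "load" last; "clean" and "enrich" may appear
    # in either order because neither depends on the other.
    print(execution_order(dependencies))
```

A real scheduler would also run independent tasks such as "clean" and "enrich" in parallel on separate workers; this sketch only shows the ordering constraint.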
Airflow is a tool in the Workflow Manager category of a tech stack.
Airflow is an open source tool with 17.3K GitHub stars and 6.7K GitHub forks.

Who uses Airflow?

Companies
168 companies reportedly use Airflow in their tech stacks, including Airbnb, Slack, and Robinhood.

Developers
554 developers on StackShare have stated that they use Airflow.

Private Decisions about Airflow

Here are some stack decisions, common use cases and reviews by members with Airflow in their tech stack.

Christopher Davison
DevOps Engineer at Soulmates.ai · 1 upvote · 0 views
Shared insights on Airflow

Used for scheduling ETL jobs.

Eugene Ivanchenko
Software engineer at NCBI · 1 upvote · 0 views
Shared insights on Airflow

Manage the calculation pipeline and data distribution procedures.

Shared insights on Airflow

I use Airflow because it's the gold standard for scheduling batch data jobs. It comes with a bit of a learning curve, given the extensive UI and the work of getting familiar with different connectors. However, it has a lot of great retry features, and the visual DAGs help a lot with troubleshooting.

Shared insights on Jenkins and Airflow

I am looking for an open-source scheduler tool with cross-functional application dependencies. Some of the tasks I am looking to schedule are as follows:

  1. Trigger Matillion ETL loads
  2. Trigger Attunity Replication tasks that have downstream ETL loads
  3. Trigger GoldenGate replication tasks
  4. Shell scripts, wrappers, file watchers
  5. Event-driven schedules

I have used Airflow in the past, and I know we need to create DAGs for each pipeline. I am not familiar with Jenkins, but I know it works with configuration without much underlying code. I want to evaluate both and would appreciate any advice.

sunilsy08
Software Developer at WedMeGood · 1 upvote · 10.8K views
Shared insights on Python, Airflow, and Node.js

I need to implement a Node.js cron scheduler like Airflow. Is it possible to implement it without working in Python? Until now, all my jobs have run only on my server, via an internal script that calls other job scripts. Is there an alternative or better way to implement this?


I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?

For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?

Public Decisions about Airflow

Here are some stack decisions, common use cases and reviews by companies and developers who chose Airflow in their tech stack.

Data science and engineering teams at Lyft maintain several big data pipelines that serve as the foundation for various types of analysis throughout the business.

Apache Airflow sits at the center of this big data infrastructure, allowing users to “programmatically author, schedule, and monitor data pipelines.” Airflow is an open source tool, and “Lyft is the very first Airflow adopter in production since the project was open sourced around three years ago.”

There are several key components of the architecture. A web UI allows users to view the status of their queries, along with an audit trail of any modifications to the query. A metadata database stores things like job status and task instance status. A multi-process scheduler handles job requests and triggers the executor to execute those tasks.
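The interplay between the scheduler, the metadata store, and the executor can be sketched roughly as follows. This is a simplified, single-process illustration and not Airflow's or Lyft's actual implementation; the class names, function names, and state strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """Stand-in for the metadata database: maps task name to its state."""
    states: dict = field(default_factory=dict)

    def set_state(self, task, state):
        self.states[task] = state

    def get_state(self, task):
        return self.states.get(task, "none")


class InlineExecutor:
    """Hypothetical executor that runs a task callable in the current process."""

    def execute(self, name, fn, store):
        store.set_state(name, "running")
        try:
            fn()
            store.set_state(name, "success")
        except Exception:
            store.set_state(name, "failed")


def schedule_once(tasks, upstream, store, executor):
    """One scheduler pass: trigger every unstarted task whose upstreams succeeded."""
    ready = [
        name
        for name in tasks
        if store.get_state(name) == "none"
        and all(store.get_state(u) == "success" for u in upstream.get(name, ()))
    ]
    for name in ready:
        executor.execute(name, tasks[name], store)
    return ready


if __name__ == "__main__":
    store = MetadataStore()
    tasks = {"extract": lambda: None, "load": lambda: None}
    upstream = {"load": ["extract"]}
    executor = InlineExecutor()
    print(schedule_once(tasks, upstream, store, executor))  # first pass
    print(schedule_once(tasks, upstream, store, executor))  # second pass
    print(store.states)
```

Because readiness is computed before anything runs, a downstream task is only picked up on the pass after its upstream has succeeded, and the states a web UI would display all come from the store.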

Airflow supports several executors, though Lyft uses CeleryExecutor to scale task execution in production. Airflow is deployed to three Amazon Auto Scaling Groups, each associated with a Celery queue.
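The idea of binding each worker group to its own queue can be illustrated with a small in-memory router. This is only a sketch of the routing concept and uses neither Celery nor Airflow; the queue names are hypothetical, since the source says only that there are three queues.

```python
from collections import deque

# Hypothetical queue names, one per worker group.
QUEUES = ("default", "etl", "reporting")


class QueueRouter:
    """Keeps one FIFO queue per worker group and routes tasks by queue name."""

    def __init__(self, queue_names):
        self.queues = {name: deque() for name in queue_names}

    def submit(self, task, queue="default"):
        """Enqueue a task on the named queue; unknown queue names are rejected."""
        if queue not in self.queues:
            raise ValueError(f"unknown queue: {queue}")
        self.queues[queue].append(task)

    def next_for(self, queue):
        """Return the next task for the worker group bound to `queue`, or None."""
        pending = self.queues[queue]
        return pending.popleft() if pending else None


if __name__ == "__main__":
    router = QueueRouter(QUEUES)
    router.submit("load_orders", queue="etl")
    router.submit("daily_summary", queue="reporting")
    print(router.next_for("etl"))
    print(router.next_for("default"))
```

Workers in one group only ever see tasks submitted to their own queue, so each group can be scaled independently of the others.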

Audit logs supplied to the web UI are powered by the existing Airflow audit logs as well as Flask signals.

Datadog, Statsd, Grafana, and PagerDuty are all used to monitor the Airflow system.


Since the beginning, Cal Henderson has been the CTO of Slack. Earlier this year, he commented on a Quora question summarizing their current stack.

Apps
  • Web: a mix of JavaScript/ES6 and React.
  • Desktop: Electron, to ship it as a desktop application.
  • Android: a mix of Java and Kotlin.
  • iOS: written in a mix of Objective-C and Swift.
Backend
  • The core application and the API written in PHP/Hack that runs on HHVM.
  • The data is stored in MySQL using Vitess.
  • Caching is done using Memcached and MCRouter.
  • The search service takes help from SolrCloud, with various Java services.
  • The messaging system uses WebSockets with many services in Java and Go.
  • Load balancing is done using HAProxy with Consul for configuration.
  • Most services talk to each other over gRPC, with some Thrift and JSON-over-HTTP.
  • Voice and video calling service was built in Elixir.
Data warehouse
  • Built using open source tools including Presto, Spark, Airflow, Hadoop and Kafka.
Etc

Airflow's Features

  • Dynamic: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
  • Extensible: Easily define your own operators and executors, and extend the library so that it fits the level of abstraction that suits your environment.
  • Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
  • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
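The "Dynamic" point above can be sketched in plain Python: because pipelines are defined in code, a loop over a list can produce one pipeline per item. The example below is only an illustration of that idea and does not use Airflow's API; the table names, pipeline names, and the build_pipeline helper are hypothetical.

```python
# Hypothetical list of source tables; in practice this might come from a
# config file or a database query.
TABLES = ["users", "orders", "payments"]


def build_pipeline(table):
    """Return a task-to-dependencies mapping for one table.

    Each pipeline is a three-step chain: extract, then transform, then load.
    """
    extract = f"extract_{table}"
    transform = f"transform_{table}"
    load = f"load_{table}"
    return {extract: set(), transform: {extract}, load: {transform}}


# One pipeline definition per table, generated in a loop rather than
# written out by hand.
pipelines = {f"sync_{table}": build_pipeline(table) for table in TABLES}

if __name__ == "__main__":
    for name, tasks in pipelines.items():
        print(name, sorted(tasks))
```

Adding a table to the list yields a new pipeline without any further code changes, which is the practical benefit of treating configuration as code.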

Airflow Alternatives & Comparisons

What are some alternatives to Airflow?
Luigi
It is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Apache NiFi
An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
Jenkins
In a nutshell, Jenkins CI is the leading open-source continuous integration server. Built with Java, it provides over 300 plugins to support building and testing virtually any project.
AWS Step Functions
AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly.
Pachyderm
Pachyderm is an open source MapReduce engine that uses Docker containers for distributed computations.

Airflow's Followers
1098 developers follow Airflow to keep up with related blogs and decisions.
Harpreet Singh
Stijn Zanders
zmaaq946776
Kostas Katrinis
Akshay Deshpande
Nona Janssen Walls
Vladimir Rüntü
Heather H
khanh chau
Praveena Mallarkandy