Airflow vs Apache Spark: What are the differences?

What is Airflow? A platform to programmatically author, schedule, and monitor data pipelines, by Airbnb. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
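
To make that concrete, here is a minimal sketch of such a DAG in Python (Airflow 2.x-style imports; the DAG id, task ids, and shell commands are invented for illustration):

    # A minimal, hypothetical Airflow DAG: two tasks with an explicit dependency.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_pipeline",            # invented name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")

        # The scheduler runs `load` only after `extract` has succeeded.
        extract >> load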

What is Apache Spark? A fast and general engine for large-scale data processing. Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning.
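
For comparison, here is an equally minimal PySpark sketch of a batch job (the HDFS path and column name are invented for illustration):

    # A minimal, hypothetical PySpark batch job.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read a dataset from HDFS (invented path) and run a simple aggregation.
    df = spark.read.json("hdfs:///data/events.json")
    df.groupBy("event_type").count().show()

    spark.stop()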

Airflow can be classified as a tool in the "Workflow Manager" category, while Apache Spark is grouped under "Big Data Tools".

Some of the features offered by Airflow are:

  • Dynamic: Airflow pipelines are defined as code (Python), allowing for dynamic pipeline generation. This makes it possible to write code that instantiates pipelines dynamically (see the sketch after this list).
  • Extensible: Easily define your own operators, executors and extend the library so that it fits the level of abstraction that suits your environment.
  • Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
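
To make the "Dynamic" and "Elegant" points concrete, the sketch below (with invented table names) generates one task per table in a plain Python loop, and uses Airflow's built-in {{ ds }} Jinja macro for the run date:

    # A hypothetical dynamically generated DAG: one task per table.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="dynamic_example",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        for table in ["users", "orders", "payments"]:  # invented table names
            BashOperator(
                task_id=f"export_{table}",
                # {{ ds }} is Airflow's built-in Jinja macro for the run date.
                bash_command=f"echo exporting {table} for {{{{ ds }}}}",
            )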

On the other hand, Apache Spark provides the following key features:

  • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
  • Write applications quickly in Java, Scala or Python
  • Combine SQL, streaming, and complex analytics
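
The last point, combining SQL with other analytics, is easiest to see in code. This sketch (with invented rows and column names) registers a DataFrame as a temporary view and queries the same data through both SQL and the DataFrame API:

    # A minimal sketch: mixing Spark SQL and the DataFrame API in one job.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql_example").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],  # invented rows
        ["name", "age"],
    )
    df.createOrReplaceTempView("people")

    # The same data can be queried with SQL...
    spark.sql("SELECT name FROM people WHERE age > 30").show()
    # ...or with the DataFrame API, interchangeably.
    df.filter(df.age > 30).select("name").show()

    spark.stop()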

Airflow and Apache Spark are both open source tools. Apache Spark, with 22.5K stars and 19.4K forks on GitHub, appears to have more adoption than Airflow, which has 12.9K stars and 4.71K forks.

According to the StackShare community, Apache Spark has broader approval, being mentioned in 266 company stacks and 112 developer stacks, compared to Airflow, which is listed in 72 company stacks and 33 developer stacks.

Pros of Airflow (upvote counts in parentheses)

  • Features (39)
  • Task Dependency Management (12)
  • Beautiful UI (11)
  • Cluster of workers (9)
  • Extensibility (9)
  • Open source (5)
  • Python (4)
  • Complex workflows (3)
  • K (2)
  • Dashboard (2)
  • Custom operators (2)
  • Good API (1)
  • Apache project (1)

Pros of Apache Spark

  • Open-source (58)
  • Fast and flexible (47)
  • One platform for every big data problem (7)
  • Easy to install and use (6)
  • Great for distributed SQL-like applications (6)
  • Works well for most data science use cases (3)
  • Machine learning library, streaming in real time (2)
  • In-memory computation (2)
  • Interactive query (0)


Cons of Airflow

  No cons have been listed yet.

Cons of Apache Spark

  • Speed (2)




    What are some alternatives to Airflow and Apache Spark?
    Luigi
It is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in. (A minimal task sketch appears after this list.)
    Apache NiFi
An easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
    Jenkins
In a nutshell, Jenkins CI is the leading open-source continuous integration server. Built with Java, it provides over 300 plugins to support building and testing virtually any project.
    AWS Step Functions
    AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly.
    Pachyderm
    Pachyderm is an open source MapReduce engine that uses Docker containers for distributed computations.
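
As a rough sketch of the Luigi model mentioned in the list above (all class, file, and target names here are invented), each task declares its dependencies via requires() and Luigi resolves them before running:

    # A minimal, hypothetical Luigi pipeline: Load depends on Extract.
    import luigi

    class Extract(luigi.Task):
        def output(self):
            return luigi.LocalTarget("extract.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("raw data")

    class Load(luigi.Task):
        def requires(self):
            # Luigi resolves this dependency and runs Extract first.
            return Extract()

        def output(self):
            return luigi.LocalTarget("load.txt")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(src.read().upper())

    if __name__ == "__main__":
        luigi.build([Load()], local_scheduler=True)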