
Alternatives to AWS Data Pipeline

AWS Glue, Airflow, AWS Step Functions, Apache NiFi, and AWS Batch are the most popular alternatives and competitors to AWS Data Pipeline.

What is AWS Data Pipeline and what are its top alternatives?

AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.
AWS Data Pipeline is a tool in the Data Transfer category of a tech stack.
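
For a concrete sense of the model, here is a rough boto3 sketch of defining and activating a pipeline like the hourly example above. It is illustrative only: the IDs, field values, and S3 paths are hypothetical, and a real definition also needs IAM roles, an EMR cluster configuration, and a log location.

```python
# Hypothetical sketch: names, IDs, and field values are made up.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

created = dp.create_pipeline(name="hourly-log-analysis", uniqueId="hourly-log-analysis-v1")
pipeline_id = created["pipelineId"]

# A Data Pipeline definition is a list of objects, each described by key/value "fields".
pipeline_objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "HourlySchedule"},
    ]},
    {"id": "HourlySchedule", "name": "HourlySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 hour"},
        {"key": "startDateTime", "stringValue": "2021-01-01T00:00:00"},
    ]},
    {"id": "AnalysisCluster", "name": "AnalysisCluster", "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
    ]},
    {"id": "HourlyLogAnalysis", "name": "HourlyLogAnalysis", "fields": [
        {"key": "type", "stringValue": "EmrActivity"},
        {"key": "runsOn", "refValue": "AnalysisCluster"},
        {"key": "step", "stringValue": "s3://example-bucket/steps/analyze-logs.jar"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```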

Top Alternatives to AWS Data Pipeline

  • AWS Glue

    A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

  • Airflow

    Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

  • AWS Step Functions

    AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly.

  • Apache NiFi

    An easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

  • AWS Batch

    It enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (e.g., CPU- or memory-optimized instances) based on the volume and specific resource requirements of the submitted batch jobs.

  • Azure Data Factory

    It is a service designed to allow developers to integrate disparate data sources. It is a platform, somewhat like SSIS in the cloud, for managing the data you have both on-premises and in the cloud.

  • Embulk

    It is an open-source bulk data loader that helps transfer data between various databases, storage systems, file formats, and cloud services.

  • Google BigQuery Data Transfer Service

    BigQuery Data Transfer Service lets you focus your efforts on analyzing your data. You can set up a data transfer with a few clicks. Your analytics team can lay the foundation for a data warehouse without writing a single line of code.

AWS Data Pipeline alternatives & related posts


AWS Glue

Fully managed extract, transform, and load (ETL) service
PROS OF AWS GLUE
  • Managed Hive Metastore

    related AWS Glue posts

Pardha Saradhi, Technical Lead at Incred Financial Solutions · 6 upvotes · 32.2K views

    Hi,

We are currently storing the data in Amazon S3 in the Apache Parquet format. We use Presto to query the data from S3 and catalog it with the AWS Glue catalog. We have Metabase sitting on top of Presto, where our reports live. Presto is becoming too costly for us, and we are looking for alternatives to it while keeping as much of the remaining setup (S3, Metabase) as possible. Please suggest alternative approaches.

Punith Ganadinni, Senior Product Engineer · 2 upvotes · 24.5K views

Hey all, I need some suggestions on creating a replica of our RDS DB for reporting and analytical purposes. Cost is a major factor. I was thinking of using AWS Glue to move data from Amazon RDS to Amazon S3 and then using Amazon Athena to run queries on it. Any other suggestions would be appreciated.
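
A minimal boto3 sketch of that Glue-to-S3-then-Athena idea, assuming a Glue ETL job, Glue database, and results bucket already exist; the job name, database, table, and S3 paths below are hypothetical.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Kick off a (pre-created) Glue ETL job that copies the RDS tables to S3 as Parquet.
run = glue.start_job_run(JobName="rds-to-s3-parquet")
print("Glue job run id:", run["JobRunId"])

# Once the data and its Glue catalog table exist, Athena can query it serverlessly.
query = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
    QueryExecutionContext={"Database": "reporting"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Athena query execution id:", query["QueryExecutionId"])
```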


    Airflow

A platform to programmatically author, schedule, and monitor data pipelines, by Airbnb
PROS OF AIRFLOW
  • Features
  • Task dependency management
  • Beautiful UI
  • Cluster of workers
  • Extensibility
  • Open source
  • Python
  • Complex workflows
  • K
  • Custom operators
  • Good API
  • Dashboard
  • Apache project
CONS OF AIRFLOW
  • Open source: provides minimal or no support
  • Logical separation of DAGs is not straightforward
  • Running it on a Kubernetes cluster is relatively complex
  • Observability is not great when the DAGs exceed 250

    related Airflow posts

Shared insights on Jenkins and Airflow

I am looking for an open-source scheduler tool with cross-functional application dependencies. Some of the tasks I am looking to schedule are as follows:

1. Trigger Matillion ETL loads
2. Trigger Attunity replication tasks that have downstream ETL loads
3. Trigger GoldenGate replication tasks
4. Shell scripts, wrappers, file watchers
5. Event-driven schedules

I have used Airflow in the past, and I know we need to create DAGs for each pipeline. I am not familiar with Jenkins, but I know it works through configuration without much underlying code. I want to evaluate both and would appreciate any advice.
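
For reference, an Airflow pipeline for this kind of schedule is just a small Python file. The sketch below is a minimal, hypothetical DAG with two of the tasks above wired as a dependency; the script paths, task names, and cron schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_replication_and_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",  # run at 02:00 every day
    catchup=False,
) as dag:
    # Trailing space stops Airflow from treating the .sh path as a Jinja template file.
    trigger_replication = BashOperator(
        task_id="trigger_replication",
        bash_command="/opt/jobs/run_replication.sh ",
    )
    trigger_etl_load = BashOperator(
        task_id="trigger_etl_load",
        bash_command="/opt/jobs/run_matillion_load.sh ",
    )

    # The downstream ETL load runs only after replication succeeds.
    trigger_replication >> trigger_etl_load
```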

Shared insights on AWS Step Functions and Airflow

    I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.

    I already have software patterns from other projects for Airflow + Batch but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Another option would be to have one task that kicks off the 10k containers and monitors it from there.

    I have no experience with AWS Step Functions but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Do Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
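
One way to model the "one task kicks off the 10k containers" option, from either Airflow or Step Functions, is an AWS Batch array job. A rough boto3 sketch, with a hypothetical job queue, job definition, and input prefix:

```python
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="divvied-data-processing",
    jobQueue="my-batch-queue",          # hypothetical queue name
    jobDefinition="process-shard:1",    # hypothetical job definition
    arrayProperties={"size": 10000},    # one child job per data shard
    containerOverrides={
        "environment": [
            {"name": "INPUT_PREFIX", "value": "s3://example-bucket/divvied/"},
        ]
    },
)
print("Submitted array job:", response["jobId"])
# Each child job receives AWS_BATCH_JOB_ARRAY_INDEX (0..9999) and can pick its
# shard from the input prefix; failed indexes can be retried individually.
```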


    AWS Step Functions

Build Distributed Applications Using Visual Workflows
PROS OF AWS STEP FUNCTIONS
  • Integration with other services
  • Pricing
  • Easily accessible via the AWS Console
  • Complex workflows
  • Scalability
  • High availability
  • Workflow processing

      related AWS Step Functions posts

See the AWS Step Functions and Airflow question under the related Airflow posts above.

      Apache NiFi

A reliable system to process and distribute data
PROS OF APACHE NIFI
  • Visual data flows using directed acyclic graphs (DAGs)
  • Free (open source)
  • Simple to use
  • Reactive with back-pressure
  • Scalable horizontally as well as vertically
  • Fast prototyping
  • Bi-directional channels
  • Data provenance
  • Built-in graphical user interface
  • End-to-end security between all nodes
  • Can handle messages up to gigabytes in size
  • HBase support
  • Kudu support
  • Hive support
  • Slack integration
  • Support for custom processors in Java
  • Lots of articles
  • Lots of documentation
CONS OF APACHE NIFI
  • HA support is not fully fledged
  • Memory-intensive

      related Apache NiFi posts

      I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?

      For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?
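
If Airflow is chosen, the chain described above maps naturally onto a three-task DAG. The sketch below stubs each step with a PythonOperator; a real setup would call the Informatica, database, and Jira APIs (or their provider operators) inside the callables, and all names here are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def trigger_informatica_job(**context):
    print("call the Informatica REST API here")


def run_sql_check(**context):
    print("run the dependent SQL task here")


def update_jira(**context):
    print("create or transition the Jira issue here")


with DAG(
    dag_id="regression_test_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually for regression runs
    catchup=False,
) as dag:
    informatica = PythonOperator(task_id="informatica_etl", python_callable=trigger_informatica_job)
    sql_task = PythonOperator(task_id="sql_validation", python_callable=run_sql_check)
    jira_task = PythonOperator(task_id="jira_update", python_callable=update_jira)

    # Each task runs only after its upstream dependency succeeds.
    informatica >> sql_task >> jira_task
```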


      AWS Batch

Fully Managed Batch Processing at Any Scale
PROS OF AWS BATCH
  • Scalable
  • Containerized
CONS OF AWS BATCH
  • More overhead than Lambda
  • Image management

      related AWS Batch posts

Sumit Singh Chauhan, Data Scientist at Entropik · 6 upvotes · 5K views

I have started using AWS Batch for some long ML inference jobs. So far it's working well and giving decent performance. Since it is fully managed, it saves a lot of extra work as well. But Batch takes a good amount of time to create a new cluster and then load the job based on the priority of the queue. Going forward, I would love to put effort into something that is fast to start and gives more flexibility as well. What other tools would you suggest for long-running backend jobs that can scale well? I am not looking for something fully managed, so ignore the Batch-like options in Google Cloud Platform or Microsoft Azure; I am looking for open-source alternatives here. Do you think Kubernetes or RabbitMQ/Kafka would be a good fit, or just overkill, for my problem? Usually we get thousands of requests in parallel, and each job might take 20-30 minutes on a 2 vCPU system.


      Azure Data Factory

Hybrid data integration service that simplifies ETL at scale

          related Azure Data Factory posts

Vamshi Krishna, Data Engineer at Tata Consultancy Services · 4 upvotes · 103.9K views

I have to collect data from multiple sources and store it in a single cloud location, then perform cleaning and transformation using PySpark and push the end results to other applications, such as reporting tools. What would be the best solution? I can only think of Azure Data Factory + Databricks. Are there any alternatives to #AWS services + Databricks?
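
Whichever service lands the raw files in one location, the PySpark cleaning-and-transformation step itself looks roughly like the sketch below; the paths, column names, and output layout are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean_and_transform").getOrCreate()

# Raw files landed in one cloud location by the ingestion tool (ADF, Glue, etc.).
raw = spark.read.option("header", True).csv("s3a://example-landing-zone/raw/")

cleaned = (
    raw.dropDuplicates()
       .na.drop(subset=["customer_id"])
       .withColumn("amount", F.col("amount").cast("double"))
)

# Aggregate for the reporting layer and write back in a columnar format.
summary = cleaned.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
summary.write.mode("overwrite").parquet("s3a://example-curated-zone/customer_totals/")
```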


          Embulk

Bulk data loader that helps data transfer between various databases



              Google BigQuery Data Transfer Service

Automate data movement from SaaS applications to Google BigQuery on a scheduled, managed basis
