Alternatives to AWS Data Pipeline

AWS Glue, Airflow, AWS Step Functions, Apache NiFi, and AWS Batch are the most popular alternatives and competitors to AWS Data Pipeline.

What is AWS Data Pipeline and what are its top alternatives?

AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.
AWS Data Pipeline is a tool in the Data Transfer category of a tech stack.
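To make the model concrete, here is a minimal sketch of creating and activating such a pipeline with boto3. The pipeline name, schedule, and EMR step below are illustrative placeholders rather than a production-ready definition:

```python
import boto3  # assumes AWS credentials and region are already configured

dp = boto3.client("datapipeline")

# Create an empty pipeline shell; uniqueId guards against duplicate creation.
pipeline_id = dp.create_pipeline(
    name="hourly-log-analysis",
    uniqueId="hourly-log-analysis-v1",
)["pipelineId"]

# Attach a definition: an hourly schedule plus an EMR activity (the "business logic").
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "HourlySchedule"},
        ]},
        {"id": "HourlySchedule", "name": "HourlySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 Hour"},
            {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
        ]},
        {"id": "EmrAnalysis", "name": "EmrAnalysis", "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "AnalysisCluster"},
            # Placeholder EMR step: jar path and argument are hypothetical.
            {"key": "step", "stringValue": "s3://my-bucket/steps/analyze-logs.jar,hourly"},
        ]},
        {"id": "AnalysisCluster", "name": "AnalysisCluster", "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
        ]},
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)  # start running on the schedule
```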

Top Alternatives to AWS Data Pipeline

  • AWS Glue

    A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. ...

  • Airflow

    Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. ...

  • AWS Step Functions

    AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly. ...

  • Apache NiFi

    An easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. ...

  • AWS Batch

    It enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. ...

  • Azure Data Factory

    It is a service designed to allow developers to integrate disparate data sources. It is a platform somewhat like SSIS in the cloud to manage the data you have both on-prem and in the cloud. ...

  • Postman

    It is the only complete API development environment, used by nearly five million developers and more than 100,000 companies worldwide. ...

AWS Data Pipeline alternatives & related posts

AWS Glue

Fully managed extract, transform, and load (ETL) service
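As a rough illustration of the "fully managed" angle, a Glue crawler can catalog raw S3 data with a couple of boto3 calls. The bucket path, IAM role, and database name below are placeholders:

```python
import boto3  # assumes AWS credentials and a pre-existing IAM role for Glue

glue = boto3.client("glue")

# Crawl an S3 prefix and register the discovered tables in the Glue Data Catalog.
glue.create_crawler(
    Name="s3-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/logs/"}]},
)
glue.start_crawler(Name="s3-logs-crawler")

# The resulting catalog tables can then feed Glue ETL jobs, Athena, EMR, or Presto.
```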
PROS OF AWS GLUE
  • Managed Hive Metastore (9 upvotes)
CONS OF AWS GLUE
  None listed yet.

    related AWS Glue posts

    Will Dataflow be the right replacement for AWS Glue? Are there any unforeseen gaps, such as certain proprietary transformations, the connector ecosystem, or data quality and data cleansing features not being supported in Google Cloud Dataflow?

    Also, how about Google Cloud Data Fusion as a replacement, in terms of no code/low code? (Since basic use cases in Glue are supported through a UI, CDF may be the right choice in that case.)

    What would be the best choice?

    Pardha Saradhi
    Technical Lead at Incred Financial Solutions · 6 upvotes · 109.2K views

    Hi,

    We are currently storing the data in Amazon S3 using Apache Parquet format. We are using Presto to query the data from S3 and catalog it using AWS Glue catalog. We have Metabase sitting on top of Presto, where our reports are present. Currently, Presto is becoming too costly for us, and we are looking for alternatives for it but want to use the remaining setup (S3, Metabase) as much as possible. Please suggest alternative approaches.

Airflow

A platform to programmatically author, schedule and monitor data pipelines, by Airbnb
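For a sense of what "authoring workflows as DAGs" looks like in practice, here is a minimal, hypothetical hourly DAG in Airflow 2.x style; the task bodies and names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull this hour's logs from S3")  # placeholder work


def load():
    print("load aggregated results into the warehouse")  # placeholder work


with DAG(
    dag_id="hourly_log_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # the scheduler enforces this dependency across workers
```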
    PROS OF AIRFLOW
  • Features (53 upvotes)
  • Task Dependency Management (14 upvotes)
  • Beautiful UI (12 upvotes)
  • Cluster of workers (12 upvotes)
  • Extensibility (10 upvotes)
  • Open source (6 upvotes)
  • Complex workflows (5 upvotes)
  • Python (5 upvotes)
  • Good api (3 upvotes)
  • Apache project (3 upvotes)
  • Custom operators (3 upvotes)
  • Dashboard (2 upvotes)
    CONS OF AIRFLOW
  • Observability is not great when the DAGs exceed 250 (2 upvotes)
  • Running it on a Kubernetes cluster is relatively complex (2 upvotes)
  • Open source - provides minimal or no support (2 upvotes)
  • Logical separation of DAGs is not straightforward (1 upvote)

    related Airflow posts

    Data science and engineering teams at Lyft maintain several big data pipelines that serve as the foundation for various types of analysis throughout the business.

    Apache Airflow sits at the center of this big data infrastructure, allowing users to “programmatically author, schedule, and monitor data pipelines.” Airflow is an open source tool, and “Lyft is the very first Airflow adopter in production since the project was open sourced around three years ago.”

    There are several key components of the architecture. A web UI allows users to view the status of their queries, along with an audit trail of any modifications to the query. A metadata database stores things like job status and task instance status. A multi-process scheduler handles job requests and triggers the executor to execute those tasks.

    Airflow supports several executors, though Lyft uses CeleryExecutor to scale task execution in production. Airflow is deployed to three Amazon Auto Scaling Groups, with each associated with a celery queue.

    Audit logs supplied to the web UI are powered by the existing Airflow audit logs as well as Flask signal.

    Datadog, Statsd, Grafana, and PagerDuty are all used to monitor the Airflow system.


    We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow. Or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.

AWS Step Functions

Build Distributed Applications Using Visual Workflows
    PROS OF AWS STEP FUNCTIONS
  • Integration with other services (7 upvotes)
  • Easily Accessible via AWS Console (5 upvotes)
  • Complex workflows (5 upvotes)
  • Pricing (5 upvotes)
  • Scalability (3 upvotes)
  • Workflow Processing (3 upvotes)
  • High Availability (3 upvotes)
    CONS OF AWS STEP FUNCTIONS
  None listed yet.

      related AWS Step Functions posts

Shared insights on AWS Step Functions and Airflow

      I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.

      I already have software patterns from other projects for Airflow + Batch but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Another option would be to have one task that kicks off the 10k containers and monitors it from there.

      I have no experience with AWS Step Functions but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Do Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
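For context on that pattern, a Step Functions workflow is an Amazon States Language (JSON) document, and it can call AWS Batch directly through the built-in batch:submitJob.sync integration, including per-state retries. A minimal sketch via boto3 follows; the account ID, role, job queue, and job definition ARNs are placeholders:

```python
import json

import boto3  # assumes AWS credentials and an existing Step Functions execution role

# One Task state that submits a Batch job, waits for it, and retries twice on failure.
definition = {
    "Comment": "Submit one AWS Batch job with automatic retry",
    "StartAt": "ProcessChunk",
    "States": {
        "ProcessChunk": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "process-chunk",
                "JobQueue": "arn:aws:batch:us-east-1:123456789012:job-queue/default",
                "JobDefinition": "arn:aws:batch:us-east-1:123456789012:job-definition/process-chunk:1",
            },
            "Retry": [
                {"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2, "BackoffRate": 2.0}
            ],
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
machine = sfn.create_state_machine(
    name="batch-with-retry-example",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsBatchRole",  # placeholder
)
sfn.start_execution(stateMachineArn=machine["stateMachineArn"])
```

Failed executions and the retry history of each state are visible in the Step Functions console, which is the closest analogue to inspecting and retrying failed tasks in Airflow.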

      Matheus Moreira
      Backend Engineer at IntuitiveCare · 5 upvotes · 249.3K views
Shared insights on AWS Step Functions and Airflow

      We have some lambdas we need to orchestrate to get our workflow going. In the past, we already attempted to use Airflow as the orchestrator, but the need to coordinate the tasks in a database generates an overhead that we cannot afford. For our use case, there are hundreds of inputs per minute and we need to scale to support all the inputs and have an efficient way to analyze them later. The ideal product would be AWS Step Functions, since it can manage our load demand gracefully, but it is too expensive and we cannot afford that. So, I would like to get alternatives for an orchestrator that does not need a complex backend, can manage hundreds of inputs per minute, and is not too expensive.

Apache NiFi

A reliable system to process and distribute data
      PROS OF APACHE NIFI
  • Visual Data Flows using Directed Acyclic Graphs (DAGs) (17 upvotes)
  • Free (Open Source) (8 upvotes)
  • Simple-to-use (7 upvotes)
  • Scalable horizontally as well as vertically (5 upvotes)
  • Reactive with back-pressure (5 upvotes)
  • Fast prototyping (4 upvotes)
  • Bi-directional channels (3 upvotes)
  • End-to-end security between all nodes (3 upvotes)
  • Built-in graphical user interface (2 upvotes)
  • Can handle messages up to gigabytes in size (2 upvotes)
  • Data provenance (2 upvotes)
  • Lots of documentation (1 upvote)
  • Hbase support (1 upvote)
  • Support for custom Processor in Java (1 upvote)
  • Hive support (1 upvote)
  • Kudu support (1 upvote)
  • Slack integration (1 upvote)
  • Lot of articles (1 upvote)
      CONS OF APACHE NIFI
  • HA support is not fully fledged (2 upvotes)
  • Memory-intensive (2 upvotes)

      related Apache NiFi posts

      John Calandra
      Data Manager at The Garrett Group · 8 upvotes · 368.8K views

      There is a question coming... I am using Oracle VirtualBox to spawn 3 Ubuntu Linux virtual machines (VMs). VM1 is being used as a data lake - just a place to store flat files. VM2 hosts Apache NiFi. VM3 hosts PostgreSQL. I have built a NiFi pipeline that reads flat files on VM1 and then pipes the data over to and inserts it into the PostgreSQL database.

      I left this setup alone for a while, and then something hiccupped on VM3, and I had to rebuild it. Now I cannot make a remote connection to PostgreSQL on VM3. I was using pgAdmin3 on VM3, but it kept throwing errors - I found out it went end-of-life in 2018 and uninstalled it. pgAdmin4 is out, but for some reason, I cannot get the APT utility to find/install it.

      I am trying to figure out the pgAdmin4 install problem and looking for a good alternative to pgAdmin4 that I can use to diagnose the remote database connection problem. Does anyone have any suggestions? Thanks in advance.


      I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?

      For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?

AWS Batch

Fully Managed Batch Processing at Any Scale
      PROS OF AWS BATCH
  • Containerized (3 upvotes)
  • Scalable (3 upvotes)
      CONS OF AWS BATCH
  • More overhead than lambda (3 upvotes)
  • Image management (1 upvote)
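To show the basic interaction model, submitting and polling a job from Python looks roughly like the sketch below; the queue and job definition names are placeholders and must already exist in your account:

```python
import boto3  # assumes AWS credentials plus an existing job queue and job definition

batch = boto3.client("batch")

# Submit one containerized job; Batch provisions the compute needed to run it.
response = batch.submit_job(
    jobName="process-chunk-0001",
    jobQueue="default-queue",           # placeholder queue name
    jobDefinition="process-chunk:1",    # placeholder job definition
    containerOverrides={"command": ["python", "process.py", "--chunk", "0001"]},
)
job_id = response["jobId"]

# Poll the job as it moves through SUBMITTED -> RUNNABLE -> RUNNING -> SUCCEEDED/FAILED.
status = batch.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
print(job_id, status)
```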

      related AWS Batch posts

Azure Data Factory

Hybrid data integration service that simplifies ETL at scale
PROS OF AZURE DATA FACTORY
  None listed yet.
CONS OF AZURE DATA FACTORY
  None listed yet.

          related Azure Data Factory posts

          Trying to establish a data lake (or maybe puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

          1. Ingestion->Secure, role-based, self service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
          2. Storage->Amazon S3 seems like the cheapest. We probably won't need very big, even at full capacity. Our current storage is a secure Box folder that has ~4GB with several batches of test data, code, presentations, and planning docs.
          3. Data Catalog-> AWS Glue? Azure Data Factory? Snowplow? is the main difference basically based on the vendor? We also will have Data Dictionaries/Codebooks from submitters. Where would they fit in?
          4. Partitions-> I've seen Cassandra and YARN mentioned, but have no experience with either
          5. Processing-> We want to use SAS if at all possible. What will work with SAS code?
          6. Pipeline/Automation->The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice
          7. I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
          8. An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and self-service gui would be preferable. I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!

          We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We were debating Trifacta and Airflow. Or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.

Postman

Only complete API development environment
          PROS OF POSTMAN
  • Easy to use (490 upvotes)
  • Great tool (369 upvotes)
  • Makes developing rest api's easy peasy (276 upvotes)
  • Easy setup, looks good (156 upvotes)
  • The best api workflow out there (144 upvotes)
  • It's the best (53 upvotes)
  • History feature (53 upvotes)
  • Adds real value to my workflow (44 upvotes)
  • Great interface that magically predicts your needs (43 upvotes)
  • The best in class app (35 upvotes)
  • Can save and share script (12 upvotes)
  • Fully featured without looking cluttered (10 upvotes)
  • Collections (8 upvotes)
  • Option to run scripts (8 upvotes)
  • Global/Environment Variables (8 upvotes)
  • Shareable Collections (7 upvotes)
  • Dead simple and useful. Excellent (7 upvotes)
  • Dark theme easy on the eyes (7 upvotes)
  • Awesome customer support (6 upvotes)
  • Great integration with newman (6 upvotes)
  • Documentation (5 upvotes)
  • Simple (5 upvotes)
  • The test script is useful (5 upvotes)
  • Saves responses (4 upvotes)
  • This has simplified my testing significantly (4 upvotes)
  • Makes testing API's as easy as 1,2,3 (4 upvotes)
  • Easy as pie (4 upvotes)
  • API-network (3 upvotes)
  • I'd recommend it to everyone who works with apis (3 upvotes)
  • Mocking API calls with predefined response (3 upvotes)
  • Now supports GraphQL (2 upvotes)
  • Postman Runner CI Integration (2 upvotes)
  • Easy to setup, test and provides test storage (2 upvotes)
  • Continuous integration using newman (2 upvotes)
  • Pre-request Script and Test attributes are invaluable (2 upvotes)
  • Runner (2 upvotes)
  • Graph (2 upvotes)
  • Useful tool (1 upvote)
          CONS OF POSTMAN
  • Stores credentials in HTTP (10 upvotes)
  • Bloated features and UI (9 upvotes)
  • Cumbersome to switch authentication tokens (8 upvotes)
  • Poor GraphQL support (7 upvotes)
  • Expensive (5 upvotes)
  • Not free after 5 users (3 upvotes)
  • Can't prompt for per-request variables (3 upvotes)
  • Import swagger (1 upvote)
  • Support websocket (1 upvote)
  • Import curl (1 upvote)

          related Postman posts

          Noah Zoschke
          Engineering Manager at Segment · 30 upvotes · 3M views

          We just launched the Segment Config API (try it out for yourself here) — a set of public REST APIs that enable you to manage your Segment configuration. A public API is only as good as its #documentation. For the API reference doc we are using Postman.

          Postman is an “API development environment”. You download the desktop app, and build API requests by URL and payload. Over time you can build up a set of requests and organize them into a “Postman Collection”. You can generalize a collection with “collection variables”. This allows you to parameterize things like username, password and workspace_name so a user can fill their own values in before making an API call. This makes it possible to use Postman for one-off API tasks instead of writing code.

          Then you can add Markdown content to the entire collection, a folder of related methods, and/or every API method to explain how the APIs work. You can publish a collection and easily share it with a URL.

          This turns Postman from a personal #API utility to full-blown public interactive API documentation. The result is a great looking web page with all the API calls, docs and sample requests and responses in one place. Check out the results here.

          Postman’s powers don’t end here. You can automate Postman with “test scripts” and have it periodically run collection scripts as “monitors”. We now have #QA around all the APIs in public docs to make sure they are always correct.

          Along the way we tried other techniques for documenting APIs like ReadMe.io or Swagger UI. These required a lot of effort to customize.

          Writing and maintaining a Postman collection takes some work, but the resulting documentation site, interactivity and API testing tools are well worth it.

          Simon Reymann
          Senior Fullstack Developer at QUANTUSflow Software GmbH · 27 upvotes · 5.4M views

          Our whole Node.js backend stack consists of the following tools:

          • Lerna as a tool for multi package and multi repository management
          • npm as package manager
          • NestJS as Node.js framework
          • TypeScript as programming language
          • ExpressJS as web server
          • Swagger UI for visualizing and interacting with the API’s resources
          • Postman as a tool for API development
          • TypeORM as object relational mapping layer
          • JSON Web Token for access token management

          The main reason we have chosen Node.js over PHP is related to the following artifacts:

          • Made for the web and widely in use: Node.js is a software platform for developing server-side network services. Well-known projects that rely on Node.js include the blogging software Ghost, the project management tool Trello and the operating system WebOS. Node.js requires the JavaScript runtime environment V8, which was specially developed by Google for the popular Chrome browser. This guarantees a very resource-saving architecture, which qualifies Node.js especially for the operation of a web server. Ryan Dahl, the developer of Node.js, released the first stable version on May 27, 2009. He developed Node.js out of dissatisfaction with the possibilities that JavaScript offered at the time. The basic functionality of Node.js has been mapped with JavaScript since the first version, which can be expanded with a large number of different modules. The current package managers (npm or Yarn) for Node.js know more than 1,000,000 of these modules.
          • Fast server-side solutions: Node.js adopts the JavaScript "event-loop" to create non-blocking I/O applications that conveniently serve simultaneous events. With the standard available asynchronous processing within JavaScript/TypeScript, highly scalable, server-side solutions can be realized. The efficient use of the CPU and the RAM is maximized and more simultaneous requests can be processed than with conventional multi-thread servers.
          • A language along the entire stack: Widely used frameworks such as React or AngularJS or Vue.js, which we prefer, are written in JavaScript/TypeScript. If Node.js is now used on the server side, you can use all the advantages of a uniform script language throughout the entire application development. The same language in the back- and frontend simplifies the maintenance of the application and also the coordination within the development team.
          • Flexibility: Node.js sets very few strict dependencies, rules and guidelines and thus grants a high degree of flexibility in application development. There are no strict conventions so that the appropriate architecture, design structures, modules and features can be freely selected for the development.