I am looking for the best tool to orchestrate ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?
For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?
I have been using Airflow for more than 2 years now and haven't thought about moving to any other platform. Coming back to your requirements, Airflow fits pretty well.
1. Dependency management - Airflow models a pipeline as a DAG (Directed Acyclic Graph). You create a DAG of tasks and declare which task depends on which; Airflow then takes care of running each task, or not running it if its parent task fails.
2. Integrations - The Airflow community has implemented integrations with various cloud services, Hadoop, Spark, and Jira. Though it doesn't have built-in integration for Informatica, you can run your own service as an Airflow task to handle all Informatica-related operations.
3. Monitoring - It's very easy to find, monitor, and manage jobs and pipelines, as Airflow provides a great consolidated UI.
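
To make the dependency behaviour concrete, here is a minimal sketch in plain Python of the pipeline from the question (Informatica job, then SQL task, then Jira task). It is not Airflow's API, only an illustration of the idea that a task is skipped when its parent fails. The task functions, task names, and status strings are all hypothetical placeholders; in Airflow you would declare the same graph with operators inside a DAG.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, upstream):
    """Run tasks in dependency order and skip tasks whose parent did not succeed.

    tasks:    maps a task name to a zero-argument callable
    upstream: maps a task name to the set of task names it depends on
    """
    status = {}
    # static_order() yields every task after all of its parents
    for name in TopologicalSorter(upstream).static_order():
        parents = upstream.get(name, set())
        if any(status[p] != "success" for p in parents):
            status[name] = "upstream_failed"  # parent failed or was skipped
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


# Hypothetical stand-ins for the real work
def run_informatica_job():
    pass  # e.g. call your own service that triggers the Informatica workflow


def run_sql_check():
    raise RuntimeError("row count mismatch")  # simulate a failing regression check


def update_jira():
    pass  # e.g. comment on or transition a Jira issue


tasks = {
    "informatica_etl": run_informatica_job,
    "sql_check": run_sql_check,
    "jira_update": update_jira,
}
upstream = {
    "informatica_etl": set(),
    "sql_check": {"informatica_etl"},
    "jira_update": {"sql_check"},
}

result = run_pipeline(tasks, upstream)
```

Here the SQL check fails, so the Jira task never runs. An orchestrator such as Airflow handles this bookkeeping for you, along with scheduling, retries, and the UI.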