Airflow
We are a young start-up with 2 developers and a team in India looking to choose our next ETL tool. We have a few processes in Azure Data Factory but are looking to switch to a better platform. We are debating between Trifacta and Airflow, or even staying with Azure Data Factory. The use case will be to feed data to front-end APIs.
We're looking to do a project for a company that has incoming data from 2 sources, namely MongoDB and MySQL. We need to combine data from these 2 sources and sync it in real time to PostgreSQL, ideally at about 600,000 records per day. Which tool would be better for this use case: Airflow or Kafka?
For getting the data out of MySQL and MongoDB, I would recommend using the Flink CDC connectors (https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html and https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mongodb-cdc.html). With Apache Flink's JDBC sink connector, you'll be able to send the data to PostgreSQL: https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/jdbc/.
You'll be able to run all of this in a single Flink job (i.e. reading from the two sources and writing to one destination), with no Kafka or Airflow required.
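Roughly, that single job could look like the sketch below using the Flink SQL / Table API. Table names, columns, credentials, and hosts are placeholders, the CDC and JDBC connector JARs have to be on the Flink classpath, and the MongoDB side would be a second CDC table defined the same way and unioned or joined in as needed.

```python
# Sketch of a single PyFlink job: MySQL CDC source -> PostgreSQL JDBC sink.
# All names and connection options below are placeholders / assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Change stream from MySQL via the Flink CDC connector (hypothetical schema).
t_env.execute_sql("""
    CREATE TABLE orders_mysql (
        id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(10, 2),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'mysql-host',
        'port' = '3306',
        'username' = 'flink',
        'password' = 'secret',
        'database-name' = 'shop',
        'table-name' = 'orders'
    )
""")

# Upsert sink into PostgreSQL via the JDBC connector.
t_env.execute_sql("""
    CREATE TABLE orders_pg (
        id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(10, 2),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://pg-host:5432/analytics',
        'table-name' = 'orders',
        'username' = 'flink',
        'password' = 'secret'
    )
""")

# One continuous job: read the change stream, write upserts to PostgreSQL.
t_env.execute_sql(
    "INSERT INTO orders_pg SELECT id, customer_id, amount FROM orders_mysql"
).wait()
```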
If you are looking for a cloud SaaS solution, you could also check out my company's offering: https://www.decodable.co/. In Decodable you can define all these sources and sinks in a GUI, and the platform takes care of running it for you. You can even try it for free.
According to the Airflow documentation, "Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)", so you won't be able to provide real-time data with it. You could use Kafka combined with some Spark jobs. Another option is a change data capture (CDC) tool such as Airbyte or Debezium combined with Kafka.
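If you go the Debezium + Kafka route, the sink side can be as simple as the sketch below: consume change events from the Kafka topic and upsert them into PostgreSQL. The topic name, table schema, and connection details are assumptions, it assumes the Debezium JSON envelope with schemas enabled, and it skips deletes and schema metadata entirely.

```python
# Minimal sketch: consume Debezium change events from Kafka, upsert to Postgres.
import json

import psycopg2
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "pg-sink",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["dbserver1.shop.orders"])  # hypothetical Debezium topic

pg = psycopg2.connect("dbname=analytics user=etl password=secret host=pg-host")

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Debezium envelope (assumes JSON converter with schemas enabled).
        row = event.get("payload", {}).get("after")
        if not row:
            continue  # skip deletes/tombstones in this simplified sketch
        with pg, pg.cursor() as cur:
            cur.execute(
                """
                INSERT INTO orders (id, customer_id, amount)
                VALUES (%(id)s, %(customer_id)s, %(amount)s)
                ON CONFLICT (id) DO UPDATE
                SET customer_id = EXCLUDED.customer_id,
                    amount = EXCLUDED.amount
                """,
                row,
            )
finally:
    consumer.close()
    pg.close()
```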
We have some lambdas we need to orchestrate to get our workflow going. In the past, we already attempted to use Airflow as the orchestrator, but the need to coordinate the tasks in a database generates an overhead that we cannot afford. For our use case, there are hundreds of inputs per minute, and we need to scale to support all the inputs and have an efficient way to analyze them later. The ideal product would be AWS Step Functions, since it can manage our load demand gracefully, but it is too expensive and we cannot afford that. So, I would like to get alternatives for an orchestrator that does not need a complex backend, can manage hundreds of inputs per minute, and is not too expensive.
I think the problem is that you need some kind of "real-time orchestration", and thus completely different tools designed for that. It seems like a product design problem: you probably need a message queue (RabbitMQ, Kafka, SQS) plus your own processing/gateway service (perhaps another Lambda) that reacts to those inputs.
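As a rough sketch of that idea, assuming SQS in front of a processing Lambda: the queue absorbs the bursts instead of a workflow engine's database, and the Lambda consumes records in batches. The names and the downstream work are placeholders.

```python
# Sketch of the "queue + processing Lambda" pattern for an SQS event source.
import json


def handler(event, context):
    """AWS Lambda handler invoked by an SQS event source mapping."""
    records = event.get("Records", [])
    for record in records:
        payload = json.loads(record["body"])
        process_input(payload)
    return {"processed": len(records)}


def process_input(payload):
    # Placeholder for the real work: call another service, write results to
    # S3/DynamoDB for later analysis, etc.
    print(f"processing input {payload.get('id')}")
```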
I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Another option would be to have one task that kicks off the 10k containers and monitors it from there.
I have no experience with AWS Step Functions but have heard it's AWS's Airflow. There seem to be plenty of patterns online for Step Functions + Batch. Do Step Functions seem like a good path to check out for my use case? Do you get the same insight into failing jobs / ability to retry tasks as you do with Airflow?
On one hand, Step Functions can be compared to Airflow: both do distributed flow control. But there is a difference: Step Functions is more autonomous; you can specify retry rules and so on, but only before you start execution. It is not possible to have the same level of manual control as in Airflow. There are also limits on the number of executed activities (not small, but still restrictive). If you have a complex, not fully automated process, I think Airflow is still good for you. Given your scale, you just need to spawn and monitor Batch jobs, and that fits Airflow tasks perfectly. This is my personal opinion.
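To make the "one task kicks off the 10k containers and monitors them" option concrete, here is a hedged sketch using the Airflow 2.x TaskFlow API and raw boto3 calls. The job queue and job definition names are placeholders, and the Amazon provider also ships Batch operators/sensors that could replace the boto3 calls; in practice you would likely chunk the submissions or persist job ids somewhere other than XCom rather than passing a 10k-element list between tasks.

```python
# Sketch: one Airflow task submits N AWS Batch jobs, a second task polls them.
import time

import boto3
from airflow.decorators import dag, task
from pendulum import datetime


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def batch_fanout():

    @task
    def submit_jobs(num_shards: int = 10_000) -> list[str]:
        batch = boto3.client("batch")
        job_ids = []
        for shard in range(num_shards):
            resp = batch.submit_job(
                jobName=f"process-shard-{shard}",
                jobQueue="my-job-queue",            # placeholder
                jobDefinition="my-job-definition",  # placeholder
                containerOverrides={
                    "environment": [{"name": "SHARD", "value": str(shard)}]
                },
            )
            job_ids.append(resp["jobId"])
        return job_ids

    @task
    def wait_for_jobs(job_ids: list[str]) -> None:
        batch = boto3.client("batch")
        pending = set(job_ids)
        while pending:
            chunk = list(pending)[:100]  # describe_jobs takes at most 100 ids
            for job in batch.describe_jobs(jobs=chunk)["jobs"]:
                if job["status"] == "SUCCEEDED":
                    pending.discard(job["jobId"])
                elif job["status"] == "FAILED":
                    raise RuntimeError(f"Batch job {job['jobId']} failed")
            time.sleep(30)

    wait_for_jobs(submit_jobs())


batch_fanout()
```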
I am looking for the best tool to orchestrate #ETL workflows in non-Hadoop environments, mainly for regression testing use cases. Would Airflow or Apache NiFi be a good fit for this purpose?
For example, I want to run an Informatica ETL job and then run an SQL task as a dependency, followed by another task from Jira. What tool is best suited to set up such a pipeline?
I have been using Airflow for more than 2 years now and haven't thought about moving to any other platform. Coming back to your requirements, Airflow fits pretty well.
1. It has an excellent way to manage dependent tasks using DAGs (Directed Acyclic Graphs). You create a DAG with tasks, declare which task depends on which, and Airflow takes care of not running a task if its parent task fails.
2. Integrations - The Airflow community has implemented integrations with various cloud services, Hadoop, Spark, and Jira. Though it doesn't have built-in integration with Informatica, you can run your own code in Airflow as a task (which can handle all Informatica-related operations).
3. It's very easy to find, monitor, and manage jobs/pipelines, as Airflow provides a great consolidated UI.
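For the pipeline you describe, a minimal DAG sketch could look like the following. The pmcmd command, the SQL file, and the Jira callable are placeholders (assuming a recent Airflow 2.x), and dedicated provider operators can replace them where they exist.

```python
# Sketch: Informatica workflow -> SQL task -> Jira update, run in sequence.
from pendulum import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def update_jira_ticket():
    # Placeholder: call the Jira REST API (or use a Jira provider operator).
    pass


with DAG(
    dag_id="informatica_sql_jira",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_informatica = BashOperator(
        task_id="run_informatica_workflow",
        bash_command="pmcmd startworkflow -sv my_int_service -d my_domain wf_load_sales",
    )

    run_sql = BashOperator(
        task_id="run_sql_checks",
        bash_command="psql -f /opt/sql/post_load_checks.sql",
    )

    update_jira = PythonOperator(
        task_id="update_jira",
        python_callable=update_jira_ticket,
    )

    # Downstream tasks only run if their parent succeeds.
    run_informatica >> run_sql >> update_jira
```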
Hey Sathya! With Airflow, you are able to create custom hooks and operators to trigger various types of jobs. There may be ones that already exist for Informatica, but I am unsure. Would be happy to connect to discuss further if you are interested. josh@astronomer.io
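For reference, a custom operator is just a BaseOperator subclass with an execute() method; the Informatica-specific details in this sketch are hypothetical, and in practice execute() would shell out to pmcmd or call Informatica's web services, with a matching hook holding the connection logic.

```python
# Minimal skeleton of a custom Airflow operator (Informatica parts are hypothetical).
from airflow.models.baseoperator import BaseOperator


class InformaticaWorkflowOperator(BaseOperator):
    """Starts an Informatica workflow and waits for it to finish (sketch)."""

    def __init__(self, workflow_name: str, folder: str, **kwargs):
        super().__init__(**kwargs)
        self.workflow_name = workflow_name
        self.folder = folder

    def execute(self, context):
        self.log.info(
            "Starting Informatica workflow %s in folder %s",
            self.workflow_name,
            self.folder,
        )
        # Placeholder: invoke pmcmd / the Informatica API and poll for status,
        # raising an exception on failure so Airflow marks the task as failed.
```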
I am looking for an open-source scheduler tool with cross-functional application dependencies. Some of the tasks I am looking to schedule are as follows:
- Trigger Matillion ETL loads
- Trigger Attunity Replication tasks that have downstream ETL loads
- Trigger GoldenGate Replication tasks
- Shell scripts, wrappers, file watchers
- Event-driven schedules
I have used Airflow in the past, and I know we need to create DAGs for each pipeline. I am not familiar with Jenkins, but I know it works through configuration without much underlying code. I want to evaluate both and would appreciate any advice.
Hi Teja, Jenkins is more of a CI/CD tool for triggering build/test and other CD tasks. From what you're describing, you may be able to get by with #Jenkins by adding lots of plugins and creating pipelines, but eventually you're going to need to learn the #Groovy language to orchestrate all those tasks, which ends up similar to what you'd do with #Airflow. So, IMHO, Airflow is more for production scheduled tasks and Jenkins is more for CI/CD non-production tasks.
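To make the Airflow side concrete, here is a hedged sketch covering a few items from your list: a FileSensor as the file watcher, a REST call as the Matillion trigger (the same pattern would cover Attunity/GoldenGate), and a BashOperator for the shell wrapper. The endpoint URL, credentials, and paths are placeholders, not the vendors' actual APIs, and it assumes a recent Airflow 2.x.

```python
# Sketch: file watcher -> trigger Matillion load -> run shell wrapper.
import requests
from pendulum import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def trigger_matillion_job():
    # Placeholder call to a Matillion REST endpoint (URL/auth are assumptions).
    resp = requests.post(
        "https://matillion.example.com/rest/v1/.../job/name/load_orders/run",
        auth=("user", "password"),
        timeout=30,
    )
    resp.raise_for_status()


with DAG(
    dag_id="cross_functional_schedule",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_landing_file",
        filepath="/data/incoming/orders.csv",  # placeholder path
        poke_interval=60,
    )

    trigger_matillion = PythonOperator(
        task_id="trigger_matillion_load",
        python_callable=trigger_matillion_job,
    )

    run_wrapper = BashOperator(
        task_id="run_shell_wrapper",
        bash_command="/opt/etl/wrappers/post_load.sh ",
    )

    wait_for_file >> trigger_matillion >> run_wrapper
```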