We're working on a project for a company that has incoming data from 2 sources, namely MongoDB and MySQL. We need to combine the data from these 2 sources and land it in PostgreSQL in near real-time — roughly 600,000 records per day. Which tool would be better for this use case: Airflow or Kafka?
For getting the data out of MySQL and MongoDB, I would recommend using the Flink CDC connectors (https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc.html and https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mongodb-cdc.html). With Apache Flink's JDBC sink connector, you can then write the data to PostgreSQL: https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/jdbc/.
You can run all of this in a single Flink job (i.e. reading from both sources and writing to one destination) — no Kafka or Airflow required.
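To make the single-job idea concrete, here is a rough sketch in Flink SQL of what such a pipeline could look like. The table schemas, hostnames, credentials, and the join itself are made-up placeholders — you'd adapt them to your actual data model — but the connector names (`mysql-cdc`, `mongodb-cdc`, `jdbc`) and their options come from the linked connector docs:

```sql
-- Hypothetical MySQL table, read via the mysql-cdc connector
CREATE TABLE mysql_orders (
  id INT,
  customer_id INT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'flink',
  'password' = 'secret',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

-- Hypothetical MongoDB collection, read via the mongodb-cdc connector
CREATE TABLE mongo_customers (
  _id STRING,
  name STRING,
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector' = 'mongodb-cdc',
  'hosts' = 'mongo-host:27017',
  'username' = 'flink',
  'password' = 'secret',
  'database' = 'shop',
  'collection' = 'customers'
);

-- PostgreSQL destination, written via the JDBC connector
CREATE TABLE pg_enriched_orders (
  id INT,
  customer_name STRING,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://pg-host:5432/analytics',
  'table-name' = 'enriched_orders',
  'username' = 'flink',
  'password' = 'secret'
);

-- One continuous job: combine both sources, write to PostgreSQL
INSERT INTO pg_enriched_orders
SELECT o.id, c.name, o.amount
FROM mysql_orders AS o
JOIN mongo_customers AS c
  ON CAST(o.customer_id AS STRING) = c._id;
```

The `INSERT INTO ... SELECT` runs as a continuously updating streaming job, so changes captured from either source are reflected in PostgreSQL without any batch scheduling. At ~600,000 records per day this is a very light load for Flink.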
If you are looking for a cloud SaaS solution, you could also check out my company's offering: https://www.decodable.co/. In Decodable you can define all these sources and sinks in a GUI, and the platform takes care of running them for you. You can even try it for free.
According to the Airflow documentation, "Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)", so it won't give you real-time data. You could use Kafka combined with some Spark streaming jobs. Another option is a change data capture (CDC) tool such as Airbyte or Debezium combined with Kafka.
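If you go the Debezium + Kafka route, capturing changes means registering a connector with Kafka Connect. As a sketch, a MySQL source connector registration could look roughly like this — all hostnames, credentials, and database/table names below are placeholders, and the property names are from the standard Debezium MySQL connector configuration (check them against your Debezium version, as some were renamed across releases):

```json
{
  "name": "mysql-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "database.server.name": "shop",
    "database.include.list": "shop",
    "table.include.list": "shop.orders",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.shop"
  }
}
```

You'd POST this JSON to the Kafka Connect REST endpoint (`/connectors`), after which row-level changes from `shop.orders` appear as events on a Kafka topic, ready for a downstream consumer (e.g. a Spark or Kafka Connect JDBC sink job) to write into PostgreSQL. Note this is considerably more moving parts (Kafka, Connect, a sink job) than the single-Flink-job approach in the other answer.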