We currently have 2 Kafka Streams topics that have records coming in continuously. We're looking into joining the 2 streams based on a key with a window of 5 minutes based on their timestamp.
Should I consider kStream - kStream join or Apache Flink window joins? Or is there any other better way to achieve this?
Hi !
I would say it depends on three things:
- the volume (number of events by seconds)
- the versatility you need.
- the vendor lock on
If volume is an issue then maybe flink may be a better option. Flink is distributed and is architectured as such.
If you are ok with having everything based on Kafka then maybe Kafka streams is a better option, but if at some point you need to integrate with several systems maybe you should consider flink.
If you are okay with sticking with confluent tools it’s okay, but beware of technology. Flink is more generalist than Kafka streams.
Hi, If you are already working with Kafka Streams, I suggest to continue with that otherwise there is capability that not exists there or not fitting your needs..
Thank you for your input. We're talking ~ 6mil records per minute so a significant amount of data. We also want to join records within a small window and push them to a different topic. We're okay with something out of confluent so looks like Fink is a better option here.
Hi I would recommend to continue with Kafka tools. Kafka Streams integrates seamlessly with Kafka, and you also get the benefits of automatic schema management if you choose to use Schema Registry. I see that some recommend Flink with high volumes, but Kafka Streams is perfectly capable of doing high volumes and windowed operations as well. I guess the vendor lock in, that is also mentioned, is probably done equally on the messaging side as with your processing application.
Further details can be found here: https://www.confluent.io/blog/apache-flink-apache-kafka-streams-comparison-guideline-users/
Good luck with your choice!