I need to build an Alert & Notification framework using a scheduled program. We will analyze the events from a database table, filter the events that fall within a one-day timespan, and send these event messages over email. Currently, we are using Kafka Pub/Sub for messaging. The customer wants us to move to Apache Flink, and I am trying to understand how Apache Flink could be a better fit for us.
I recommend Apache Flink because it is the pro tool for everybody who has a serious stream processing use case. Flink is used by huge companies, such as Uber, Alibaba or Netflix. AWS offers Flink as a hosted service. The reasons these companies decide for Flink are manifold: Flink offers great performance, support for very large state, exactly-once processing semantics, different APIs (with SQL growing a lot lately), ... Flink supports many different deployment models, including Kubernetes, Hadoop YARN or custom deployments.
The drawbacks of Apache Flink are a moderately steep learning curve and the sheer number of options to choose from (APIs, deployment models, state backends, ...).
These are my personal views, and I have a bias towards Flink, because I've worked a lot on it:
Flink and Kafka (the message bus) work together very well, and that's also the most popular combination (I'm guessing). There's also Kafka Streams, a stream processing library using Kafka (the message bus) as a data transport layer. Some considerations of Kafka Streams vs Flink:
- KStreams has a hard dependency on Kafka. Flink is independent of the message bus and can easily read from and write to many systems (KStreams requires Kafka Connect for that).
- Since KStreams does its data exchange via Kafka topics, it puts a lot of load on the Kafka cluster (size it appropriately). Monitoring becomes difficult because processing and data storage share the same cluster. Do you really want your production data being discarded because your processing is eating up all your IO?
- Flink is the older project and has been battle-tested for many years across a lot of different scenarios. There are more libraries, such as a CEP (Complex Event Processing) library, and more and more machine learning integrations.
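
Whichever engine you pick, the core of the use case in the question (keep only events from the last day, then turn each one into an email message) is a small filter-and-map step. Here is a minimal sketch of that logic in plain Python, with no Flink or Kafka dependency. The `Event` fields, the function names, and the 24-hour window boundaries are all assumptions made for illustration, since the question does not describe the table schema or the email sender.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real table columns are not given in the question.
    event_id: int
    message: str
    occurred_at: datetime


def events_in_last_day(events, now):
    """Return events whose timestamp falls in the window (now - 24h, now]."""
    window_start = now - timedelta(days=1)
    return [e for e in events if window_start < e.occurred_at <= now]


def format_alert(event):
    """Build the plain-text body that would be handed to an email sender."""
    return f"[{event.occurred_at.isoformat()}] event {event.event_id}: {event.message}"


# Fixed "now" so the example is deterministic.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
sample = [
    Event(1, "disk almost full", now - timedelta(hours=3)),
    Event(2, "old warning", now - timedelta(days=2)),
    Event(3, "login failure", now - timedelta(hours=23, minutes=59)),
]

recent = events_in_last_day(sample, now)
alerts = [format_alert(e) for e in recent]
```

In a streaming job the same predicate and formatter would sit in the filter and map stages of the pipeline, with the email sender as the sink. The sketch leaves that wiring out on purpose, so it says nothing about how Flink itself should be configured.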