flink

Posted by neverset on April 17, 2021

Flink is an open-source framework for stateful, large-scale, distributed, and fault-tolerant stream processing. Apache Flink allows to ingest massive streaming data (up to several terabytes) from different sources and process it in a distributed fashion way across multiple nodes, before pushing the derived streams to other services or applications such as Apache Kafka, DBs, and Elastic search. Flink enables real-time data analytics on streaming data and fits well for continuous Extract-transform-load (ETL) pipelines on streaming data and for event-driven applications as well. Flink offers multiple operations on data streams or sets such as mapping, filtering, grouping, updating state, joining, defining windows, and aggregating. Features:

  1. Robust Stateful Stream Processing: handle business logic that requires a contextual state while processing the data streams using its DataStream API at any scale
  2. Fault Tolerance: state recovery from faults based on a periodic and asynchronous checkpointing
  3. Exactly-Once Consistency Semantics: each event in the stream is delivered and processed exactly once in case of failures
  4. Scalability
  5. In-Memory Performance: perform all computations by accessing local, often in-memory, state yielding very low processing latencies
  6. seamless connectivity to data sources including Apache Kafka, Elasticsearch, Apache Cassandra, Kinesis
  7. Flexible deployment: can be deployed on various clusters environments (YARN, Apache Mesos, and Kubernetes)
  8. Complex Event Processing (CEP ) library to detect patterns (i.e., event sequences) in data streams
  9. Fluent APIs in Java and Scala
  10. Flink is a true streaming engine comparing to the micro-batch processing model of Spark Streaming

Data Abstraction

  1. DataStream
  2. DataSet

backend storage

  1. Memory State Backend
  2. File System Backend
  3. RocksDB backend