Flume

Posted by neverset on September 5, 2020

Apache Flume is an open-source, reliable and flexible system for collecting, aggregating and moving large amounts of unstructured data from multiple sources into HDFS or HBase (for example) in a distributed fashion, thanks to its strong coupling with the Hadoop cluster.

Flume is dying or dead because of:

  • Performance: processing speed depends on the buffering channel, which realistically cannot handle upwards of a million events in most use-cases
  • Order of events: there is no guarantee that events reach the sink in the same order they were sent by the client, which is absolutely unacceptable for some use-cases
  • Duplicate data: some events may be duplicated and sent more than once; at best this can be ignored, but in other cases it can be disastrous

flume OG (old generation)

Flume OG uses a layered architecture with three roles: Agent, Collector and Storage.

Agent

An agent collects data and generates the data flow. It supports two kinds of sources:

  • push sources: the external system pushes data into Flume
  • polling sources: Flume pulls data from the external system by polling

Collector

A collector aggregates the data flows coming from multiple agents.

Storage

Storage is the storage layer; it can be a plain file system, HDFS, Hive, HBase or another distributed storage system.

Master

The master coordinates and configures the agents and collectors. Flume OG supports multiple masters and uses ZooKeeper to keep the configuration consistent and highly available.

flume NG (new generation)

Flume NG consists of three basic components: source, channel and sink.

source

A source extracts data from a data generator and sends it to one or more channels in the Flume event format. The extraction can be event-driven or based on frequent polling, and is carried out by source runners.
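To make this concrete, a source is declared in the agent's properties file together with the channel it feeds. A minimal sketch of a complete agent (the agent name `a1` and the netcat source are chosen purely for illustration):

```properties
# minimal agent: one netcat source -> one memory channel -> one logger sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# source: listens on a TCP port; each received line becomes one Flume event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# channel: in-memory buffer between source and sink
a1.channels.c1.type = memory

# sink: logs events, useful for testing
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Such a file is started with `flume-ng agent --conf-file flume.properties --name a1`.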

channel

A channel is a temporary storage container. The choice of channel type comes down to a single factor: event reliability. Flume follows a transactional approach and does not remove an event from the channel until it has been moved into the sink and a confirmation has been received from the subsequent sink runner.

  • MemoryChannel: highest performance, but volatile and easily lost on failure
  • JDBC channel: slowest performance, but provides the strongest protection against failures
  • FileChannel: persists data to disk to preserve completeness and consistency; moderate performance and moderately secure against failures
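For illustration, this is how a FileChannel might be declared; the directory paths are placeholders and the capacity numbers are arbitrary:

```properties
# file channel: events are persisted to disk and survive agent restarts
a1.channels.c1.type = file
# where the channel keeps its checkpoint and event data (placeholders)
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
# maximum events held in the channel / per transaction
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
```

Swapping `type = file` for `type = memory` trades this durability for the higher throughput of the MemoryChannel.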

sink

A sink sends data from the channel to a storage system (HDFS, HBase, Hive, Elasticsearch, etc.) at a specified time interval.
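As a sketch of a time-driven sink, an HDFS sink can be told to roll a new file every N seconds (the path, agent name and interval below are placeholders):

```properties
# HDFS sink draining channel c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# target directory; the date escapes are expanded per event
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
# write raw event bodies instead of SequenceFiles
a1.sinks.k1.hdfs.fileType = DataStream
# roll a new file every 300 seconds; disable size- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# use the agent's clock to resolve the date escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```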

Fan In vs Fan Out

1) Fan In: consolidates data from multiple sources into one flow on the fly.

2) Fan Out: bifurcates one flow into several channels. The fan-out feature has two modes, i.e. replicating and multiplexing.

  • replicating: a source's event is copied across multiple channels and pushed downstream into the corresponding sinks
  • multiplexing: selective routing that works like filtering; for example, if an event is of type "Error" it is pushed to channels 1 and 3, otherwise to channel 2
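Fan-out is configured through a channel selector on the source. A sketch of the "Error" routing example above, assuming a hypothetical event header named `type` and illustrative channel names:

```properties
# fan-out: one source feeding three channels
a1.sources.r1.channels = c1 c2 c3

# multiplexing selector: route by the value of an event header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
# events with header type=Error go to channels c1 and c3
a1.sources.r1.selector.mapping.Error = c1 c3
# everything else goes to c2
a1.sources.r1.selector.default = c2
```

Omitting the selector lines (or setting `selector.type = replicating`, the default) gives the replicating behaviour, where every event is copied to all three channels.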