Flume

Posted by neverset on September 5, 2020

Apache Flume is an open-source, reliable and flexible system for collecting, aggregating and moving large amounts of unstructured data from multiple sources into HDFS or HBase (for example) in a distributed fashion, thanks to its strong coupling with the Hadoop cluster.

Flume is dying or dead because of:

  • Performance: processing speed depends on the buffering channel, which realistically cannot handle upwards of a million events in most use-cases
  • Order of events: there is no guarantee that events reach the sink in the same order they were sent by the client, which is absolutely unacceptable for some use-cases
  • Duplicate data: some events may be duplicated and sent more than once; at best this can be ignored, but in other cases it can be disastrous

flume OG (old generation)

Flume OG uses a layered architecture with three roles: Agent, Collector and Storage.

Agent

An agent collects data and generates the data flow. It supports two kinds of sources:

  • push sources: the external system pushes data into Flume
  • polling sources: Flume pulls data from the external system by polling

Collector

A collector aggregates the data flows coming from multiple agents.

Storage

Storage is the storage layer; it can be a plain file system, HDFS, Hive, HBase or another distributed storage system.

Master

The master coordinates and configures the agents and collectors. Flume OG supports multiple masters and uses ZooKeeper to keep the configuration consistent and highly available.

flume NG (new generation)

Flume NG consists of three basic components: source, channel and sink.

source

A source extracts data from a data generator and sends it to one or more channels in the Flume event format. The extraction can be event-driven or based on frequent polling, and is carried out by source runners.
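To make this concrete, a source is declared in the agent's properties file together with the channel it feeds. A minimal sketch of a complete agent (the agent name `a1` and the netcat source are chosen purely for illustration):

```properties
# minimal agent: one netcat source -> one memory channel -> one logger sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# source: listens on a TCP port; each received line becomes one Flume event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# channel: in-memory buffer between source and sink
a1.channels.c1.type = memory

# sink: logs events, useful for testing
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Such a file is started with `flume-ng agent --conf-file flume.properties --name a1`.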

channel

A channel is a temporary storage container. The choice of channel type comes down to a single factor: event reliability. Flume follows a transactional approach and does not remove an event from the channel until it has been moved into the sink and a confirmation has been received from the subsequent sink runner.

  • MemoryChannel: highest performance, but volatile and easily lost on failure
  • JDBC channel: slowest performance, but provides the strongest protection against failures
  • FileChannel: persists data to disk to preserve completeness and consistency; moderate performance and moderately secure against failures
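For illustration, this is how a FileChannel might be declared; the directory paths are placeholders and the capacity numbers are arbitrary:

```properties
# file channel: events are persisted to disk and survive agent restarts
a1.channels.c1.type = file
# where the channel keeps its checkpoint and event data (placeholders)
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
# maximum events held in the channel / per transaction
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
```

Swapping `type = file` for `type = memory` trades this durability for the higher throughput of the MemoryChannel.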

sink

A sink sends data from the channel to a storage system (HDFS, HBase, Hive, Elasticsearch, etc.) at a specified time interval.
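As a sketch of a time-driven sink, an HDFS sink can be told to roll a new file every N seconds (the path, agent name and interval below are placeholders):

```properties
# HDFS sink draining channel c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# target directory; the date escapes are expanded per event
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
# write raw event bodies instead of SequenceFiles
a1.sinks.k1.hdfs.fileType = DataStream
# roll a new file every 300 seconds; disable size- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# use the agent's clock to resolve the date escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```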

Fan In vs Fan Out

1) Fan In: consolidates data from multiple sources into one flow on the fly.

2) Fan Out: bifurcates one flow into several channels. The fan-out feature has two modes, i.e. replicating and multiplexing.

  • replicating: a source's event is copied across multiple channels and pushed downstream into the corresponding sinks
  • multiplexing: selective routing that works like filtering; for example, if an event is of type "Error" it is pushed to channels 1 and 3, otherwise to channel 2
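Fan-out is configured through a channel selector on the source. A sketch of the "Error" routing example above, assuming a hypothetical event header named `type` and illustrative channel names:

```properties
# fan-out: one source feeding three channels
a1.sources.r1.channels = c1 c2 c3

# multiplexing selector: route by the value of an event header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
# events with header type=Error go to channels c1 and c3
a1.sources.r1.selector.mapping.Error = c1 c3
# everything else goes to c2
a1.sources.r1.selector.default = c2
```

Omitting the selector lines (or setting `selector.type = replicating`, the default) gives the replicating behaviour, where every event is copied to all three channels.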