The official definition of Flume is as follows:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
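Flume is a distributed tool for collecting massive volumes of log data. Because its data sources are customizable, it is not limited to collecting logs.
Flume has the following features:
Built-in support for many source and destination types
Horizontal scaling
Several flow topologies, for example multi-hop flows, fan-in fan-out flows, ...
Contextual routing
Interceptors
Reliable delivery. In Flume, each event is covered by two transactions, one in the send phase and one in the receive phase. The sender sends the event to the receiver. After receiving the data, the receiver commits its own transaction and sends a success signal back to the sender. The sender commits its own transaction only after it receives that signal.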
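The two-transaction handoff described under reliable delivery can be sketched roughly as follows. This is a minimal illustration, not Flume's actual Java API: the names Channel, Transaction, and deliver are hypothetical, and the in-memory lists stand in for whatever storage a real channel would use. The point it tries to show is the ordering: the receiver commits first, and the sender commits only after the success signal, so a failure in between leaves the event with the sender.

```python
# Hypothetical names throughout: Channel, Transaction, and deliver are
# illustrative only and do not correspond to Flume's real classes.

class Channel:
    """An in-memory event queue standing in for a sender or receiver."""

    def __init__(self, events=()):
        self.events = list(events)


class Transaction:
    """Stages changes to a channel until commit; rollback undoes them."""

    def __init__(self, channel):
        self.channel = channel
        self.taken = []    # events removed from the channel, pending commit
        self.staged = []   # events to be added to the channel on commit

    def take(self):
        event = self.channel.events.pop(0)
        self.taken.append(event)
        return event

    def stage(self, event):
        self.staged.append(event)

    def commit(self):
        # Staged events become visible; taken events are gone for good.
        self.channel.events.extend(self.staged)
        self.taken, self.staged = [], []

    def rollback(self):
        # Put taken events back at the head and drop staged events.
        self.channel.events[:0] = self.taken
        self.taken, self.staged = [], []


def deliver(sender, receiver, fail_on_receive=False):
    """Move one event from sender to receiver; return True on success."""
    send_tx = Transaction(sender)
    recv_tx = Transaction(receiver)
    event = send_tx.take()
    try:
        if fail_on_receive:
            raise ConnectionError("simulated failure before receiver commit")
        recv_tx.stage(event)
        recv_tx.commit()       # receiver commits its transaction first
    except ConnectionError:
        recv_tx.rollback()
        send_tx.rollback()     # no success signal, so the sender keeps the event
        return False
    send_tx.commit()           # success signal received, sender commits
    return True


if __name__ == "__main__":
    a, b = Channel(["e1", "e2"]), Channel()
    print(deliver(a, b, fail_on_receive=True), a.events, b.events)
    print(deliver(a, b), a.events, b.events)
```

In this sketch a failed delivery leaves the sender's queue unchanged, so the event can be retried later; an event is removed from the sender only once the receiver has committed it.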
Flume was originally designed to copy streaming data from multiple web servers into HDFS. So why not simply use put to load the data into HDFS? If we need real-time analysis of rapidly growing data, the data delivered by put is already no longer real-time.
Likewise, rsync