FTP: How to make Flume load files into HDFS when HDFS never closes file.tmp, and how to rename files by name

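I actually have two questions. The first: how can I make HDFS close a file (for example .123456789.tmp) once the Flume agent has flushed the whole file? In practice the file is never closed until I force the Flume agent to stop. I believe there is a way to do this with the following four parameters: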

hdfs.rollSize = 0
hdfs.rollCount = 0
hdfs.rollInterval = 0
hdfs.batchSize = 1000000
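
My second question: my Flume agent receives files from an SFTP server, and I need each file to keep its original name in HDFS. This works with the spooldir source type, but not with SFTP. Any ideas?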

My Flume agent configuration file looks like this:

agent.sources = r1
agent.channels = c1
agent.sinks = k

# configure ftp source
agent.sources.r1.type = org.keedio.flume.source.mra.source.Source
agent.sources.r1.client.source = sftp
agent.sources.r1.name.server = ip
agent.sources.r1.user = user
agent.sources.r1.password = secret
agent.sources.r1.port = 22
agent.sources.r1.knownHosts = ~/.ssh/known_hosts
agent.sources.r1.work.dir = /DATA/test/flumrFTP
agent.sources.r1.fileHeader = true
agent.sources.r1.basenameHeader = true
agent.sources.r1.inputCharset = ISO-8859-1
#agent.sources.r1.batchSize = 1000
agent.sources.r1.flushlines = true

# configure sink k
agent.sinks.k.type = hdfs
agent.sinks.k.hdfs.path = hdfs://hostname:8000/user/admin/DATA/import_flume/
agent.sinks.k.hdfs.filePrefix = %{basename}
agent.sinks.k.hdfs.rollCount = 0
agent.sinks.k.hdfs.rollInterval = 0
agent.sinks.k.hdfs.rollSize = 0
agent.sinks.k.hdfs.useLocalTimeStamp = true
agent.sinks.k.hdfs.batchSize = 1000000
agent.sinks.k.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 1000000

agent.sources.r1.channels = c1
agent.sinks.k.channel = c1
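
For comparison, the spooldir case mentioned in the second question relies on the source attaching a basename header to each event, which %{basename} in hdfs.filePrefix then resolves. Below is a rough sketch using the stock spooling directory source. The spoolDir path is hypothetical, and whether the keedio SFTP source honors the same basenameHeader property is not confirmed here.

```properties
# Stock Flume spooling directory source (not the keedio SFTP source)
agent.sources.r1.type = spooldir
# Hypothetical local directory, for illustration only
agent.sources.r1.spoolDir = /DATA/test/spool
# Attach the file's base name to every event as a header
agent.sources.r1.basenameHeader = true
# Header key; "basename" is the default and matches %{basename} below
agent.sources.r1.basenameHeaderKey = basename
agent.sources.r1.channels = c1

# The HDFS sink resolves %{basename} from that event header
agent.sinks.k.hdfs.filePrefix = %{basename}
```

If events from the SFTP source arrive without that header, %{basename} has nothing to resolve, so checking the headers the source actually sets is probably the first thing to look at.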

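Try setting the hdfs.rollInterval variable:

hdfs.rollInterval: number of seconds to wait before rolling the current file

This setting closes the file after the number of seconds you set. I set mine to 200 seconds, and I am loading smaller files.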
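
A minimal sketch of that change applied to the sink from the question, assuming the sink is still named k. The value 200 is the one mentioned above. The hdfs.idleTimeout line is an addition that is not part of the original answer: it is a standard HDFS sink option that closes a file after the given number of seconds without writes, and the value 60 is only illustrative.

```properties
# Size- and count-based rolling stay disabled (0 = never roll on this criterion)
agent.sinks.k.hdfs.rollSize = 0
agent.sinks.k.hdfs.rollCount = 0
# Roll (close the .tmp file and rename it) every 200 seconds
agent.sinks.k.hdfs.rollInterval = 200
# Assumption: also close files that receive no events for 60 seconds
agent.sinks.k.hdfs.idleTimeout = 60
```

With all three roll settings at 0, as in the question, no rolling criterion is ever met, which would explain why the .tmp file stays open until the agent stops. Note that a time-based roll can split one large source file across several HDFS files if it takes longer than the interval to deliver.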
