Apache Flume: regex extractor with a multiplexing channel selector

I have the following Flume configuration for picking up server log entries that contain a specific numeric value and pushing them to the corresponding Kafka topic:

# Name the components on this agent
a1.sources = r1
a1.channels = c2 c3

# Describe/configure the source
a1.sources.r1.type = spooldir 
a1.sources.r1.spoolDir = /home/user/spoolFlume
a1.sources.r1.fileSuffix = .DONE
a1.sources.r1.basenameHeader = true
a1.sources.r1.deserializer.maxLineLength = 8192

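# Extract project_id from the event body and route events by its value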
a1.sources.r1.interceptors = i1 
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = (2725391)
a1.sources.r1.interceptors.i1.serializers = id 
a1.sources.r1.interceptors.i1.serializers.id.name = project_id
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = project_id 
a1.sources.r1.selector.mapping.2725391 = c3
a1.sources.r1.selector.default = c2


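# Describe the Kafka channels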
a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.brokerList = kafka10.profile:9092,kafka11.profile:9092,kafka12.profile:9092
a1.channels.c2.topic = flume_test_002
a1.channels.c2.zookeeperConnect = kafka10.profile:2181,kafka11.profile:2181,kafka12.profile:2181
# parseAsFlumeEvent defaults to true
a1.channels.c2.parseAsFlumeEvent = true

a1.channels.c3.type = org.apache.flume.channel.kafka.KafkaChannel                                       
a1.channels.c3.brokerList = kafka10.profile:9092,kafka11.profile:9092,kafka12.profile:9092
a1.channels.c3.topic = flume_test_003
a1.channels.c3.zookeeperConnect = kafka10.profile:2181,kafka11.profile:2181,kafka12.profile:2181
a1.channels.c3.parseAsFlumeEvent = true

# Bind the source to both channels (no sinks here: the Kafka channels write straight to Kafka)
a1.sources.r1.channels = c2 c3
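
For reference, here is a minimal Java sketch (not Flume's actual classes; the class name and sample line are invented) of what the regex_extractor / multiplexing pair above should do: the first capture group that matches the event body is stored under the project_id header, and the selector then routes the event by that header's value, falling back to c2.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoutingSketch {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(2725391)");   // i1.regex
        Map<String, String> mapping = new HashMap<>();
        mapping.put("2725391", "c3");               // selector.mapping.2725391

        String body = "00278388 pid 2725391 31.28.244.74"; // invented sample line
        Map<String, String> headers = new HashMap<>();

        Matcher m = p.matcher(body);
        if (m.find()) {
            headers.put("project_id", m.group(1));  // serializers.id.name
        }
        // selector.default = c2 when the header is absent or unmapped
        String channel = mapping.getOrDefault(headers.get("project_id"), "c2");
        System.out.println("event -> " + channel);  // prints: event -> c3
    }
}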
I did some tests with more complex regexps, and everything looked fine with cat | grep -E, but when I try to use them in the Flume configuration, not all entries are captured.
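
Note that the interceptor matches with java.util.regex (Matcher.find), not the POSIX ERE that grep -E uses, so a pattern that works in grep can behave differently in Flume. A quick check with the same engine Flume uses (a sketch; the log path is a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) throws IOException {
        Pattern p = Pattern.compile("(2725391)"); // same pattern as i1.regex
        long matches = 0;
        // Note: newBufferedReader decodes strict UTF-8 and throws
        // MalformedInputException on invalid bytes -- itself a useful signal here.
        try (BufferedReader r = Files.newBufferedReader(
                Paths.get("/home/user/spoolFlume/sample.log"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (p.matcher(line).find()) {
                    matches++;
                }
            }
        }
        System.out.println(matches + " matching lines"); // compare with grep -Ec
    }
}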

Now I use a one-word regexp, but even so not all entries are captured, i.e. not all of the "correct" entries reach the Kafka topic (for example, I have two strings with "2725391" in the log, but after processing I can see only one entry in Kafka).

It looks like something is wrong with the Flume configuration. Any advice would be appreciated.

UPDATE 2. What is more, when I parse short files (fewer than 100 lines) everything works fine. With files of around 2 GB I miss entries.
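
One suspect with files this large is deserializer.maxLineLength = 8192: the spooldir deserializer splits longer lines across several events, so an ID can be lost when the split happens to fall inside it. A diagnostic sketch for counting over-long lines (the path is a placeholder; ISO-8859-1 is used so the check itself cannot fail on non-UTF bytes):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineLengthCheck {
    public static void main(String[] args) throws IOException {
        final int MAX = 8192; // a1.sources.r1.deserializer.maxLineLength
        long overlong = 0;
        // ISO-8859-1 maps every byte to a char, so malformed UTF-8 cannot break the scan
        try (BufferedReader r = Files.newBufferedReader(
                Paths.get("/home/user/spoolFlume/big.log"), StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.length() > MAX) {
                    overlong++;
                }
            }
        }
        System.out.println(overlong + " lines longer than " + MAX + " chars");
    }
}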

UPDATE 3. I found a way to get all entries parsed:

a1.sources.r1.decodeErrorPolicy = IGNORE
This helped, because the events parsed into the Kafka channel had a strange symbol in their headers. I have no idea where it came from, since there is no such symbol in the raw log before processing :/

basename00278388pid2725391�31.28.244.74

The solution was to set JAVA_HOME to an appropriate value and to add the following setting:

a1.sources.r1.decodeErrorPolicy = IGNORE
The root of the problem was non-UTF characters in the logs.
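
(Besides IGNORE, decodeErrorPolicy also accepts FAIL, the default, and REPLACE.) To confirm the diagnosis, the raw log can be scanned with a strict UTF-8 decoder that reports the byte offset of every invalid sequence. A sketch (the path is a placeholder; it reads the whole file at once, so a multi-GB log would need a streaming variant):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("/home/user/spoolFlume/sample.log"));
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(8192);
        while (in.hasRemaining()) {
            CoderResult cr = dec.decode(in, out, true);
            if (cr.isError()) {
                System.out.println("invalid UTF-8 at byte offset " + in.position());
                in.position(in.position() + cr.length()); // skip the bad sequence
            }
            out.clear(); // only the offsets matter, not the decoded text
        }
    }
}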