Hadoop Flume + Kafka + HDFS: splitting messages

hadoop apache-kafka

I am using the following Flume agent configuration to read messages from a Kafka source and write them to an HDFS sink:

tier1.sources  = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = 192.168.0.100:2181
tier1.sources.source1.topic = test
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 100

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = 192.168.0.100:9092

tier1.channels.channel1.topic = test
tier1.channels.channel1.zookeeperConnect = 192.168.0.100:2181/kafka
tier1.channels.channel1.parseAsFlumeEvent = false

tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.writeFormat = Text
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.hdfs.filePrefix = test-kafka
tier1.sinks.sink1.hdfs.fileSuffix = .avro
tier1.sinks.sink1.hdfs.useLocalTimeStamp = true
tier1.sinks.sink1.hdfs.path = /tmp/kafka/%y-%m-%d
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.rollSize=0

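The Kafka message payload is Avro data. If only one Kafka message arrives per polling cycle, the data is serialized to the file correctly.

When two Kafka messages arrive in the same batch, they are grouped into the same HDFS file. Because each Avro message contains both a schema and data, the resulting file contains schema + data + schema + data, which makes it an invalid .avro file.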
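To make the failure easier to confirm, here is a minimal Java sketch that scans a byte buffer for the Avro object container header. It assumes each Kafka message is a complete Avro container file, which begins with the four magic bytes 'O', 'b', 'j', 0x01; the question only says "schema + data", so that is an assumption. The class and method names are hypothetical. More than one offset in the result suggests that several containers were appended to one file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flume or Avro.
class AvroHeaderScan {
    // Avro object container files start with 'O', 'b', 'j', 0x01.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Returns every offset at which the container magic bytes occur. */
    static List<Integer> headerOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + MAGIC.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < MAGIC.length; j++) {
                if (data[i + j] != MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Stand-in for a file holding two concatenated messages (payload bytes are fake).
        byte[] twoMessages = {'O', 'b', 'j', 1, 10, 20, 'O', 'b', 'j', 1, 30};
        System.out.println(headerOffsets(twoMessages)); // prints [0, 6]
    }
}
```

The same four bytes can in principle also appear inside a data block, so treat a hit as a hint rather than proof.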

How can I split the Avro events so that different Kafka messages are written to different files?

Thanks

One approach: suppose the Kafka topic carrying your incoming data is called "SourceTopic". You can register a custom sink against this "SourceTopic":

<FlumeNodeRole>.sinks.<your-sink>.type = net.my.package.CustomSink

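Inside CustomSink, you can write a method that distinguishes the incoming messages, splits them, and re-sends them to a different "DestinationTopic". That "DestinationTopic" can then serve as a new Flume source for file serialization.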
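The splitting step could look roughly like the sketch below. It is a naive, stdlib-only illustration, not Flume or Avro API: `NaiveAvroSplitter` and `split` are hypothetical names, and it assumes every message starts with the Avro container magic bytes 'O', 'b', 'j', 0x01. A custom sink would call something like this on a buffer and then publish each part separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper a custom sink might call; not a Flume class.
class NaiveAvroSplitter {
    // Assumed message prefix: the Avro object container magic bytes.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    private static boolean magicAt(byte[] data, int pos) {
        if (pos + MAGIC.length > data.length) {
            return false;
        }
        for (int j = 0; j < MAGIC.length; j++) {
            if (data[pos + j] != MAGIC[j]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Cuts the buffer at every occurrence of the magic bytes.
     * Bytes before the first occurrence are dropped; with no
     * occurrence at all the result is an empty list.
     */
    static List<byte[]> split(byte[] data) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            if (magicAt(data, i)) {
                starts.add(i);
            }
        }
        List<byte[]> parts = new ArrayList<>();
        for (int k = 0; k < starts.size(); k++) {
            int from = starts.get(k);
            int to = (k + 1 < starts.size()) ? starts.get(k + 1) : data.length;
            parts.add(Arrays.copyOfRange(data, from, to));
        }
        return parts;
    }

    public static void main(String[] args) {
        byte[] batch = {'O', 'b', 'j', 1, 10, 20, 'O', 'b', 'j', 1, 30};
        for (byte[] part : split(batch)) {
            System.out.println(part.length + " bytes");
        }
    }
}
```

Because the magic bytes may also occur inside a data block, this cut can be wrong on real payloads. A production version would more likely parse each container with the Avro library instead of scanning for a byte pattern.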

For chaining Flume flows into a pipeline, see the following link:
