Apache Kafka: Loading Apache server logs into HDFS using Kafka

I want to load Apache server logs into HDFS using Kafka.
Create the topic:

./kafka-topics.sh --create --zookeeper 10.25.3.207:2181 --replication-factor 1 --partitions 1 --topic lognew

Tail the Apache access log and pipe it into the producer:

tail -f  /var/log/httpd/access_log |./kafka-console-producer.sh --broker-list 10.25.3.207:6667 --topic lognew

Start the consumer in another terminal (from the Kafka bin directory):

./kafka-console-consumer.sh --zookeeper 10.25.3.207:2181 --topic lognew --from-beginning

The camus.properties file is configured as follows:

# Needed Camus properties, more cleanup to come
# final top-level data output directory, sub-directory will be dynamically created for each topic pulled
etl.destination.path=/user/root/topics
# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
etl.execution.base.path=/user/root/exec
# where completed Camus job output directories are kept, usually a sub-dir in the base.path
etl.execution.history.path=/user/root/camus/exec/history

# Kafka-0.8 handles all zookeeper calls
#zookeeper.hosts=
#zookeeper.broker.topics=/brokers/topics
#zookeeper.broker.nodes=/brokers/ids

# Concrete implementation of the Encoder class to use (used by Kafka Audit, and thus optional for now)
camus.message.encoder.class=com.linkedin.camus.etl.kafka.coders.DummyKafkaMessageEncoder

# Concrete implementation of the Decoder class to use
#camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder

# Used by avro-based Decoders to use as their Schema Registry
#kafka.message.coder.schema.registry.class=com.linkedin.camus.example.schemaregistry.DummySchemaRegistry

# Used by the committer to arrange .avro files into a partitioned scheme. This will be the default partitioner for all
# topic that do not have a partitioner specified
#etl.partitioner.class=com.linkedin.camus.etl.kafka.coders.DefaultPartitioner

# Partitioners can also be set on a per-topic basis
#etl.partitioner.class.<topic-name>=com.your.custom.CustomPartitioner

# all files in this dir will be added to the distributed cache and placed on the classpath for hadoop tasks
# hdfs.default.classpath.dir=

# max hadoop tasks to use, each task can pull multiple topic partitions
mapred.map.tasks=30
# max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1
# events with a timestamp older than this will be discarded.
kafka.max.historical.days=3
# Max minutes for each mapper to pull messages (-1 means no limit)
kafka.max.pull.minutes.per.task=-1

# if whitelist has values, only whitelisted topic are pulled.  nothing on the blacklist is pulled
#kafka.blacklist.topics=
kafka.whitelist.topics=lognew
log4j.configuration=true

# Name of the client as seen by kafka
kafka.client.name=camus
# Fetch Request Parameters
#kafka.fetch.buffer.size=
#kafka.fetch.request.correlationid=
#kafka.fetch.request.max.wait=
#kafka.fetch.request.min.bytes=
# Connection parameters.
kafka.brokers=10.25.3.207:6667
#kafka.timeout.value=


#Stops the mapper from getting inundated with Decoder exceptions for the same topic
#Default value is set to 10
max.decoder.exceptions.to.print=5

#Controls the submitting of counts to Kafka
#Default value set to true
post.tracking.counts.to.kafka=true
monitoring.event.class=class.that.generates.record.to.submit.counts.to.kafka

# everything below this point can be ignored for the time being, will provide more documentation down the road
##########################
etl.run.tracking.post=false
#kafka.monitor.tier=
#etl.counts.path=
kafka.monitor.time.granularity=10

etl.hourly=hourly
etl.daily=daily
etl.ignore.schema.errors=false

# configure output compression for deflate or snappy. Defaults to deflate
etl.output.codec=deflate
etl.deflate.level=6
#etl.output.codec=snappy

etl.default.timezone=America/Los_Angeles
etl.output.file.time.partition.mins=60
etl.keep.count.files=false
etl.execution.history.max.of.quota=.8

mapred.output.compress=true
mapred.map.max.attempts=1

kafka.client.buffer.size=20971520
kafka.client.so.timeout=60000

#zookeeper.session.timeout=
#zookeeper.connection.timeout=

Here is the error:

[CamusJob] - Fetching metadata from broker 10.25.3.207:6667 with client id camus for 0 topic(s) []
[CamusJob] - failed to create decoder
com.linkedin.camus.coders.MessageDecoderException: com.linkedin.camus.coders.MessageDecoderException: java.lang.NullPointerException
    at com.linkedin.camus.etl.kafka.coders.MessageDecoderFactory.createMessageDecoder(MessageDecoderFactory.java:28)
    at com.linkedin.camus.etl.kafka.mapred.EtlInputFormat.createMessageDecoder(EtlInputFormat.java:390)
    at com.linkedin.camus.etl.kafka.mapred.EtlInputFormat.getSplits(EtlInputFormat.java:264)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at com.linkedin.camus.etl.kafka.CamusJob.run(CamusJob.java:280)
    at com.linkedin.camus.etl.kafka.CamusJob.run(CamusJob.java:608)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.linkedin.camus.etl.kafka.CamusJob.main(CamusJob.java:572)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: com.linkedin.camus.coders.MessageDecoderException: java.lang.NullPointerException
    at com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder.init(KafkaAvroMessageDecoder.java:40)
    at com.linkedin.camus.etl.kafka.coders.MessageDecoderFactory.createMessageDecoder(MessageDecoderFactory.java:24)
    ... 22 more
Caused by: java.lang.NullPointerException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:195)
    at com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder.init(KafkaAvroMessageDecoder.java:31)
    ... 23 more
[CamusJob] - Discarding topic (Decoder generation failed) : avrotopic
[CamusJob] - failed to create decoder

Please suggest how to resolve this issue. Thanks in advance.

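I have never used Camus, but I believe this is a Kafka-related error that has to do with how you are encoding/decoding your messages. I believe the important lines in the stack trace are: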

Caused by: com.linkedin.camus.coders.MessageDecoderException: java.lang.NullPointerException
  at com.linkedin.camus.etl.kafka.coders.KafkaAvroMessageDecoder.init(KafkaAvroMessageDecoder.java:40)
  at com.linkedin.camus.etl.kafka.coders.MessageDecoderFactory.createMessageDecoder(MessageDecoderFactory.java:24)

How are you telling Kafka to use your Avro encoding? You have commented out the following line in your configuration:

#kafka.message.coder.schema.registry.class=com.linkedin.camus.example.schemaregistry.DummySchemaRegistry

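So are you setting it somewhere else in your code? If you are not, I suggest uncommenting that configuration value and setting it to whichever Avro class you are trying to decode/encode.

Getting the classpath and so on right may take some debugging, but I believe this is an easy problem to solve.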
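As a quick way to spot this kind of problem before submitting the job, here is a minimal Python sketch that lists which class-name properties are still unset or commented out in a properties file. The helper names `parse_properties` and `missing_keys` are hypothetical, and treating these two keys as required is an assumption based only on the stack trace and the configuration shown in the question, not on Camus documentation. The parser handles only simple `key=value` lines, not the full Java properties syntax.

```python
# Keys taken from the configuration in the question; whether Camus needs
# both of them in every setup is an assumption.
REQUIRED_KEYS = (
    "camus.message.decoder.class",
    "kafka.message.coder.schema.registry.class",
)


def parse_properties(text):
    """Return the active key/value pairs of a simple .properties text."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines whose first non-blank character is '#'
        # or '!' are comments in Java properties files.
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent, commented out, or empty."""
    props = parse_properties(text)
    return [key for key in required if not props.get(key)]


# Excerpt of the configuration from the question (both lines commented out).
sample = """\
kafka.whitelist.topics=lognew
  #camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder
 #kafka.message.coder.schema.registry.class=com.linkedin.camus.example.schemaregistry.DummySchemaRegistry
"""

for key in missing_keys(sample):
    print("not set:", key)
```

Run against the configuration from the question, this reports both keys as not set, which is consistent with the `NullPointerException` raised from `Class.forName` in the stack trace: a class name that was never configured is passed along as null.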

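Edit: in response to your comment, I have a few comments of my own.

I have never used Camus, so debugging the errors you get from it is not something I can do well, or perhaps at all. You will have to spend some time (possibly a few hours) researching and trying different things to get it working.

I doubt that the configuration value you are using is the one you need. Any option that starts with Dummy is probably not a valid configuration option.

Googling Camus and schema registry turned up some interesting links. Those are more likely to contain the correct configuration values you need. I am guessing, since I have never used Camus.

This link may also be useful to you. I don't know whether you have seen it. If you haven't, I'm fairly sure that googling the specific error you are getting is something you should do before coming to Stack Overflow.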

I uncommented the configuration value

kafka.message.coder.schema.registry.class=com.linkedin.camus.example.schemaregistry.DummySchemaRegistry

but got another error:

com.linkedin.camus.coders.MessageDecoderException: java.lang.InstantiationException: com.linkedin.camus.example.schemaregistry.DummySchemaRegistry
    at com.linkedin.camus.etl.kafka.coders.MessageDecoderFactory.createMessageDecoder(MessageDecoderFactory.java:28)
    at com.linkedin.camus.etl.kafka.mapred.EtlInputFormat.createMessageDecoder(EtlInputFormat.java:390)
    at com.linkedin.camus.etl.kafka.mapred.EtlInputFormat.getSplits(EtlInputFormat.java:264)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)