Encoding/format issues with the Python kafka library

Tags: python, apache-kafka, kafka-python

I've been trying to use this library for a while now, but I can't get a producer to work.

After a bit of research I found out that Kafka sends to consumers (and, I'm guessing, also expects from producers) an additional 5-byte header: one 0 byte, followed by a 4-byte schema id from the Schema Registry. I've managed to get the consumer working by simply stripping those first bytes.
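A minimal sketch of that header-stripping approach (assuming kafka-python's KafkaConsumer and the avro-python3 package; the topic name, broker address and schema file are placeholders):

from kafka import KafkaConsumer
import avro.io
import avro.schema
import io
import struct

schema = avro.schema.Parse(open('schema.avsc').read())
reader = avro.io.DatumReader(schema)
consumer = KafkaConsumer('topic', bootstrap_servers='kafka:9092')

for msg in consumer:
    # Confluent wire format: 1 magic byte (0) + 4-byte big-endian schema id
    magic, schema_id = struct.unpack('>bI', msg.value[:5])
    # the remaining bytes are plain Avro binary
    decoder = avro.io.BinaryDecoder(io.BytesIO(msg.value[5:]))
    print(reader.read(decoder))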

Am I supposed to prepend a similar header when writing the producer?

Here is the exception that comes up:

    [2016-09-14 13:32:48,684] ERROR Task hdfs-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
    org.apache.kafka.connect.errors.DataException: Failed to deserialize data to Avro:
    at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:109)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:357)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:226)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:170)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:142)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
    Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
I'm using the latest stable versions of both Kafka and the python kafka library.

EDIT

Consumer

Producer


Since you are reading with a BinaryDecoder and DatumReader, if you send the data the reverse way (using a DatumWriter and a BinaryEncoder), I suppose your messages will come through fine.

Something like this:

Producer

from kafka import KafkaProducer
import avro.schema
import io
from avro.io import DatumWriter, BinaryEncoder

producer = KafkaProducer(bootstrap_servers="hadoop-master")

# Kafka topic
topic = "hadoop_00"

# Path to the avro schema (avro.schema.parse is the Python 2 avro API;
# with avro-python3 it is avro.schema.Parse)
schema_path = "resources/f1.avsc"
schema = avro.schema.parse(open(schema_path).read())

# "range" shadows the builtin, so it is renamed here
value_range = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
datum_writer = DatumWriter(schema)
for i in value_range:
    # serialize each record to plain Avro binary in a fresh buffer
    byte_writer = io.BytesIO()
    datum_encoder = BinaryEncoder(byte_writer)
    datum_writer.write({"f1": "value_%d" % i}, datum_encoder)
    producer.send(topic, byte_writer.getvalue())

# send() is asynchronous; flush before exiting so nothing is dropped
producer.flush()
A few modifications I made are:

  • Use a DatumWriter with a BinaryEncoder
  • Instead of sending json I send a dictionary through the byte stream (you may want to check the code with a plain dictionary as well; it might also work, but I'm not sure)
  • Use the byte stream to send the message to the kafka topic (for me this sometimes failed; in that case I assigned the result of .getvalue() to a variable and passed that variable to producer.send, as sketched below. I don't know why it failed, but assigning to a variable always worked)
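Spelled out against the producer above, that last workaround is just an intermediate variable (nothing new beyond the names already used):

# materialize the serialized bytes first, then hand the variable to send()
raw_bytes = byte_writer.getvalue()
producer.send(topic, raw_bytes)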

I wasn't able to test the code I added, but this is the kind of code I wrote previously when working with avro. If it doesn't work for you, please let me know in the comments; it may just be my memory failing me. I'll update this answer with a working one once I get home and can test the code.

I was able to get a Python producer sending messages through to Kafka Connect via the Schema Registry:

...
import io
import avro.datafile
import avro.io
import avro.schema
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='kafka:9092')
with open('schema.avsc') as f:
    schema = avro.schema.Parse(f.read())

def post_message():
    bytes_writer = io.BytesIO()
    # Write the Confluent "Magic Byte" (a single 0 byte)
    bytes_writer.write(bytes([0]))
    # Should get or create the schema id with the Schema Registry;
    # the next 4 bytes hold that id, big-endian
    ...
    schema_id = 1
    bytes_writer.write(
        int.to_bytes(schema_id, 4, byteorder='big'))

    # and then the standard Avro bytes serialization
    writer = avro.io.DatumWriter(schema)
    encoder = avro.io.BinaryEncoder(bytes_writer)
    writer.write({'key': 'value'}, encoder)
    producer.send('topic', value=bytes_writer.getvalue())
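The registry lookup elided above ("should get or create the schema id") could look roughly like this sketch against the Confluent Schema Registry REST API; the registry URL and the subject name are assumptions, and any HTTP client works in place of requests:

import json
import requests

def register_schema(registry_url, subject, schema_str):
    # POST the schema under the subject; the registry responds with the
    # globally unique schema id that goes into the 4-byte header above
    resp = requests.post(
        '%s/subjects/%s/versions' % (registry_url, subject),
        data=json.dumps({'schema': schema_str}),
        headers={'Content-Type': 'application/vnd.schemaregistry.v1+json'})
    resp.raise_for_status()
    return resp.json()['id']

schema_id = register_schema('http://kafka:8081', 'topic-value',
                            open('schema.avsc').read())

Posting an identical schema again returns the existing id, so this is safe to run on every producer start.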
Documentation on the "Magic Byte":

Comments:

  • Please fill in the code for both the producer and the consumer. Having everything in one place would help a lot.
  • @thiruvenkadam, done, there you go!
  • Thanks for your help. Sadly, I tested it and got the same exception about this "magic byte". On top of that, I have now written a small producer in Java using the kafka java api and I get exactly the same error.
  • I had to update to Python 3.5 and use the avro-python3 library to make it work, thanks!