
Spark Python Avro Kafka Deserializer


I have created a Kafka stream in a Python Spark application that can parse any text coming through it:

            kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
I want to change it to parse Avro messages coming from a Kafka topic instead. When parsing Avro messages from a file, I do it like this:

            from avro.datafile import DataFileReader
            from avro.io import DatumReader

            # open in binary mode so the Avro container file is read correctly
            reader = DataFileReader(open("customer.avro", "rb"), DatumReader())
I am new to Python and Spark: how do I change the stream so it can parse Avro messages? And how can I specify the schema to use when reading Avro messages from Kafka? I have done all of this in Java before, but Python is confusing me.

Edit:

I tried including the Avro decoder as well:

            kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1},valueDecoder=avro.io.DatumReader(schema))
but I got the following error:

            TypeError: 'DatumReader' object is not callable
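The valueDecoder argument must be a plain callable that takes the raw message bytes and returns the decoded value; a DatumReader instance is not callable by itself. A minimal sketch of wrapping it, assuming the writer's schema sits in a hypothetical customer.avsc file and the messages were written with the plain Avro binary encoding (no container header):

            import io
            import avro.io
            import avro.schema

            # loading call is avro.schema.parse on the Python 2 avro package,
            # avro.schema.Parse on avro-python3
            schema = avro.schema.parse(open("customer.avsc").read())

            def avro_decoder(raw_bytes):
                # valueDecoder receives the raw bytes of one Kafka message
                decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes))
                return avro.io.DatumReader(schema).read(decoder)

            kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer",
                                                  {topic: 1}, valueDecoder=avro_decoder)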

I had the same challenge, deserializing Avro messages coming from Kafka in pyspark, and solved it with the MessageSerializer method of the Confluent Schema Registry module, as in our case the schema is stored in a Confluent Schema Registry.

You can find that module in the confluent-kafka-python package.


Obviously, as you can see, this code uses the new receiver-less direct approach, i.e. createDirectStream, which @Zoltan Fedor mentioned in the comments.

The answer provided above is now a bit old, as 2.5 years have passed since it was written. The confluent-kafka-python library has since evolved to support the same functionality natively. The only change needed in the given code is the following:

from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

# point the client at your Schema Registry; the URL is a placeholder
schema_registry_client = CachedSchemaRegistryClient(url='http://your-schema-registry:8081')
serializer = MessageSerializer(schema_registry_client)

Then you can change this line:

kvs = KafkaUtils.createDirectStream(ssc, ["mytopic"], {"metadata.broker.list": "xxxxx:9092,yyyyy:9092"}, valueDecoder=serializer.decode_message)

I have tested it and it works well. I am adding this answer for anyone who may need it in the future.
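For completeness, a minimal sketch of consuming the decoded records, assuming the kvs and ssc names from the snippet above (the pprint step is purely illustrative):

# each element is a (key, value) pair; the value is the decoded record
kvs.map(lambda pair: pair[1]).pprint()

ssc.start()
ssc.awaitTermination()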

If you would rather not use the Confluent Schema Registry and have your schema in a text file or a dict object, you can use the fastavro Python package to decode the Avro messages of a Kafka stream:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import io
import fastavro

def decoder(msg):
    # here should be your schema (the "..." values are placeholders)
    schema = {
      "namespace": "...",
      "type": "...",
      "name": "...",
      "fields": [
        {
          "name": "...",
          "type": "..."
        }
      ]
    }
    # schemaless_reader expects messages written with the plain Avro
    # binary encoding, i.e. without the container-file header
    bytes_io = io.BytesIO(msg)
    bytes_io.seek(0)
    msg_decoded = fastavro.schemaless_reader(bytes_io, schema)
    return msg_decoded

session = SparkSession.builder \
                      .appName("Kafka Spark Streaming Avro example") \
                      .getOrCreate()

streaming_context = StreamingContext(sparkContext=session.sparkContext,
                                     batchDuration=5)

kafka_stream = KafkaUtils.createDirectStream(ssc=streaming_context,
                                             topics=['your_topic_1', 'your_topic_2'],
                                             kafkaParams={"metadata.broker.list": "your_kafka_broker_1,your_kafka_broker_2"},
                                             valueDecoder=decoder)
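If the schema lives in an .avsc text file rather than a dict, fastavro can load it for you; a small sketch, assuming a hypothetical customer.avsc path:

from fastavro.schema import load_schema

# parses the .avsc JSON and resolves any named sub-schemas it references
schema = load_schema("customer.avsc")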

What error do you see?

The library you mentioned is a bit old now and does not seem to be maintained.

Ha, yes, I mentioned that library 2.5 years ago, when it was still "fresh". :-)