
Apache Spark: How do I get Avro data from Kafka (Confluent Schema Registry) as strings in PySpark?

Tags: apache-spark, apache-kafka, avro, spark-structured-streaming, confluent-schema-registry

I am reading data from Kafka in Spark (Structured Streaming), but the data coming from Kafka into Spark is not in string format. Spark: 2.3.4

Kafka data format:

{"Patient_ID":316,"Name":"Richa","MobileNo":{"long":7049123177},"BDate":{"int":740},"Gender":"female"}
Here is the Spark Structured Streaming code that reads from Kafka:

#  spark-submit --jars kafka-clients-0.10.0.1.jar --packages org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.4,org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 /home/kinjalpatel/kafka_sppark.py
import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
import json
from pyspark.sql.functions import from_json, col, struct
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from pyspark.sql.column import Column, _to_java_column

sc = SparkContext()
sc.setLogLevel("ERROR")
spark = SparkSession(sc)
schema_registry_client = CachedSchemaRegistryClient(
url='http://localhost:8081')
serializer = MessageSerializer(schema_registry_client)
df = spark.readStream.format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "mysql-01-Patient") \
  .option("partition.assignment.strategy", "range") \
  .option("valueConverter", "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter") \
  .load()
df.printSchema()
mta_stream=df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(timestamp AS STRING)", "CAST(timestampType AS STRING)")
mta_stream.printSchema()
qry = mta_stream.writeStream.outputMode("append").format("console").start()
qry.awaitTermination()
This is the output I get:

+----+--------------------+----------------+---------+------+--------------------+-------------+
| key|               value|           topic|partition|offset|           timestamp|timestampType|
+----+--------------------+----------------+---------+------+--------------------+-------------+
|null|�Richa���...       |mysql-01-Patient|        0|   160|2019-12-27 11:56:...|            0|
+----+--------------------+----------------+---------+------+--------------------+-------------+
How do I get the value column in string format?

From the Spark Avro documentation:

import org.apache.spark.sql.avro._
import org.apache.avro.SchemaBuilder

// When reading the key and value of a Kafka topic, decode the
// binary (Avro) data into structured data.
// The schema of the resulting DataFrame is: <key: string, value: int>
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", "t")
  .load()
  .select(
    from_avro($"key", SchemaBuilder.builder().stringType()).as("key"),
    from_avro($"value", SchemaBuilder.builder().intType()).as("value"))

For reading Avro messages from a Kafka topic and parsing them in PySpark Structured Streaming, there is no equivalent direct library. But we can write a small wrapper that reads/parses the Avro message and call that function as a UDF in the PySpark streaming code; a minimal sketch of that approach is given below.

See:
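Below is a minimal sketch of that wrapper-as-UDF idea (it is not part of the original answer). It reuses the CachedSchemaRegistryClient and MessageSerializer imports and the df stream from the question; make_avro_to_json_udf and registry_url are illustrative names, and it assumes the confluent-kafka[avro] package is installed on the executors.

import json
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

def make_avro_to_json_udf(registry_url):
    # Do not create the Schema Registry client on the driver; build it lazily,
    # once per Python worker, the first time the UDF runs.
    state = {}

    def avro_to_json(value_bytes):
        if value_bytes is None:
            return None
        if "serializer" not in state:
            client = CachedSchemaRegistryClient(url=registry_url)
            state["serializer"] = MessageSerializer(client)
        # decode_message() reads the 5-byte Confluent header, looks up the
        # writer schema in the registry and returns a Python dict.
        record = state["serializer"].decode_message(bytes(value_bytes))
        return json.dumps(record)

    return udf(avro_to_json, StringType())

avro_to_json = make_avro_to_json_udf("http://localhost:8081")

decoded = df.select(avro_to_json(col("value")).alias("value_str"))
json_query = decoded.writeStream.outputMode("append").format("console").start()
json_query.awaitTermination()

With something along these lines the console sink should print each record as a JSON string instead of raw Avro bytes.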


Clearly you are getting something, although with that formatting it is hard to tell what. Could you be more specific about which part looks different from what you expected (what you see vs. what you expect)? Are you using the Confluent Schema Registry? Are the records Avro-encoded (with or without the Schema Registry)?

Yes, the data is Avro-encoded @JacekLaskowski. I need to get the data from Kafka into Spark in a readable format @Dennisjaherud.

To see the message you need to deserialize the value field. You can use Confluent's KafkaAvroDeserializer.

That is only available on the Databricks platform, not in any plain Spark consumer. I am. You can find the same issue on the spark-avro GitHub page. Also, the spark-avro release for Spark 2.4 does not support the Confluent Schema Registry, so I suggest you try your answer before copying from the documentation. Where does your answer use the Schema Registry?

For reference, the example from the Spark documentation being discussed:
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.avro._

// `from_avro` requires the Avro schema in JSON string format.
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()


val output = df
  .select(from_avro('value, jsonFormatSchema) as 'user)
  .where("user.favorite_color == \"red\"")
  .select(to_avro($"user.name") as 'value)

val query = output
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic2")
  .start()
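As the comments above point out, these documentation examples expect plain Avro payloads, whereas Confluent-serialized messages start with a 5-byte header (one magic byte plus a 4-byte schema id) that from_avro cannot parse. Here is a hedged PySpark sketch of one workaround, assuming Spark 3.0+ (which exposes from_avro in pyspark.sql.avro.functions) and that every message on the topic was written with the same schema; value_schema_json is an illustrative placeholder for the writer schema.

from pyspark.sql.functions import expr
from pyspark.sql.avro.functions import from_avro  # available since Spark 3.0

# Writer schema as a JSON string. In practice you would fetch it from the
# registry (e.g. via CachedSchemaRegistryClient.get_latest_schema); this is
# only a placeholder.
value_schema_json = "..."

# Confluent wire format: 1 magic byte + 4-byte schema id, then the Avro body.
# Strip those 5 bytes before handing the payload to from_avro.
avro_payload = expr("substring(value, 6, length(value) - 5)")
decoded = df.select(from_avro(avro_payload, value_schema_json).alias("patient"))

Because the per-message schema id is discarded, this only works when the topic effectively uses a single schema; otherwise a registry-aware wrapper like the UDF sketch above is the safer route.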