Pyspark 在spark流媒体/结构化流媒体中阅读来自卡夫卡的avro消息
我第一次使用pyspark。 Spark版本:2.3.0 卡夫卡版本:2.2.0 我有一个kafka制作人,它以avro格式发送嵌套数据,我正在尝试用pyspark编写spark流/结构化流的代码,它将来自kafka的avro反序列化为dataframe,并将其以拼花格式写入s3。 我能够在spark/scala中找到avro转换器,但pyspark中的支持尚未添加。如何在pyspark中转换相同的值。Pyspark 在spark流媒体/结构化流媒体中阅读来自卡夫卡的avro消息,pyspark,apache-kafka,spark-streaming,spark-structured-streaming,spark-streaming-kafka,Pyspark,Apache Kafka,Spark Streaming,Spark Structured Streaming,Spark Streaming Kafka,我第一次使用pyspark。 Spark版本:2.3.0 卡夫卡版本:2.2.0 我有一个kafka制作人,它以avro格式发送嵌套数据,我正在尝试用pyspark编写spark流/结构化流的代码,它将来自kafka的avro反序列化为dataframe,并将其以拼花格式写入s3。 我能够在spark/scala中找到avro转换器,但pyspark中的支持尚未添加。如何在pyspark中转换相同的值。 谢谢。正如您所提到的,阅读卡夫卡的Avro消息并通过pyspark进行解析,没有直接的库。但
谢谢。正如您所提到的,阅读卡夫卡的Avro消息并通过pyspark进行解析,没有直接的库。但我们可以通过编写小包装器来读取/解析Avro消息,并在pyspark流代码中将该函数作为UDF调用,如下所示 参考资料: 注意:Avro是自Spark 2.4以来的内置但外部数据源模块。请按照“ApacheAvro数据源指南”的部署部分部署应用程序 参考:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Spark Streaming context :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss" ,"false")\
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"MCI-CIL")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "changeit") \
.option("kafka.sasl.kerberos.keytab","/path/bdpda.headless.keytab") \
.option("kafka.sasl.kerberos.principal","bdpda") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
df1.registerTempTable("test")
# Deserilzing the Avro code function
from pyspark.sql.column import Column, _to_java_column
def from_avro(col):
jsonFormatSchema = """
{
"type": "record",
"name": "struct",
"fields": [
{"name": "col1", "type": "long"},
{"name": "col2", "type": "string"}
]
}"""
sc = SparkContext._active_spark_context
avro = sc._jvm.org.apache.spark.sql.avro
f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
return Column(f(_to_java_column(col), jsonFormatSchema))
spark.udf.register("JsonformatterWithPython", from_avro)
squared_udf = udf(from_avro)
df1 = spark.table("test")
df2 = df1.select(squared_udf("value"))
# Declaring the Readstream Schema DataFrame :
df2.coalesce(1).writeStream \
.format("parquet") \
.option("checkpointLocation","/path/chk31") \
.outputMode("append") \
.start("/path/stream/tgt31")
ssc.awaitTermination()
Spark提交:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Spark Streaming context :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss" ,"false")\
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"MCI-CIL")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "changeit") \
.option("kafka.sasl.kerberos.keytab","/path/bdpda.headless.keytab") \
.option("kafka.sasl.kerberos.principal","bdpda") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
df1.registerTempTable("test")
# Deserilzing the Avro code function
from pyspark.sql.column import Column, _to_java_column
def from_avro(col):
jsonFormatSchema = """
{
"type": "record",
"name": "struct",
"fields": [
{"name": "col1", "type": "long"},
{"name": "col2", "type": "string"}
]
}"""
sc = SparkContext._active_spark_context
avro = sc._jvm.org.apache.spark.sql.avro
f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
return Column(f(_to_java_column(col), jsonFormatSchema))
spark.udf.register("JsonformatterWithPython", from_avro)
squared_udf = udf(from_avro)
df1 = spark.table("test")
df2 = df1.select(squared_udf("value"))
# Declaring the Readstream Schema DataFrame :
df2.coalesce(1).writeStream \
.format("parquet") \
.option("checkpointLocation","/path/chk31") \
.outputMode("append") \
.start("/path/stream/tgt31")
ssc.awaitTermination()
[调整软件包版本以匹配基于spark/avro版本的安装]
/usr/hdp/2.6.1.0-129/spark2/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3 --conf spark.ui.port=4064
Pyspark流媒体代码:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Spark Streaming context :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss" ,"false")\
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"MCI-CIL")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "changeit") \
.option("kafka.sasl.kerberos.keytab","/path/bdpda.headless.keytab") \
.option("kafka.sasl.kerberos.principal","bdpda") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
df1.registerTempTable("test")
# Deserilzing the Avro code function
from pyspark.sql.column import Column, _to_java_column
def from_avro(col):
jsonFormatSchema = """
{
"type": "record",
"name": "struct",
"fields": [
{"name": "col1", "type": "long"},
{"name": "col2", "type": "string"}
]
}"""
sc = SparkContext._active_spark_context
avro = sc._jvm.org.apache.spark.sql.avro
f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
return Column(f(_to_java_column(col), jsonFormatSchema))
spark.udf.register("JsonformatterWithPython", from_avro)
squared_udf = udf(from_avro)
df1 = spark.table("test")
df2 = df1.select(squared_udf("value"))
# Declaring the Readstream Schema DataFrame :
df2.coalesce(1).writeStream \
.format("parquet") \
.option("checkpointLocation","/path/chk31") \
.outputMode("append") \
.start("/path/stream/tgt31")
ssc.awaitTermination()
正如您所提到的,从卡夫卡读取Avro消息并通过pyspark进行解析,并没有直接的库。但我们可以通过编写小包装器来读取/解析Avro消息,并在pyspark流代码中将该函数作为UDF调用,如下所示 参考资料: 注意:Avro是自Spark 2.4以来的内置但外部数据源模块。请按照“ApacheAvro数据源指南”的部署部分部署应用程序 参考:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Spark Streaming context :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss" ,"false")\
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"MCI-CIL")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "changeit") \
.option("kafka.sasl.kerberos.keytab","/path/bdpda.headless.keytab") \
.option("kafka.sasl.kerberos.principal","bdpda") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
df1.registerTempTable("test")
# Deserilzing the Avro code function
from pyspark.sql.column import Column, _to_java_column
def from_avro(col):
jsonFormatSchema = """
{
"type": "record",
"name": "struct",
"fields": [
{"name": "col1", "type": "long"},
{"name": "col2", "type": "string"}
]
}"""
sc = SparkContext._active_spark_context
avro = sc._jvm.org.apache.spark.sql.avro
f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
return Column(f(_to_java_column(col), jsonFormatSchema))
spark.udf.register("JsonformatterWithPython", from_avro)
squared_udf = udf(from_avro)
df1 = spark.table("test")
df2 = df1.select(squared_udf("value"))
# Declaring the Readstream Schema DataFrame :
df2.coalesce(1).writeStream \
.format("parquet") \
.option("checkpointLocation","/path/chk31") \
.outputMode("append") \
.start("/path/stream/tgt31")
ssc.awaitTermination()
Spark提交:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Spark Streaming context :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss" ,"false")\
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"MCI-CIL")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "changeit") \
.option("kafka.sasl.kerberos.keytab","/path/bdpda.headless.keytab") \
.option("kafka.sasl.kerberos.principal","bdpda") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
df1.registerTempTable("test")
# Deserilzing the Avro code function
from pyspark.sql.column import Column, _to_java_column
def from_avro(col):
jsonFormatSchema = """
{
"type": "record",
"name": "struct",
"fields": [
{"name": "col1", "type": "long"},
{"name": "col2", "type": "string"}
]
}"""
sc = SparkContext._active_spark_context
avro = sc._jvm.org.apache.spark.sql.avro
f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
return Column(f(_to_java_column(col), jsonFormatSchema))
spark.udf.register("JsonformatterWithPython", from_avro)
squared_udf = udf(from_avro)
df1 = spark.table("test")
df2 = df1.select(squared_udf("value"))
# Declaring the Readstream Schema DataFrame :
df2.coalesce(1).writeStream \
.format("parquet") \
.option("checkpointLocation","/path/chk31") \
.outputMode("append") \
.start("/path/stream/tgt31")
ssc.awaitTermination()
[调整软件包版本以匹配基于spark/avro版本的安装]
/usr/hdp/2.6.1.0-129/spark2/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3 --conf spark.ui.port=4064
Pyspark流媒体代码:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Spark Streaming context :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss" ,"false")\
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"MCI-CIL")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "changeit") \
.option("kafka.sasl.kerberos.keytab","/path/bdpda.headless.keytab") \
.option("kafka.sasl.kerberos.principal","bdpda") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
df1.registerTempTable("test")
# Deserilzing the Avro code function
from pyspark.sql.column import Column, _to_java_column
def from_avro(col):
jsonFormatSchema = """
{
"type": "record",
"name": "struct",
"fields": [
{"name": "col1", "type": "long"},
{"name": "col2", "type": "string"}
]
}"""
sc = SparkContext._active_spark_context
avro = sc._jvm.org.apache.spark.sql.avro
f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
return Column(f(_to_java_column(col), jsonFormatSchema))
spark.udf.register("JsonformatterWithPython", from_avro)
squared_udf = udf(from_avro)
df1 = spark.table("test")
df2 = df1.select(squared_udf("value"))
# Declaring the Readstream Schema DataFrame :
df2.coalesce(1).writeStream \
.format("parquet") \
.option("checkpointLocation","/path/chk31") \
.outputMode("append") \
.start("/path/stream/tgt31")
ssc.awaitTermination()
为什么您需要df5和df1是相同的东西?Alsop,正如它对您链接的答案所做的评论,这对Confluent Schema Registry Avro Database不起作用。感谢您分享您的评论。你是对的!!df5不是必需的,并更正了该代码。正如在问题中没有提到的,关于confluent registry给定了通用Avro格式处理的codeRight,但这很少是Kafka中的内容,根据我的经验,请注意,这段代码,我已经在Cloudra环境中用经典的Apache KafkaOkay实现了,很好,那么您没有使用模式注册表?Confluent只使用ApacheKafka…为什么需要df5和df1是相同的东西?Alsop,正如它对您链接的答案所做的评论,这对Confluent Schema Registry Avro Database不起作用。感谢您分享您的评论。你是对的!!df5不是必需的,并更正了该代码。正如在问题中没有提到的,关于confluent registry给定了通用Avro格式处理的codeRight,但这很少是Kafka中的内容,根据我的经验,请注意,这段代码,我已经在Cloudra环境中用经典的Apache KafkaOkay实现了,很好,那么您没有使用模式注册表?Confluent只使用ApacheKafka。。。