Spark Structured Streaming application reading from Kafka returns only null values

Tags: apache-spark, pyspark, apache-kafka, spark-structured-streaming

I plan to use Spark Structured Streaming to pull data from Kafka, but all I get is null data.

# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_csv, from_json
from pyspark.sql.types import StringType, StructType

if __name__ == '__main__':

    spark = SparkSession \
        .builder \
        .appName("Pyspark_structured_streaming_kafka") \
        .getOrCreate()

    df_raw = spark.read \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "52.81.249.81:9092") \
        .option("subscribe", "product") \
        .option("kafka.ssl.endpoint.identification.algorithm", "") \
        .option("kafka.isolation.level", "read_committed") \
        .load()

    df_raw.printSchema()

    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", StringType()) \
        .add("yield_time", StringType())

    df_1 = df_raw.selectExpr("CAST(value AS STRING)") \
        .select(from_json("value", product_schema).alias("data")) \
        .select("data.*") \
        .write \
        .format("console") \
        .save()
My test data looks like this:

{
    "product_name": "X Laptop",
    "product_factory": "B-3231",
    "yield_num": 899,
    "yield_time": "20210201 22:00:01"
}
But the result was not what I expected:

./spark-submit ~/Documents/3-Playground/kbatch.py
+------------+---------------+---------+----------+
|product_name|product_factory|yield_num|yield_time|
+------------+---------------+---------+----------+
|        null|           null|     null|      null|
|        null|           null|     null|      null|
The test data is published with the following command:

./kafka-producer-perf-test.sh --topic product --num-records 90000000 --throughput 5 --producer.config ../config/producer.properties --payload-file ~/Downloads/product.json
If I cut out some of the code, like this:

df_1 = df_raw.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .option("checkpointLocation", "file:///Users/picomy/Kafka-Output/checkpoint") \
    .start() \
    .awaitTermination()
then the result is as follows:

Batch: 3130
-------------------------------------------
+--------------------+
|               value|
+--------------------+
|    "yield_time":...|
|    "product_name...|
|    "yield_num": ...|
|    "product_fact...|
|    "yield_num": ...|
|    "yield_num": ...|
|    "product_fact...|
|    "product_fact...|
|    "product_name...|
|    "product_fact...|
|    "product_name...|
|                   }|
|    "yield_time":...|
|    "product_name...|
|                   }|
|    "product_fact...|
|    "yield_num": ...|
|    "product_fact...|
|    "yield_time":...|
|    "product_name...|
+--------------------+

I don't know where the root cause of the problem lies.

There are a few things preventing your code from working properly:

  • Wrong schema (the field yield_num is an integer/long)
  • Using writeStream instead of plain write (if you want streaming)
  • Starting the streaming query and awaiting its termination (start / awaitTermination)
  • The data in the JSON file should be stored on a single line only (see the sketch after this list)
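
The last point explains the fragment rows in your console output: as far as I know, kafka-producer-perf-test.sh splits --payload-file into individual records on a delimiter that defaults to a newline, so a pretty-printed JSON file is sent as many one-line fragments. The following minimal local sketch (the local session, sample strings, and reduced two-field schema are mine, purely for illustration) shows what from_json does with such a fragment:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StringType, StructType, LongType

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("from_json_null_demo") \
    .getOrCreate()

# Reduced two-field schema, just for this demo
schema = StructType() \
    .add("product_name", StringType()) \
    .add("yield_num", LongType())

df = spark.createDataFrame(
    [('{"product_name": "X Laptop", "yield_num": 899}',),  # whole record on one line
     ('    "yield_num": 899,',)],                           # one line of a pretty-printed file
    ["value"])

# The first row parses; the unparseable fragment yields all-null columns
df.select(from_json("value", schema).alias("data")).select("data.*").show()

So product.json should keep each record on a single line, e.g. {"product_name": "X Laptop", "product_factory": "B-3231", "yield_num": 899, "yield_time": "20210201 22:00:01"}.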
You can replace part of your code with the following snippet:

from pyspark.sql.types import StringType, StructType, LongType

    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", LongType()) \
        .add("yield_time", StringType()) 

    df_1=df_raw.selectExpr("CAST(value AS STRING)") \
               .select(from_json("value",product_schema).alias("data")) \
               .select("data.*") \
               .writeStream \
               .format("console") \
               .start() \
               .awaitTermination()
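
One caveat the snippet leaves implicit: df_raw in the question is built with spark.read, which produces a batch DataFrame, and writeStream only exists on streaming DataFrames, so the source side has to use spark.readStream. A minimal end-to-end sketch of the corrected job (broker address, topic, schema, and checkpoint path taken from the question; treat this as an illustration rather than a tested deployment):

# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StringType, StructType, LongType

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("Pyspark_structured_streaming_kafka") \
        .getOrCreate()

    # readStream (not read): writeStream requires a streaming DataFrame
    df_raw = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "52.81.249.81:9092") \
        .option("subscribe", "product") \
        .load()

    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", LongType()) \
        .add("yield_time", StringType())

    df_raw.selectExpr("CAST(value AS STRING)") \
        .select(from_json("value", product_schema).alias("data")) \
        .select("data.*") \
        .writeStream \
        .format("console") \
        .outputMode("append") \
        .option("checkpointLocation", "file:///Users/picomy/Kafka-Output/checkpoint") \
        .start() \
        .awaitTermination()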