Apache Spark pyspark.sql.utils.AnalysisException: Failed to find data source: kafka


I am trying to read a stream from Kafka using pyspark. I am using Spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this, I simply start zookeeper and kafka and create a new topic:

/usr/local/kafka/bin/zookeeper-server-start.sh /usr/local/kafka/config/zookeeper.properties 
/usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties
/usr/local/kafka/bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic data_wm
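
Before involving Spark at all, it can help to confirm that the broker and the new topic are actually reachable. The following is a sketch of my own, not part of the original setup, and assumes the kafka-python package is installed (pip install kafka-python):

# Hypothetical sanity check (assumes kafka-python is installed):
# send one test record to the freshly created data_wm topic.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("data_wm", b"hello from a test producer")
producer.flush()   # make sure the record is delivered before exiting
producer.close()

If this raises a connection error, the problem is with the Kafka setup itself rather than with Spark.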
Here is my code:

import pandas as pd
import os
import findspark
findspark.init("/usr/local/spark")  # point findspark at the local Spark installation
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestApp").getOrCreate()
# Subscribe to the data_wm topic as a streaming source
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "data_wm") \
  .load()
# Cast the raw Kafka key/value bytes to strings
value = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
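
The snippet above only defines the streaming DataFrame; nothing actually runs until a streaming query is started. As a rough continuation of that code (my addition, not part of the original question), the selected columns could be written to the console once the Kafka source resolves:

# Continue from the `value` DataFrame above: start a console sink query
# and block until it is stopped (assumes the Kafka source loads correctly).
query = value \
  .writeStream \
  .format("console") \
  .outputMode("append") \
  .start()
query.awaitTermination()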
Here is how I run the script:

sudo --preserve-env=pyspark /usr/local/spark/bin/pyspark --packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.0.0-preview

As a result of this command I get:

:: resolving dependencies :: org.apache.spark#spark-submit-parent-0d7b2a8d-a860-4766-a4c7-141a902d8365;1.0
        confs: [default]
        found org.apache.spark#spark-streaming-kafka-0-10_2.12;3.0.0-preview in central
        found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.0.0-preview in central
        found org.apache.kafka#kafka-clients;2.3.1 in central
        found com.github.luben#zstd-jni;1.4.3-1 in central
        found org.lz4#lz4-java;1.6.0 in central
        found org.xerial.snappy#snappy-java;1.1.7.3 in central
        found org.slf4j#slf4j-api;1.7.16 in central
        found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 380ms :: artifacts dl 7ms
        :: modules in use:
        com.github.luben#zstd-jni;1.4.3-1 from central in [default]
        org.apache.kafka#kafka-clients;2.3.1 from central in [default]
        org.apache.spark#spark-streaming-kafka-0-10_2.12;3.0.0-preview from central in [default]
        org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.0.0-preview from central in [default]
        org.lz4#lz4-java;1.6.0 from central in [default]
        org.slf4j#slf4j-api;1.7.16 from central in [default]
        org.spark-project.spark#unused;1.0.0 from central in [default]
        org.xerial.snappy#snappy-java;1.1.7.3 from central in [default]
But I always get this error:

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "data_wm") \
  .load()
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/local/spark/python/pyspark/sql/streaming.py", line 406, in load
    return self._df(self._jreader.load())
  File "/usr/local/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 102, in deco
    raise converted
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".


I don't know the cause of this error. Please help.

I have successfully resolved this error on Spark 3.0.1 (using PySpark).

I would keep it simple and provide the required package via the --packages argument:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 MyPythonScript.py
Note the order of the arguments, otherwise it will throw an error.

where MyPythonScript.py has:

from pyspark.sql import SparkSession

KAFKA_TOPIC = "data"
KAFKA_SERVER = "localhost:9092"

# creating an instance of SparkSession
spark_session = SparkSession \
    .builder \
    .appName("Python Spark create RDD") \
    .getOrCreate()

# Subscribe to 1 topic
df = spark_session \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_SERVER) \
    .option("subscribe", KAFKA_TOPIC) \
    .load()
print(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))
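
As a side note, the same dependency can also be declared from inside the script instead of on the spark-submit command line. This is a sketch of my own, not something from the answer above, and relies on spark.jars.packages being set before the session (and its JVM) is created:

# Alternative sketch (my assumption, not from the original answer): pull the
# Kafka connector in via spark.jars.packages when building the session, so a
# plain `python MyPythonScript.py` run also resolves the kafka data source.
from pyspark.sql import SparkSession

spark_session = SparkSession \
    .builder \
    .appName("Python Spark create RDD") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
    .getOrCreate()

This only takes effect if no SparkContext is already running when the script starts.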
Unrelated - why are you running pyspark with sudo? And have you actually installed a Spark 3 preview build?
Even when I don't use sudo I have the same problem, and yes, Spark 3.0.0-preview2 is installed on my laptop.
Do you have the same problem without the preview version?
No, even with the preview version I have the same problem.