
How to use PySpark Structured Streaming + Kafka

Tags: pyspark, apache-kafka, spark-structured-streaming

I'm trying to use Spark Structured Streaming with Kafka, and a problem occurs when I run spark-submit: the consumer still receives data from the producer, but Spark Structured Streaming reports an error. Please help me find the problem in my code. Below is my code in test.py:

from kafka import KafkaProducer
from kafka import KafkaConsumer
from pyspark.sql import SparkSession
import random

spark = SparkSession.builder.appName('stream_test').getOrCreate()

# Produce 100 random values to the 'test' topic.
producer = KafkaProducer(bootstrap_servers=["localhost:9092"])
for i in range(0, 100):
    lg_value = str(random.uniform(5000, 10000))
    producer.send(topic='test', value=bytes(lg_value, encoding='utf-8'))
    producer.flush()

# Read the same topic back as a streaming DataFrame.
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test").load()
# Kafka delivers key/value as binary; cast both to strings.
df_to_string = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
# Note: the stream is only defined here; no query is started yet.
print("done")
When I run:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 test.py

the terminal output is:

> 20/07/12 19:39:09 INFO Executor: Starting executor ID driver on host
> 192.168.31.129 20/07/12 19:39:09 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on
> port 38885. 20/07/12 19:39:09 INFO NettyBlockTransferService: Server
> created on 192.168.31.129:38885 20/07/12 19:39:09 INFO BlockManager:
> Using org.apache.spark.storage.RandomBlockReplicationPolicy for block
> replication policy 20/07/12 19:39:09 INFO BlockManagerMaster:
> Registering BlockManager BlockManagerId(driver, 192.168.31.129, 38885,
> None) 20/07/12 19:39:09 INFO BlockManagerMasterEndpoint: Registering
> block manager 192.168.31.129:38885 with 413.9 MiB RAM,
> BlockManagerId(driver, 192.168.31.129, 38885, None) 20/07/12 19:39:09
> INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, 192.168.31.129, 38885, None) 20/07/12 19:39:09
> INFO BlockManager: Initialized BlockManager: BlockManagerId(driver,
> 192.168.31.129, 38885, None) 20/07/12 19:39:11 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of
> spark.sql.warehouse.dir ('file:/home/thoaint2/spark-warehouse').
> 20/07/12 19:39:11 INFO SharedState: Warehouse path is
> 'file:/home/thoaint2/spark-warehouse'. Traceback (most recent call
> last):   File "/home/thoaint2/test.py", line 15, in <module>
>     df = spark.readStream.format("kafka").option('kafka.bootstrap.servers','localhost:9092')
> \   File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 420, in load   File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
> line 1304, in __call__   File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 131, in deco   File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value py4j.protocol.Py4JJavaError: An error
> occurred while calling o31.load. : java.lang.NoClassDefFoundError:
> org/apache/kafka/common/serialization/ByteArraySerializer     at
> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:557)
>   at
> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
>   at
> org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:325)
>   at
> org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:70)
>   at
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:220)
>   at
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:112)
>   at
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:112)
>   at
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:35)
>   at
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:205)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)     at
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)  at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)    at
> py4j.Gateway.invoke(Gateway.java:282)     at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)   at
> py4j.GatewayConnection.run(GatewayConnection.java:238)    at
> java.lang.Thread.run(Thread.java:748) Caused by:
> java.lang.ClassNotFoundException:
> org.apache.kafka.common.serialization.ByteArraySerializer     at
> java.net.URLClassLoader.findClass(URLClassLoader.java:382)

You need to add the kafka-clients JAR to your --packages.
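For example, a spark-submit invocation that pulls in kafka-clients explicitly might look like the line below (kafka-clients 2.4.1 is, as far as I know, the version Spark 3.0.0 builds against; adjust it to match your setup):

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,org.apache.kafka:kafka-clients:2.4.1 test.py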

Also note that Spark itself can act as the producer, so you don't need a separate Python Kafka library.
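As a rough sketch of that idea (not code from the original post; it reuses the question's topic and broker, and it still needs the Kafka package on the classpath), the kafka-python producer loop could be replaced with Spark's own Kafka sink:

from pyspark.sql import SparkSession
import random

spark = SparkSession.builder.appName('stream_test').getOrCreate()

# Build the same 100 random values as the question, as a one-column DataFrame.
rows = [(str(random.uniform(5000, 10000)),) for _ in range(100)]
df = spark.createDataFrame(rows, ['value'])

# The Kafka sink expects the payload in a 'value' column (string or binary).
df.write \
    .format('kafka') \
    .option('kafka.bootstrap.servers', 'localhost:9092') \
    .option('topic', 'test') \
    .save()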

If you simply want to process Kafka streams without the JVM, have a look at Faust.
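A minimal Faust consumer for the same topic might look like this (the app name and options here are illustrative assumptions, not part of the original answer):

import faust

# App id and topic mirror the question's setup.
app = faust.App('stream_test', broker='kafka://localhost:9092')
topic = app.topic('test', value_type=bytes)

@app.agent(topic)
async def process(stream):
    # Messages arrive as raw bytes; decode them back to strings.
    async for value in stream:
        print(value.decode('utf-8'))

# Run with: faust -A <your_module> worker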