How to use PySpark Structured Streaming + Kafka
I am trying to use Spark Structured Streaming with Kafka. When I run spark-submit, the consumer still receives data from the producer, but the Structured Streaming job fails with an error. Please help me find the problem in my code. Below is my code in test.py:
from kafka import KafkaProducer
from pyspark.sql import SparkSession
import random

spark = SparkSession.builder.appName('stream_test').getOrCreate()

# Produce 100 random values to the 'test' topic
producer = KafkaProducer(bootstrap_servers=["localhost:9092"])
for i in range(0, 100):
    lg_value = str(random.uniform(5000, 10000))
    producer.send(topic='test', value=bytes(lg_value, encoding='utf-8'))
producer.flush()

# Read the same topic back as a streaming DataFrame
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test").load()
df_to_string = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print("done")
When I run:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 test.py

the terminal outputs:
> 20/07/12 19:39:09 INFO Executor: Starting executor ID driver on host
> 192.168.31.129 20/07/12 19:39:09 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on
> port 38885. 20/07/12 19:39:09 INFO NettyBlockTransferService: Server
> created on 192.168.31.129:38885 20/07/12 19:39:09 INFO BlockManager:
> Using org.apache.spark.storage.RandomBlockReplicationPolicy for block
> replication policy 20/07/12 19:39:09 INFO BlockManagerMaster:
> Registering BlockManager BlockManagerId(driver, 192.168.31.129, 38885,
> None) 20/07/12 19:39:09 INFO BlockManagerMasterEndpoint: Registering
> block manager 192.168.31.129:38885 with 413.9 MiB RAM,
> BlockManagerId(driver, 192.168.31.129, 38885, None) 20/07/12 19:39:09
> INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, 192.168.31.129, 38885, None) 20/07/12 19:39:09
> INFO BlockManager: Initialized BlockManager: BlockManagerId(driver,
> 192.168.31.129, 38885, None) 20/07/12 19:39:11 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of
> spark.sql.warehouse.dir ('file:/home/thoaint2/spark-warehouse').
> 20/07/12 19:39:11 INFO SharedState: Warehouse path is
> 'file:/home/thoaint2/spark-warehouse'. Traceback (most recent call
> last): File "/home/thoaint2/test.py", line 15, in <module>
> df = spark.readStream.format("kafka").option('kafka.bootstrap.servers','localhost:9092')
> \ File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 420, in load File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
> line 1304, in __call__ File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 131, in deco File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value py4j.protocol.Py4JJavaError: An error
> occurred while calling o31.load. : java.lang.NoClassDefFoundError:
> org/apache/kafka/common/serialization/ByteArraySerializer at
> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:557)
> at
> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
> at
> org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:325)
> at
> org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:70)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:220)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:112)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:112)
> at
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:35)
> at
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:205)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498) at
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
> py4j.Gateway.invoke(Gateway.java:282) at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79) at
> py4j.GatewayConnection.run(GatewayConnection.java:238) at
> java.lang.Thread.run(Thread.java:748) Caused by:
> java.lang.ClassNotFoundException:
> org.apache.kafka.common.serialization.ByteArraySerializer at
> java.net.URLClassLoader.findClass(URLClassLoader.java:382)
You need to add the kafka-clients JAR to your --packages.
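For example, the submit command might look like the following. The kafka-clients version here is an assumption; match it to the version your spark-sql-kafka package depends on (2.4.1 for Spark 3.0.0):

```shell
# Adding kafka-clients alongside the connector puts
# org.apache.kafka.common.serialization.ByteArraySerializer on the classpath,
# which is the class the NoClassDefFoundError complains about.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,org.apache.kafka:kafka-clients:2.4.1 \
  test.py
```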
Note also that Spark can act as a Kafka producer itself, so you don't need a separate Python Kafka library.
If you just want to process a Kafka stream without the JVM, take a look at Faust.