UnsupportedOperationException when streaming data through pyspark

Tags: Java, Apache Spark, Pyspark, Databricks, Py4j

I'm using this simple piece of code to read a stream of JSON files from a directory. The code works fine in a Databricks notebook, but throws an error when run locally. I use databricks-connect (version 8.1) to connect to the cluster and run the script.

from pyspark.sql.types import StructType
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProcessSensorData").getOrCreate()

# Schema for the incoming sensor JSON files
userschema = StructType().add("ID", "string").add("Created", "string") \
    .add("Data", "string").add("DeviceID", "string").add("Size", "string")

# Stream JSON files from the mounted directory
df = spark.readStream.schema(userschema).json("dbfs:/mnt/")

# Write the stream out as parquet, with a checkpoint location
df.writeStream.format("parquet") \
    .option("checkpointLocation", "dbfs:/mnt/parquet/demo_checkpoint1") \
    .option("path", "dbfs:/mnt/parquet/demo_parquet1") \
    .start()
The code above runs fine locally when I use "read" instead of "readStream". I've tried reading the stream in different ways (with options and with format), and I've also verified my connection to the Databricks cluster. I have pyspark version 3.1.1 and Java 8.
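For comparison, a minimal sketch of the batch variant that works locally, reusing the spark session and userschema from the script above:

# Batch read over the same mount point: this succeeds locally over
# databricks-connect, while the readStream equivalent raises the exception
df = spark.read.schema(userschema).json("dbfs:/mnt/")
df.show()

With readStream, I keep getting the following error: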

21/04/21 09:10:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/21 09:10:45 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
Traceback (most recent call last):
  File "/Users/dir/spark_process.py", line 6, in <module>
    df = spark.readStream.schema(userschema).json("dbfs:/mnt/")
  File "/Users/dir/venv/lib/python3.9/site-packages/pyspark/sql/streaming.py", line 631, in json
    return self._df(self._jreader.json(path))
  File "/Users/dir/venv/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/Users/dir/venv/lib/python3.9/site-packages/pyspark/sql/utils.py", line 110, in deco
    return f(*a, **kw)
  File "/Users/dir/venv/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o31.json.
: java.lang.UnsupportedOperationException
    at com.databricks.sql.transaction.directory.DirectoryAtomicReadProtocol$.filterDirectoryListing(DirectoryAtomicReadProtocol.scala:28)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.listLeafFiles(InMemoryFileIndex.scala:375)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.$anonfun$bulkListLeafFiles$2(InMemoryFileIndex.scala:282)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:274)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:139)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:102)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:74)
    at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:620)
    at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$sourceSchema$2(DataSource.scala:296)
    at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:183)
    at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$1(DataSource.scala:183)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:188)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:288)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:137)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:137)
    at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
    at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:264)
    at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:280)
    at org.apache.spark.sql.streaming.DataStreamReader.json(DataStreamReader.scala:361)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)


Process finished with exit code 1

If anyone could help me solve this problem, it would be a great help. Thanks!

Do you really need to read directly from /mnt? Usually the files sit at a deeper level.
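If so, a minimal sketch of pointing the stream at a deeper subdirectory instead, reusing spark and userschema from the question ("sensor-data" is a hypothetical folder name, not one confirmed by the asker):

# "sensor-data" is a placeholder; substitute the actual subdirectory
# under the mount rather than streaming from the mount root
df = spark.readStream.schema(userschema).json("dbfs:/mnt/sensor-data/")

df.writeStream.format("parquet") \
    .option("checkpointLocation", "dbfs:/mnt/parquet/demo_checkpoint1") \
    .option("path", "dbfs:/mnt/parquet/demo_parquet1") \
    .start()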