Apache Spark: NullPointerException when using PubsubIO with the Spark Runner in an Apache Beam pipeline


I have a very small example Apache Beam pipeline that I am trying to run with the SparkRunner.

Here is the pipeline code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
// import for DurationUtils omitted in the original post

public class SparkMain {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    final String projectId = "";
    final String dataset = "test-dataset";
    Duration durations = DurationUtils.parseDuration("10s");
    pipeline.apply("Read from PubSub",
        PubsubIO.readMessagesWithAttributes().fromSubscription("my-subscription"))
        .apply("Window", Window.into(new GlobalWindows())
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterFirst.of(
                    AfterPane.elementCountAtLeast(10),
                    AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
            .discardingFiredPanes())
        .apply("Convert to String", ParDo.of(new DoFn<PubsubMessage, String>() {
          @ProcessElement
          public void processElement(ProcessContext context) {
            PubsubMessage msg = context.element();
            String msgStr = new String(msg.getPayload());
            context.output(msgStr);
          }
        }))
        .apply("Write to File", TextIO
            .write()
            .withWindowedWrites()
            .withNumShards(1)
            .to("/Users/my-user/Documents/spark-beam-local/windowed-output"));
    pipeline.run();
  }
}
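One incidental detail in the conversion step: new String(msg.getPayload()) decodes the payload with the JVM's default charset. A safer variant (a minimal sketch, assuming the messages carry UTF-8 text, which the original post does not state) would be:

import java.nio.charset.StandardCharsets;

// Inside processElement: decode the Pub/Sub payload explicitly as UTF-8
// rather than relying on the JVM default charset (assumption: payloads
// are UTF-8 text).
String msgStr = new String(msg.getPayload(), StandardCharsets.UTF_8);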
I am using Apache Beam 2.16.0 and Spark 2.4.4 in local mode.

When I try to run this pipeline with the DirectRunner or the DataflowRunner, everything works fine, but when I switch the runner to the SparkRunner, the tasks start failing with the following exception:

19/12/17 12:15:45 INFO MicrobatchSource: No cached reader found for split: [org.apache.beam.sdk.io.gcp.pubsub.PubsubUnboundedSource$PubsubSource@46d6c879]. Creating new reader at checkpoint mark null
19/12/17 12:15:46 WARN BlockManager: Putting block rdd_7_9 failed due to exception java.lang.NullPointerException.
19/12/17 12:15:46 WARN BlockManager: Block rdd_7_9 could not be removed as it was not found on disk or in memory
19/12/17 12:15:46 ERROR Executor: Exception in task 9.0 in stage 2.0 (TID 9)
java.lang.NullPointerException
    at org.apache.beam.sdk.io.gcp.pubsub.PubsubUnboundedSource$PubsubReader.getWatermark(PubsubUnboundedSource.java:941)
    at org.apache.beam.runners.spark.io.MicrobatchSource$Reader.getWatermark(MicrobatchSource.java:291)
    at org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:181)
    at org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:107)
    at org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:181)
    at org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180)
    at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57)
    at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
    at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:159)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
I am using the following spark-submit command:

spark-submit --class SparkMain \
--master local[*] target/beam-1.0-SNAPSHOT.jar \
--runner=SparkRunner \
--project=<my-project> \
--gcpTempLocation=gs://<my-bucket>/temp \
--checkpointDir=/Users/my-user/Documents/beam-tmp/
There is a very similar but unanswered question.

Can someone suggest how to start debugging this issue?

Not sure, but removing --runner=SparkRunner seemed to solve the problem. My mistake: removing --runner=SparkRunner causes the pipeline to run with the DirectRunner, which is why it started working.
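For anyone hitting the same silent fallback: the runner can be pinned in code so that a missing or mistyped --runner flag fails fast instead of quietly selecting the DirectRunner. A minimal sketch, assuming Beam's Spark runner API (SparkPipelineOptions and SparkRunner from the beam-runners-spark artifact; the class name here is hypothetical):

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkMainPinned {
  public static void main(String[] args) {
    // Parse the flags as Spark-runner options so --checkpointDir is
    // recognized, then pin the runner explicitly; the pipeline can no
    // longer fall back to the DirectRunner when --runner is absent.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    Pipeline pipeline = Pipeline.create(options);
    // ... build the same transforms as above ...
    pipeline.run().waitUntilFinish();
  }
}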