Scala Spark job in Dataproc cluster returns java.util.NoSuchElementException: None.get


I am getting the error

ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.NoSuchElementException: None.get
when I run my job on a Dataproc cluster. When I run it locally, it works perfectly fine. I reproduced the issue with the toy example below:

package com.deequ_unit_tests

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object reduce_by_key_example {
  def main(args: Array[String]): Unit = {

  // Set the log level to only print errors
  Logger.getLogger("org").setLevel(Level.ERROR)

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  println("Step 1")
  val data = Seq(("Project", 1),
    ("Gutenberg’s", 1),
    ("Alice’s", 1),
    ("Adventures", 1),
    ("in", 1),
    ("Wonderland", 1),
    ("Project", 1),
    ("Gutenberg’s", 1),
    ("Adventures", 1),
    ("in", 1),
    ("Wonderland", 1),
    ("Project", 1),
    ("Gutenberg’s", 1))

  println("Step 2")
  val rdd = spark.sparkContext.parallelize(data)

  println("Step 3")
  val rdd2 = rdd.reduceByKey(_ + _)

  println("Step 4")
  rdd2.foreach(println)
  }
}
When running this job in Dataproc, the error appears when this line is executed:

rdd2.foreach(println)
As additional information, I should mention that I did not get this error before some changes were applied to my company's Dataproc cluster. Colleagues using PySpark fixed the equivalent PySpark version of the example above by changing

  sc = SparkContext('local')

to

  sc = SparkContext()

but I couldn't find the equivalent solution in Spark Scala. Do you have any idea what could be causing this issue? Any help is welcome.
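
For clarity, the equivalent change in Scala would presumably be to build the session without .master(...), letting spark-submit or Dataproc supply the master URL at runtime. A minimal sketch of that idea (object and app names here are hypothetical, not from my real job):

import org.apache.spark.sql.SparkSession

// Sketch: the Scala analogue of changing SparkContext('local') to SparkContext().
// No .master(...) is set; the master URL must come from spark-submit / Dataproc.
object no_master_example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("no_master_example") // hypothetical app name
      .getOrCreate()
    println(spark.sparkContext.master) // e.g. "yarn" on Dataproc
    spark.stop()
  }
}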

  • Configure your pom.xml (or your build.sbt; a hypothetical sbt equivalent is sketched after the pom below) as follows:
  • Add the provided scope in the build script:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>org.example</groupId>
        <artifactId>stackOverFlowGcp</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
                <version>2.2.3</version>
                <scope>provided</scope>
    
    
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.11</artifactId>
                <version>2.2.3</version>
                <scope>provided</scope>
            </dependency>
    
    
            <dependency>
                <groupId>com.typesafe</groupId>
                <artifactId>config</artifactId>
                <version>1.4.0</version>
                <scope>provided</scope>
    
            </dependency>
    
    
        </dependencies>
    
    
        <build>
            <plugins>
                <!-- Maven Plugin -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>2.3.2</version>
                    <configuration>
                        <source>8</source>
                        <target>8</target>
                    </configuration>
                </plugin>
                <!-- assembly Maven Plugin -->
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <configuration>
                        <archive>
                            <manifest>
                                <mainClass>mainPackage.mainObject</mainClass>
                            </manifest>
                        </archive>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
    
            </plugins>
    
        </build>
    
    
    </project>
    
    
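  The provided scope keeps the Spark jars out of the assembled fat jar, so the executors run against the cluster's own Spark classes (bundling a different Spark version is a classic source of InvalidClassException-style serialization errors). For sbt users, a minimal build.sbt with the same dependencies might look like this sketch; the Scala version is an assumption, since the pom only implies it via the _2.11 artifact suffix:

    name := "stackOverFlowGcp"
    version := "1.0-SNAPSHOT"
    scalaVersion := "2.11.12" // assumed; must match the _2.11 artifacts above

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.2.3" % Provided,
      "org.apache.spark" %% "spark-sql"  % "2.2.3" % Provided,
      "com.typesafe"     %  "config"     % "1.4.0" % Provided
    )
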
  • Create the Dataproc cluster (a sketch of the gcloud commands follows below)
  • Run the Spark job in Dataproc
  • You will not see the results mentioned earlier in the Dataproc output; if you wonder why, read more about how Dataproc handles job output. However, if you want, you can display a DataFrame in Dataproc.

    As you can see, in Dataproc everything runs fine.
    Don't forget to shut down the cluster or delete it when you are done ;)
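
    For reference, the create/submit steps might look roughly like the following gcloud commands; the cluster name, region, and bucket path are placeholders, not values from the post:

    # Create a cluster (name and region are hypothetical)
    gcloud dataproc clusters create my-cluster --region=us-central1

    # Submit the assembled jar; the class matches the pom's <mainClass>
    gcloud dataproc jobs submit spark \
      --cluster=my-cluster \
      --region=us-central1 \
      --class=mainPackage.mainObject \
      --jars=gs://my-bucket/stackOverFlowGcp-1.0-SNAPSHOT-jar-with-dependencies.jar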

    Comments:

    • Try not setting the master when creating the Spark session: SparkSession.builder().appName("SparkByExamples.com").getOrCreate(). It works for the toy example (though nothing is printed at rdd2.foreach(println)). Still, in the real case I am working on, not adding the master makes the process return this error before breaking. With the master added: WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, dataproc-managed-w-91.c.wf-gcp-us-ae-dataproc-prod.internal, executor 2): java.io.InvalidClassException: com.google.cloud.spark.bigquery.SparkBigQueryConfig; local class incompatible: stream classdesc serialVersionUID = 2964184825620630609, local class serialVersionUID = -3988734315685039601
    • Since you cannot print RDD rows without converting them (RDD immutability), just check that the different versions are compatible. One more question: are you running the code from a Zeppelin notebook on the Dataproc VM, or as a jar?
    • The toy example was taken from here. Is something missing? It runs perfectly locally and prints the desired output. In any case, the example is only meant to illustrate the problem I am facing.
    • I don't understand what you mean by locally (on your machine) vs. in Dataproc. Can you share your code?
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>org.example</groupId>
        <artifactId>stackOverFlowGcp</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
                <version>2.2.3</version>
                <scope>provided</scope>
    
    
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.11</artifactId>
                <version>2.2.3</version>
                <scope>provided</scope>
            </dependency>
    
    
            <dependency>
                <groupId>com.typesafe</groupId>
                <artifactId>config</artifactId>
                <version>1.4.0</version>
                <scope>provided</scope>
    
            </dependency>
    
    
        </dependencies>
    
    
        <build>
            <plugins>
                <!-- Maven Plugin -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>2.3.2</version>
                    <configuration>
                        <source>8</source>
                        <target>8</target>
                    </configuration>
                </plugin>
                <!-- assembly Maven Plugin -->
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <configuration>
                        <archive>
                            <manifest>
                                <mainClass>mainPackage.mainObject</mainClass>
                            </manifest>
                        </archive>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
    
            </plugins>
    
        </build>
    
    
    </project>
    
    
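    The pom.xml is the same as shown above; the main object follows, with the master left unset so that Dataproc can provide it:
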
    package mainPackage
    import org.apache.spark.sql.SparkSession
    
    object mainObject {
    
    
      def main(args: Array[String]): Unit = {
    
    
        val spark: SparkSession = SparkSession.builder()
          //.master("local[*]")
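          // Leaving .master(...) unset is the key change: Dataproc's spark-submit supplies the master URL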
          .appName("SparkByExamples")
          .getOrCreate()
    
        spark.sparkContext.setLogLevel("ERROR")
    
        println("Step 1")
        val data = Seq(("Project", 1),
          ("Gutenberg’s", 1),
          ("Alice’s", 1),
          ("Adventures", 1),
          ("in", 1),
          ("Wonderland", 1),
          ("Project", 1),
          ("Gutenberg’s", 1),
          ("Adventures", 1),
          ("in", 1),
          ("Wonderland", 1),
          ("Project", 1),
          ("Gutenberg’s", 1))
    
        println("Step 2")
        val rdd = spark.sparkContext.parallelize(data)
        println("Step 3")
        val rdd2 = rdd.reduceByKey(_ + _)
    
        println("Step 4")
        rdd2.foreach(println)
    
    
      }
    }
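
    As the comments note, rdd2.foreach(println) runs on the executors, so its output ends up in the executor logs rather than in the driver output that Dataproc displays. A common workaround for small results, sketched here, is to collect to the driver first, replacing Step 4:

        // collect() pulls all rows into driver memory; safe only for small RDDs
        rdd2.collect().foreach(println)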