
Hadoop: How to resolve the Guava dependency issue when submitting an Uber Jar to Google Dataproc


I am building an Uber jar with the maven-shade-plugin so that I can submit it as a job to a Google Dataproc cluster. Google has Apache Spark 2.0.2 and Apache Hadoop 2.7.3 installed on its clusters.

Apache Spark 2.0.2 uses version 14.0.1 of com.google.guava, and Apache Hadoop 2.7.3 uses version 11.0.2; both of these should already be on the classpath. My maven-shade-plugin configuration is:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.0.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!--
          <artifactSet>
            <includes>
              <include>com.google.guava:guava:jar:19.0</include>
            </includes>
          </artifactSet>
        -->
        <artifactSet>
          <excludes>
            <exclude>com.google.guava:guava:*</exclude>
          </excludes>
        </artifactSet>
      </configuration>
    </execution>
  </executions>
</plugin>
If I exclude Guava 16.0.1, it throws this exception:

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/reflect/TypeParameter
at com.datastax.driver.core.SanityChecks.checkGuava(SanityChecks.java:50)
at com.datastax.driver.core.SanityChecks.check(SanityChecks.java:36)
at com.datastax.driver.core.Cluster.<clinit>(Cluster.java:67)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.clusterBuilder(CassandraConnectionFactory.scala:35)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createCluster(CassandraConnectionFactory.scala:92)
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:154)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:149)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:149)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:82)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:121)
at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:322)
at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:342)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:50)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:137)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:232)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
at org.apache.spark.rdd.RDD.count(RDD.scala:1134)
at com.test.scala.CreateVirtualTable$.main(CreateVirtualTable.scala:47)
at com.test.scala.CreateVirtualTable.main(CreateVirtualTable.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.reflect.TypeParameter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 38 more
17/05/11 08:24:00 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector@edc6a5d{HTTP/1.1}{0.0.0.0:4040}
17/05/11 08:24:00 INFO com.datastax.spark.connector.util.SerialShutdownHooks: Successfully executed shutdown hook: Clearing session cache for C* connector
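For context, the failing line in CreateVirtualTable.scala corresponds to a Cassandra table scan followed by a count. A minimal sketch of what the job does (the keyspace, table, and connection host below are placeholders, not the real values):

import com.datastax.spark.connector._          // adds cassandraTable() to SparkContext
import org.apache.spark.{SparkConf, SparkContext}

object CreateVirtualTable {
  def main(args: Array[String]): Unit = {
    // The connection host is a placeholder.
    val conf = new SparkConf()
      .setAppName("CreateVirtualTable")
      .set("spark.cassandra.connection.host", "10.0.0.1")
    val sc = new SparkContext(conf)

    // count() forces partition planning, which opens a Cassandra session and runs
    // the driver's Guava sanity check (SanityChecks.checkGuava) seen in the stack trace above.
    val rows = sc.cassandraTable("my_keyspace", "my_table")
    println(rows.count())

    sc.stop()
  }
}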
So what is going on here? Is the classloader on Dataproc picking up Guava 11.0.2 from Hadoop? Guava 11.0.2 does not contain the class com/google/common/reflect/TypeParameter.
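One quick way to check which Guava the job is actually loading (a diagnostic sketch, not part of the original job) is to ask the JVM where the class came from:

// Joiner exists in every Guava version, so this works even when newer
// classes such as TypeParameter are missing from the version that won.
val guavaJar = classOf[com.google.common.base.Joiner]
  .getProtectionDomain.getCodeSource.getLocation
println(s"Guava loaded from: $guavaJar")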
Any Google Dataproc developers following this tag, please help.

Edited: for complete examples for both Maven and SBT, see:

Original answer

When I run uber JARs on Hadoop/Spark/Dataproc, I often just use whichever Guava version suits my needs and then rely on shade relocation, which lets the different versions coexist without problems:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
      <artifactSet>
          <includes>
            <include>com.google.guava:*</include>
          </includes>
      </artifactSet>
      <minimizeJar>false</minimizeJar>
      <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>repackaged.com.google.common</shadedPattern>
          </relocation>
      </relocations>
      <shadedArtifactAttached>true</shadedArtifactAttached>
      </configuration>
    </execution>
  </executions>
</plugin>
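Because shadedArtifactAttached is set to true, the shaded jar is attached next to the main artifact under the shaded classifier (typically target/&lt;artifact&gt;-&lt;version&gt;-shaded.jar), and that is the jar to submit to Dataproc.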

Thank you, it worked :) How does this renaming of the class package work? How does Spark now know to pick these classes from the classpath rather than Guava 11.0.2 from Hadoop?

When you use relocation, shade rewrites your classes to use the new package named "repackaged.com.google.common" and places your Guava version under that package. The Guava version that ships with Hadoop still uses the com.google.common package and no longer conflicts with the uber jar, because the uber jar no longer contains any classes in that package.
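If you want to confirm the relocation actually took effect, here is a small sketch (the jar path is a placeholder for your build output) that lists the relocated Guava entries inside the shaded jar:

import java.util.jar.JarFile
import scala.collection.JavaConverters._

// The path is a placeholder; point it at the shaded jar produced by the build.
val jar = new JarFile("target/my-job-1.0-shaded.jar")
jar.entries().asScala
  .map(_.getName)
  .filter(_.startsWith("repackaged/com/google/common"))
  .take(10)
  .foreach(println)
jar.close()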