Scala Spark NoClassDefFoundError: InitialPositionInStream

I deploy a Spark application written in Scala to an EMR cluster with the command below, and I cannot work out why I get a missing-dependency error when it runs on the EMR cluster instance.

Error message:

User class threw exception: java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/clientlibrary/lib/worker/InitialPositionInStream
and the sbt file:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.715"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
Below is part of the code:

...

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

...

        val streamingContext = new StreamingContext(sparkContext, batchInterval)

        // Populate the appropriate variables from the given args
        val streamAppName = "xxxxxx"
        val streamName = "xxxxxx"
        val endpointUrl = "https://kinesis.xxxxx.amazonaws.com"
        val regionName = "xx-xx-x"
        val initialPosition = InitialPositionInStream.LATEST
        val checkpointInterval = batchInterval
        val storageLevel = StorageLevel.MEMORY_AND_DISK_2

        val kinesisStream = KinesisUtils.createStream(streamingContext, streamAppName, streamName, endpointUrl, regionName, initialPosition, checkpointInterval, storageLevel)
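For context, a minimal self-contained version of the relevant setup looks roughly like this (a sketch; the app name, stream name, region, batch interval, and the record handling are placeholders, not values from the original):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ScalaStream {
  def main(args: Array[String]): Unit = {
    val batchInterval = Seconds(10)
    val streamingContext =
      new StreamingContext(new SparkConf().setAppName("ScalaStream"), batchInterval)

    // Each record arrives from the Kinesis receiver as a raw byte array
    val kinesisStream = KinesisUtils.createStream(
      streamingContext, "app-name", "stream-name",
      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST, batchInterval, StorageLevel.MEMORY_AND_DISK_2)

    // Placeholder processing: decode and print each record
    kinesisStream.map(bytes => new String(bytes)).print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}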


I have already tried including the AWS dependency both in the sbt file and via spark-submit's --jars argument, but I can't see why the dependency would be missing.

Fixed by updating the following:

1. the sbt file
2. the deployment script

Deployment script:

aws emr add-steps --cluster-id j-xxxxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.60\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://xxx.xxx/simple-project_2.12-1.0.jar\
],ActionOnFailure=CONTINUE
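For reference, the same submission expressed as a plain spark-submit call (a sketch; the jar path, class name, and packages are the ones from the step above):

spark-submit \
  --class ScalaStream \
  --master yarn \
  --deploy-mode cluster \
  --packages 'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0,org.postgresql:postgresql:42.2.9,com.facebook.presto:presto-jdbc:0.60' \
  --conf spark.yarn.submit.waitAppCompletion=false \
  s3://xxx.xxx/simple-project_2.12-1.0.jar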

The key was the --packages flag added to the aws emr add-steps call. I had mistakenly assumed that sbt package bundles the required dependencies, but it only packages the project's own classes. The missing class InitialPositionInStream lives in amazon-kinesis-client, which spark-streaming-kinesis-asl pulls in transitively, so neither was on the classpath at runtime until --packages supplied them.
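A quick way to confirm this (a sketch; the path assumes sbt's default output layout for this build):

unzip -l target/scala-2.12/simple-project_2.12-1.0.jar | grep -i kinesis
# no output: the jar built by `sbt package` contains only the project's own classes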


The updated sbt file:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
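An alternative to --packages would be to build a fat jar so the dependencies really are bundled, for example with the sbt-assembly plugin (a sketch; the plugin version is an assumption, a custom merge strategy may also be needed for duplicate files, and the Spark artifacts are marked provided since EMR supplies them):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt -- Spark itself is provided by the cluster; bundle only the Kinesis connector
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"

// `sbt assembly` then produces a jar that includes amazon-kinesis-client,
// so InitialPositionInStream is on the classpath without --packages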