Kinesis stream with empty records in Google Dataproc with Spark 1.6.1 / Hadoop 2.7.2
I am trying to connect to an Amazon Kinesis stream from Google Dataproc, but I only get empty RDDs.
Command: spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX
Verbose logs:

More details:
Spark 1.6.1
Hadoop 2.7.2
Assembly used: /usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar

Surprisingly, this works fine when I download and use the assembly that bundles Spark 1.6.1 with Hadoop 2.6.0, using the following command:
Command: SPARK_HOME=/opt/spark-1.6.1-bin-hadoop2.6 spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX
I am not sure whether there is a version conflict between the two Hadoop versions and the Kinesis ASL, or whether it is something in Google Dataproc's custom setup.

Any help would be much appreciated.

Thanks, Suren

Our team was in a similar situation and we managed to solve it. We were running in the same environment:
- Dataproc image version 1, with Spark 1.6.1 and Hadoop 2.7
- A simple Spark Streaming Kinesis script that boils down to:
# Run the script as
# spark-submit \
#   --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 \
#   demo_kinesis_streaming.py \
#   --awsAccessKeyId FOO \
#   --awsSecretKey BAR \
#   ...

import argparse

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.storagelevel import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

ap = argparse.ArgumentParser()
ap.add_argument('--awsAccessKeyId', required=True)
ap.add_argument('--awsSecretKey', required=True)
ap.add_argument('--stream_name')
ap.add_argument('--region')
ap.add_argument('--app_name')
ap = ap.parse_args()

kinesis_application_name = ap.app_name
kinesis_stream_name = ap.stream_name
kinesis_region = ap.region
kinesis_endpoint_url = 'https://kinesis.{}.amazonaws.com'.format(ap.region)

spark_context = SparkContext(appName=kinesis_application_name)
streamingContext = StreamingContext(spark_context, 60)

kinesisStream = KinesisUtils.createStream(
    ssc=streamingContext,
    kinesisAppName=kinesis_application_name,
    streamName=kinesis_stream_name,
    endpointUrl=kinesis_endpoint_url,
    regionName=kinesis_region,
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=60,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
    awsAccessKeyId=ap.awsAccessKeyId,
    awsSecretKey=ap.awsSecretKey
)

kinesisStream.pprint()

streamingContext.start()
streamingContext.awaitTermination()
- The code had been tested on AWS EMR and in a local environment with the same Spark 1.6.1 / Hadoop 2.7 setup
- On Dataproc the script returned empty RDDs without printing any error, even though there was data in the Kinesis stream
- We tested it on Dataproc with the following setups, and none of them worked:
  - submitting the job with the gcloud command
  - ssh-ing into the cluster master node and running it in yarn client mode
  - ssh-ing into the cluster master node and running it as local[*]
We enabled verbose logging by updating /etc/spark/conf/log4j.properties with the following values:
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
log4j.logger.org.eclipse.jetty=ERROR
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=DEBUG
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=DEBUG
log4j.logger.org.apache.spark=DEBUG
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=DEBUG
log4j.logger.org.spark-project.jetty.server.handler.ContextHandler=DEBUG
log4j.logger.org.apache=DEBUG
log4j.logger.com.amazonaws=DEBUG
We noticed something odd in the logs. Note that spark-streaming-kinesis-asl_2.10:1.6.1 uses aws-sdk-java/1.9.37 as a dependency, yet somehow aws-sdk-java/1.7.4 was reported (presumably via the user agent):
It looks as if Dataproc built its own Spark against a much older AWS SDK, and it blows up when combined with code that requires a much newer AWS SDK version, although we are not sure exactly which module caused the error.

Update: per @DennisHuo's comment, this behavior is caused by Hadoop's leaky classpath.

To make things worse, AWS KCL 1.4.0 (the version used by Spark 1.6.1) suppresses the failure silently instead of throwing a RuntimeException, which caused a lot of headaches while debugging.
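As a quick way to confirm the leak on a Dataproc node (our own sketch, not part of the original report; the jar path is the one reported in the question), you can scan the Spark assembly for bundled AWS SDK classes. If the list comes back empty, the old SDK is leaking in from a separate Hadoop jar on the classpath instead:

import zipfile

# Assembly shipped with the Dataproc image, as reported in the question.
jar = '/usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar'
with zipfile.ZipFile(jar) as zf:
    aws_entries = [n for n in zf.namelist() if n.startswith('com/amazonaws/')]

# A non-empty list means the assembly ships its own (old) aws-java-sdk,
# which shadows the newer SDK pulled in via --packages.
print(len(aws_entries))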
In the end, our solution was to build our own org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 with all of its com.amazonaws.* classes shaded. We built the JAR with the following pom (replacing spark/extras/kinesis-asl/pom.xml) and shipped the new JAR to spark-submit with the --jars flag:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.10</artifactId>
<version>1.6.1</version>
<relativePath>../../pom.xml</relativePath>
</parent>
<!-- Kinesis integration is not included by default due to ASL-licensed code. -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
<packaging>jar</packaging>
<name>Spark Kinesis Integration</name>
<properties>
<sbt.project.name>streaming-kinesis-asl</sbt.project.name>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>amazon-kinesis-client</artifactId>
<version>${aws.kinesis.client.version}</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>amazon-kinesis-producer</artifactId>
<version>${aws.kinesis.producer.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalacheck</groupId>
<artifactId>scalacheck_${scala.binary.version}</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-test-tags_${scala.binary.version}</artifactId>
</dependency>
</dependencies>
<build>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<configuration>
<shadedArtifactAttached>false</shadedArtifactAttached>
<artifactSet>
<includes>
<!-- At a minimum we must include this to force effective pom generation -->
<include>org.spark-project.spark:unused</include>
<include>com.amazonaws:*</include>
</includes>
</artifactSet>
<relocations>
<relocation>
<pattern>com.amazonaws</pattern>
<shadedPattern>foo.bar.YO.com.amazonaws</shadedPattern>
<includes>
<include>com.amazonaws.**</include>
</includes>
</relocation>
</relocations>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
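A sketch of how the shaded artifact can be built, checked, and used; the module path, output location, and artifact name below are assumptions based on the stock Spark 1.6.1 source layout, so adjust them to your checkout:

Command: mvn package -DskipTests   (run inside spark/extras/kinesis-asl)

Before shipping the JAR, it is worth verifying that the relocation actually took effect:

import zipfile

# Hypothetical output path; adjust to wherever Maven wrote the shaded artifact.
jar = 'extras/kinesis-asl/target/spark-streaming-kinesis-asl_2.10-1.6.1.jar'
with zipfile.ZipFile(jar) as zf:
    names = zf.namelist()

# After shading, the AWS classes should appear only under the relocated package.
print(any(n.startswith('foo/bar/YO/com/amazonaws/') for n in names))  # expect: True
print(any(n.startswith('com/amazonaws/') for n in names))             # expect: False

Then submit with the shaded JAR instead of --packages:

Command: spark-submit --verbose --jars extras/kinesis-asl/target/spark-streaming-kinesis-asl_2.10-1.6.1.jar demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX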
Comments:

@DennisHuo: Can you also provide the content of demo_kinesis_streaming.py? Looking at your logs I see spark.submit.deployMode -> client and spark.master -> local[*], which means that for some reason your spark-submit is not using the cluster's actual Spark settings (assuming you ran spark-submit on the Dataproc cluster); perhaps something is overriding spark.master to local? From this and this AWS forum question, the stream appearing empty may be related to not having enough executors running to receive and process the data. If you somehow ended up with local[2], it would make sense that your stream could not be processed. Also, on a Dataproc cluster you need to make sure the cluster is large enough to fit enough executors.

@Suren: Hi @DennisHuo, thanks for your reply. Here are the links. Regarding your comment about spark.master being local[*] in client mode, I tested this script by ssh-ing into the Dataproc master node and running spark-submit there. Thanks, Suren

@DennisHuo: In case my comment below gets buried: the offending old aws-java-sdk version comes from Hadoop itself, which makes it hard to shade it in the distribution without potentially breaking other users who inadvertently rely on the leaky classpath. You can trace where Hadoop's version of the AWS SDK comes from (Hadoop 2.7.x pins aws-java-sdk 1.7.4 in its project pom).
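To rule out the spark.master override raised in the first comment, a minimal sketch (ours, not from the thread) is to print the effective settings from inside demo_kinesis_streaming.py right after the SparkContext is created:

# Add after spark_context = SparkContext(...) in demo_kinesis_streaming.py:
for key in ('spark.master', 'spark.submit.deployMode'):
    # SparkConf.get falls back to the given default when the key is unset.
    print(key, '=', spark_context.getConf().get(key, '<unset>'))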