Apache Spark: Kinesis stream returns empty records on Google Dataproc (Spark 1.6.1, Hadoop 2.7.2)

Tags: apache-spark, pyspark, hadoop2, amazon-kinesis, google-cloud-dataproc

I am trying to connect to an Amazon Kinesis stream from Google Dataproc, but I am only getting empty RDDs.

Command: spark-submit  --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX        --awsSecretKey XXXX
Verbose log:

More details:
Spark 1.6.1
Hadoop 2.7.2
Assembly used: /usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar

Surprisingly, the same thing works fine when I download the assembly that bundles Spark 1.6.1 with Hadoop 2.6.0 and use the following command:

Command: SPARK_HOME=/opt/spark-1.6.1-bin-hadoop2.6 spark-submit  --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX        --awsSecretKey XXXX
I am not sure whether there is a version conflict between the two Hadoop versions and the Kinesis ASL, or whether it is related to Google Dataproc's custom setup.

Any help would be appreciated.

Thanks,

Suren

Our team ran into the same situation, and we managed to get to the bottom of it:

We were running in the same environment:

  • Dataproc image version 1, with Spark 1.6.1 and Hadoop 2.7
  • A simple Spark Streaming Kinesis script that boils down to the following:

    # Run the script as
    # spark-submit  \
    #    --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1\
    #    demo_kinesis_streaming.py\
    #    --awsAccessKeyId FOO\
    #    --awsSecretKey BAR\
    #    ... 
    
    import argparse
    
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.storagelevel import StorageLevel
    
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
    
    ap = argparse.ArgumentParser()
    ap.add_argument('--awsAccessKeyId', required=True)
    ap.add_argument('--awsSecretKey', required=True)
    ap.add_argument('--stream_name')
    ap.add_argument('--region')
    ap.add_argument('--app_name')
    ap = ap.parse_args()
    
    kinesis_application_name = ap.app_name
    kinesis_stream_name = ap.stream_name
    kinesis_region = ap.region
    kinesis_endpoint_url = 'https://kinesis.{}.amazonaws.com'.format(ap.region)
    
    spark_context = SparkContext(appName=kinesis_application_name)
    streamingContext = StreamingContext(spark_context, 60)
    
    kinesisStream = KinesisUtils.createStream(
        ssc=streamingContext,
        kinesisAppName=kinesis_application_name,
        streamName=kinesis_stream_name,
        endpointUrl=kinesis_endpoint_url,
        regionName=kinesis_region,
        initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
        checkpointInterval=60,
        storageLevel=StorageLevel.MEMORY_AND_DISK_2,
        awsAccessKeyId=ap.awsAccessKeyId,
        awsSecretKey=ap.awsSecretKey
    )
    
    kinesisStream.pprint()
    
    streamingContext.start()
    streamingContext.awaitTermination()
    
  • The code was tested on AWS EMR and in a local environment with the same Spark 1.6.1 / Hadoop 2.7 setup.

  • While there was data in the Kinesis stream, the script returned empty RDDs on Dataproc without printing any error.
  • We tested it on Dataproc in each of the following ways, but none of them worked (a hedged gcloud example follows this list):
      • submitting the job via the gcloud command
      • ssh-ing into the cluster master node and running in yarn-client mode
      • ssh-ing into the cluster master node and running as local[*]
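
For reference, a minimal sketch of a gcloud submission is shown below, assuming the argument names from the script above; the cluster name, Kinesis stream details, and the use of spark.jars.packages to pull in the Kinesis ASL artifact are placeholders for illustration rather than the exact command we ran:

    # Hedged sketch: submit the streaming script to an existing Dataproc cluster.
    # Cluster name, stream name, region and app name are placeholders.
    gcloud dataproc jobs submit pyspark demo_kinesis_streaming.py \
        --cluster my-dataproc-cluster \
        --properties spark.jars.packages=org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 \
        -- \
        --awsAccessKeyId FOO \
        --awsSecretKey BAR \
        --stream_name my-stream \
        --region us-east-1 \
        --app_name demo_kinesis_app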

When we enabled verbose logging by updating /etc/spark/conf/log4j.properties with the following values:

    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
    log4j.logger.org.eclipse.jetty=ERROR
    log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
    log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=DEBUG
    log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=DEBUG
    log4j.logger.org.apache.spark=DEBUG 
    log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=DEBUG
    log4j.logger.org.spark-project.jetty.server.handler.ContextHandler=DEBUG
    log4j.logger.org.apache=DEBUG
    log4j.logger.com.amazonaws=DEBUG

We noticed something strange in the logs (note that spark-streaming-kinesis-asl_2.10:1.6.1 uses aws-sdk-java/1.9.37 as a dependency, yet somehow aws-sdk-java/1.7.4 was being used, as suggested by the user agent):


It looks like Dataproc built its own Spark with a much older AWS SDK as a dependency, and it blows up when combined with code that requires a much newer AWS SDK version, although we are not sure exactly which module caused the error.

Update: per @DennisHuo's comment, this behaviour is caused by Hadoop's leaky classpath:

To make things worse, AWS KCL 1.4.0 (used by Spark 1.6.1) does not throw a RuntimeException when this happens, which caused a great deal of headache while debugging.
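
As a quick sanity check of the leaky classpath, something like the following can show which aws-java-sdk jar Hadoop actually exposes on a cluster node; the hadoop lib directory is an assumption and may differ between Dataproc images:

    # Hedged sketch: list AWS SDK jars visible via the Hadoop classpath (run on the master node).
    hadoop classpath | tr ':' '\n' | grep -i 'aws' | sort -u
    # The lib directory below is an assumption; adjust for your image.
    ls /usr/lib/hadoop*/lib/aws-java-sdk*.jar 2>/dev/null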


Eventually, our solution was to build our own org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 with all of its com.amazonaws.* classes relocated (shaded). We built the JAR with the following pom (a modified spark/extras/kinesis-asl/pom.xml) and shipped the new JAR to spark-submit via the --jars flag; a hedged build/submit sketch follows the pom below.

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-parent_2.10</artifactId>
    <version>1.6.1</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>

  <!-- Kinesis integration is not included by default due to ASL-licensed code. -->
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
  <packaging>jar</packaging>
  <name>Spark Kinesis Integration</name>

  <properties>
    <sbt.project.name>streaming-kinesis-asl</sbt.project.name>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-client</artifactId>
      <version>${aws.kinesis.client.version}</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-producer</artifactId>
      <version>${aws.kinesis.producer.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-core</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalacheck</groupId>
      <artifactId>scalacheck_${scala.binary.version}</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-test-tags_${scala.binary.version}</artifactId>
    </dependency>
  </dependencies>

  <build>
    <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
    <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>

    <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <configuration>
            <shadedArtifactAttached>false</shadedArtifactAttached>

            <artifactSet>
              <includes>
                <!-- At a minimum we must include this to force effective pom generation -->
                <include>org.spark-project.spark:unused</include>
                <include>com.amazonaws:*</include>
              </includes>
            </artifactSet>

            <relocations>
              <relocation>
                <pattern>com.amazonaws</pattern>
                <shadedPattern>foo.bar.YO.com.amazonaws</shadedPattern>
                <includes>
                  <include>com.amazonaws.**</include>
                </includes>
              </relocation>
            </relocations>

          </configuration>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
    </plugins>
  </build>
</project>
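
A hedged sketch of the build and submit steps, assuming the pom above replaces extras/kinesis-asl/pom.xml inside a Spark 1.6.1 source checkout (the module path and the resulting jar name follow Maven defaults and may differ):

    # Hedged sketch: build only the kinesis-asl module, with com.amazonaws relocated by the shade plugin.
    cd spark/extras/kinesis-asl
    mvn -DskipTests clean package

    # Ship the shaded JAR explicitly instead of pulling the stock artifact with --packages.
    spark-submit \
        --jars target/spark-streaming-kinesis-asl_2.10-1.6.1.jar \
        demo_kinesis_streaming.py \
        --awsAccessKeyId FOO \
        --awsSecretKey BAR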

Comments:

"Could you also provide the content of demo_kinesis_streaming.py? Looking at your logs I see spark.submit.deployMode -> client and spark.master -> local[*], which means that for some reason your spark-submit is not picking up the cluster's actual Spark settings (assuming you are running spark-submit on the Dataproc cluster); is something perhaps overriding spark.master to local? Judging from this and this [AWS forum question] (where it appears to be empty), it may be related to not having enough executors running to receive and process the data, so if you somehow ended up with local[2] it would make sense that your stream could not be processed. Also, on a Dataproc cluster you need to make sure the cluster is large enough to fit enough executors."

"Hi @DennisHuo, thanks for responding. Here are the links. Regarding your comment about spark.master becoming local[*] in client mode: as the logs show, I tested this script by ssh-ing into the Dataproc master node and running spark-submit there. Thanks, Suren"

"Thanks Suren. In case my comment below gets buried: the problematic old aws-java-sdk version comes from there, which makes it hard to shade it in the distribution without potentially breaking other users who inadvertently depend on the leaked classpath. We could take the AWS SDK version that ships with Hadoop from …"