Scala Spark Streaming: save a base64 RDD as JSON on S3


The Scala application below fails to save an RDD to S3 in JSON format.

I have:

  • A Kinesis stream carrying complex objects. Each object has JSON.stringify() applied to it before being placed on the stream via the Kinesis PutRecord call (see the producer sketch after this list).
  • A Scala Spark Streaming job that reads these items off the stream.
  • I cannot seem to save the RDD records from the stream to an S3 bucket as JSON files.
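
For context, the producer side does roughly the following (a minimal sketch using the AWS SDK for Java v1 from Scala; the stream name, partition key, and payload are placeholders):

    import java.nio.ByteBuffer
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
    import com.amazonaws.services.kinesis.model.PutRecordRequest

    object ProducerSketch {
      def main(args: Array[String]): Unit = {
        val kinesis = AmazonKinesisClientBuilder.standard().build()

        // The complex object is serialized to a JSON string before it is put
        // on the stream (the JSON.stringify() step when the producer is JavaScript)
        val payload = """{"id":1,"nested":{"a":"b"}}"""

        val request = new PutRecordRequest()
          .withStreamName("cc-cc-c--c--cc")
          .withPartitionKey("example-partition-key")
          .withData(ByteBuffer.wrap(payload.getBytes("UTF-8")))

        kinesis.putRecord(request)
      }
    }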

In the code I have tried converting the RDD[Array[Byte]] to an RDD[String] and then loading it with spark.read.json, with no success. I have tried various other combinations but cannot seem to get the output to S3 in its original format.

    import org.apache.spark._
    import org.apache.spark.sql._
    import java.util.Base64
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Milliseconds, StreamingContext}
    import org.apache.spark.streaming.Duration
    import org.apache.spark.streaming.kinesis._
    import org.apache.spark.streaming.kinesis.KinesisInputDStream
    
    import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
    
    object ScalaStream {
      def main(args: Array[String]): Unit = {  
            val appName = "ScalaStreamExample"
            val batchInterval = Milliseconds(2000)
            val outPath = "s3://xxx-xx--xxx/xxxx/"
    
            val spark = SparkSession
                .builder()
                .appName(appName)
                .getOrCreate()
    
            val sparkContext = spark.sparkContext
            val streamingContext = new StreamingContext(sparkContext, batchInterval)
    
            // Populate the appropriate variables from the given args
            val checkpointAppName = "xxx-xx-xx--xx"
            val streamName = "cc-cc-c--c--cc"
            val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
            val regionName = "cc-xxxx-xxx"
            val initialPosition = new Latest()
            val checkpointInterval = batchInterval
            val storageLevel = StorageLevel.MEMORY_AND_DISK_2
    
            val kinesisStream = KinesisInputDStream.builder
             .streamingContext(streamingContext)
             .endpointUrl(endpointUrl)
             .regionName(regionName)
             .streamName(streamName)
             .initialPosition(initialPosition)
             .checkpointAppName(checkpointAppName)
             .checkpointInterval(checkpointInterval)
             .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
             .build()
    
            kinesisStream.foreachRDD { rdd =>
                if (!rdd.isEmpty()){
                    //**************** .  <---------------
                    // This is where I'm trying to save the raw JSON object to S3
                    // as a JSON file; tried various combinations here but no luck.
                    // Decode each Kinesis record (Array[Byte]) back to its JSON
                    // string, parse the strings into a DataFrame, then append to S3.
                    import spark.implicits._
                    val jsonStrings = rdd.map(record => new String(record))
                    val dataFrame = spark.read.json(jsonStrings.toDS())
                    dataFrame.write.mode(SaveMode.Append).json(outPath + "/" + rdd.id.toString())
                    //**************** <----------------
                }
            }
    
            // Start the streaming context and await termination
            streamingContext.start()
            streamingContext.awaitTermination()
        }
    }
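
As an aside, if the goal is only to persist the raw JSON payloads unmodified (no schema inference or parsing), a plain text save of the decoded strings is a simpler sketch of the same idea:

    kinesisStream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
            // The payloads are already JSON strings, so writing them one per
            // line produces line-delimited JSON files directly
            rdd.map(record => new String(record)).saveAsTextFile(outPath + "/" + rdd.id.toString())
        }
    }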
    
So the reason it was failing turned out to be a complete red herring: it was a conflict with the Scala version available on EMR.
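
A quick way to confirm which Scala version the cluster's Spark build actually runs against is to print it from a spark-shell on the EMR master node (the spark-shell startup banner shows it too):

    // Run inside spark-shell on the EMR master node.
    // Spark 2.4.x on EMR 5.x is typically built against Scala 2.11, so a jar
    // compiled for 2.12 fails at runtime even though it builds cleanly.
    println(scala.util.Properties.versionString)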

Many similar questions suggested this could be the problem: although the Spark documentation lists Scala 2.12.4 as compatible with Spark 2.4.4, the EMR instance does not appear to support Scala 2.12. I therefore moved the project back to Scala 2.11, updating both the build.sbt and the deploy script. The original (failing) 2.12 configuration:

    build.sbt:

    name := "Simple Project"
    
    version := "1.0"
    
    scalaVersion := "2.12.8"
    
    libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
    libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
    libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
    
The deploy script was updated to match, pulling the 2.11 Kinesis package and submitting the _2.11 application jar:

    deploy.sh:

    aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
    --class,"ScalaStream",\
    --deploy-mode,cluster,\
    --master,yarn,\
    --packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
    --conf,spark.yarn.submit.waitAppCompletion=false,\
    --conf,yarn.log-aggregation-enable=true,\
    --conf,spark.dynamicAllocation.enabled=true,\
    --conf,spark.cores.max=4,\
    --conf,spark.network.timeout=300,\
    s3://ccc.xxxx/simple-project_2.11-1.0.jar\
    ],ActionOnFailure=CONTINUE
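
With the 2.11 build, sbt package emits target/scala-2.11/simple-project_2.11-1.0.jar (sbt lowercases and hyphenates the project name), which is the jar path the step definition above submits.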
    

What do you mean by 'no luck'? Have you tried debugging your code? How far did you get it working? @michaJlS good point, I'll update the question with some examples. I initially omitted an example because I had exhausted so many options that I didn't know which one to list or ask for help with.