Scala Spark: split rows and accumulate

I have the following code:

val rdd = sc.textFile("sample.log")
val splitRDD = rdd.map(r => StringUtils.splitPreserveAllTokens(r, "\\|"))
val rdd2 = splitRDD.filter(...).map(row => createRow(row, fieldsMap))
sqlContext.createDataFrame(rdd2, structType).save(
    "org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> table, "zkUrl" -> zkUrl))

def createRow(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Row = {
    //add additional index for invalidValues
    val arrSize = fieldsMap.size + 1
    val arr = new Array[Any](arrSize)
    var invalidValues = ""
    for ((k, v) <- fieldsMap) {
      val valid = ...
      var value : Any = null
      if (valid) {
        value = row(k)
        // if (v.code == "SOURCE_NAME") --> 5th column in the row
        // sourceNameCount = row(k).split(",").size
      } else {
        invalidValues += v.code + " : " + row(k) + " | "
      }
      arr(k) = value
    }
    arr(arrSize - 1) = invalidValues
    Row.fromSeq(arr.toSeq)
}

This is sample.log:

TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3|
SOURCE_TYPE1,SOURCE_TYPE2,SOURCE_TYPE3|SOURCE_COUNT1,SOURCE_COUNT2,SOURCE_COUNT3|
DEST_NAME1,DEST_NAME2,DEST_NAME3|DEST_TYPE1,DEST_TYPE2,DEST_TYPE3|
DEST_COUNT1,DEST_COUNT2,DEST_COUNT3|
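
The goal is to split each input line of sample.log according to the number of source names. In the example above, the output will have 3 rows: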

TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1|SOURCE_TYPE1|SOURCE_COUNT1|
|DEST_NAME1|DEST_TYPE1|DEST_COUNT1|

TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME2|SOURCE_TYPE2|SOURCE_COUNT2|
DEST_NAME2|DEST_TYPE2|DEST_COUNT2|

TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME3|SOURCE_TYPE3|SOURCE_COUNT3|
|DEST_NAME3|DEST_TYPE3|DEST_COUNT3|

This is the new code I am writing, which uses the createRow defined above:

      val rdd2 = splitRDD.filter(...).flatMap(row => {

        val srcName = row(4).split(",")
        val srcType = row(5).split(",")
        val srcCount = row(6).split(",")

        val destName = row(7).split(",")
        val destType = row(8).split(",")
        val destCount = row(9).split(",")

        var newRDD: ArrayBuffer[Row] = new ArrayBuffer[Row]()

        //if (srcName != null) {
        println("\n\nsrcName.size: " + srcName.size + "\n\n")
        for (i <- 0 to srcName.size - 1) {
          // missing column: destType can sometimes be null

          val splittedRow: Array[String] = Row.fromSeq(Seq((row(0), row(1), row(2), row(3), 
            srcName(i), srcType(i), srcCount(i), destName(i), "", destCount(i)))).toSeq.toArray[String]

          newRDD = newRDD ++ Seq(createRow(splittedRow, fieldsMap))
        }
        //}

        Seq(Row.fromSeq(Seq(newRDD)))

    })

I decided to update splittedRow to:

        val rowArr: Array[String] = new Array[String](10)

          for (j <- 0 to 3) {
            rowArr(j) = row(j)
          }
          rowArr(4) = srcName(i)
          rowArr(5) = row(5).split(",")(i)
          rowArr(6) = row(6).split(",")(i)
          rowArr(7) = row(7).split(",")(i)
          rowArr(8) = row(8).split(",")(i)
          rowArr(9) = row(9).split(",")(i)

          val splittedRow = rowArr
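
A note on the updated splittedRow: indexing each split column with (i) throws ArrayIndexOutOfBoundsException as soon as one column has fewer entries than SOURCE_NAME, which is exactly the missing destType case. Java's split(",") also drops trailing empty strings, so "a,b,".split(",") has length 2. The following is a minimal sketch in plain Scala (no Spark needed) of a bounds-safe version. The names ExplodeSketch, cell and explode are hypothetical, and filling a missing value with "" is an assumption; use null instead if createRow should see it as absent.

```scala
object ExplodeSketch {
  // Bounds-safe lookup: returns "" when the index is out of range (assumed default).
  def cell(arr: Array[String], i: Int): String =
    if (i >= 0 && i < arr.length) arr(i) else ""

  /** Turns one split input line into one 10-column array per source name.
    * Columns 0 to 3 are copied as-is; columns 4 to 9 are comma-separated lists.
    */
  def explode(fields: Array[String]): Seq[Array[String]] = {
    // limit -1 keeps trailing empty strings, so "a,b," yields three entries
    val lists: Map[Int, Array[String]] =
      (4 to 9).map(c => c -> cell(fields, c).split(",", -1)).toMap
    // the number of source names (column 4) decides how many rows come out
    val n = lists(4).length
    (0 until n).map { i =>
      Array.tabulate(10)(c => if (c < 4) cell(fields, c) else cell(lists(c), i))
    }
  }
}
```

In the pipeline from the question this would presumably slot in as splitRDD.filter(...).flatMap(row => ExplodeSketch.explode(row)).map(row => createRow(row, fieldsMap)). Note that an empty SOURCE_NAME column still produces one output row, because "".split(",", -1) has length 1.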

You can use a flatMap operation instead of a map operation to return multiple rows. Your createRow would then be refactored into createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row].
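
To illustrate the difference between map and flatMap, here is a small sketch on a local Seq. RDD.flatMap has the same shape: its function returns a collection (a TraversableOnce) per input element, and the results are concatenated. The createRows below is a simplified, hypothetical stand-in that works on strings rather than on Row and fieldsMap.

```scala
object FlatMapSketch {
  // Hypothetical stand-in for createRows: one input record, several output records.
  def createRows(row: Array[String]): Seq[String] =
    row(1).split(",").toSeq.map(name => s"${row(0)}|$name")

  val input: Seq[Array[String]] =
    Seq(Array("TOPIC", "A,B,C"), Array("OTHER", "X"))

  // map keeps one result per input element, so the results stay nested
  val nested: Seq[Seq[String]] = input.map(createRows)

  // flatMap concatenates the per-element results into one flat collection
  val flat: Seq[String] = input.flatMap(createRows)
}
```

In the question's new code the closure ends with Seq(Row.fromSeq(Seq(newRDD))), which packs every generated row into a single one-column Row. Returning the buffer itself (newRDD) would presumably let flatMap emit one Row per source.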

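Hi Till, I updated my question to include createRows. Could you suggest a solution for row.map, since I get multiple rows? Also, what approach do you suggest for the missing columns? Thanks

@Sophie, I'd love to help you, but I need to understand your input and output format better. Your example output looks like TOPIC|GROUP|... but in the first createRow implementation you construct key-value pairs of the form v.code + " : " + row(k). Could you give a working example that defines the correct input and output? Also, which columns can be missing?

Hi Till, I updated my question to include an explanation of the fieldsMap parameter. My idea is to reuse the createRow function and preprocess the row I pass to it. So if the input is originally 1 row but has 3 sources, flatMap will produce 3 rows and createRow will then be called 3 times. The newRDD will then be passed to createDataFrame. About the missing column: it is based on my new code, destType(i). I don't know how to handle the case where destType is sometimes null when building the Array[Row].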

Compile error produced by the first version of splittedRow (the .toSeq.toArray[String] call):

error: type arguments [String] do not conform to method toArray's type parameter bounds [B >: Any]
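
Why the compiler rejects toArray[String] here: Row.toSeq returns a Seq[Any], and toArray only accepts a type parameter that is a supertype of the element type, so only Array[Any] is allowed. In addition, Seq((row(0), row(1), ...)) with doubled parentheses builds a Seq holding a single tuple, not ten separate values. The sketch below reproduces the situation in plain Scala and shows two ways around it. The object name ToArraySketch is hypothetical, and mapping null to "" is an assumption.

```scala
object ToArraySketch {
  val cells: Seq[Any] = Seq("TOPIC", "GROUP", null, "STATUS")

  // cells.toArray[String]  // does not compile: String is not a supertype of Any

  // Option 1: keep the elements typed as Any, which toArray allows
  val asAny: Array[Any] = cells.toArray

  // Option 2: convert each element explicitly before building the array
  val asStrings: Array[String] = cells.map {
    case null  => ""            // assumed default for missing values
    case other => other.toString
  }.toArray
}
```

Building the Array[String] directly, as the updated splittedRow does, avoids the round trip through Row altogether.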