
Spark dataframe: replicate rows based on splitting column values, in Scala

Tags: scala, apache-spark, apache-spark-sql, databricks

I have the following code in Scala:

    val fullCertificateSourceDf = certificateSourceDf
      .withColumn("Stage",
        when(col("Data.WorkBreakdownUp1Summary").isNotNull && col("Data.WorkBreakdownUp1Summary") =!= "",
          rtrim(regexp_extract($"Data.WorkBreakdownUp1Summary", "^.*?(?= - *[a-zA-Z])", 0))).otherwise(""))
      .withColumn("SubSystem",
        when(col("Data.ProcessBreakdownSummaryList").isNotNull && col("Data.ProcessBreakdownSummaryList") =!= "",
          regexp_extract($"Data.ProcessBreakdownSummaryList", "^.*?(?= - *[a-zA-Z])", 0)).otherwise(""))
      .withColumn("System",
        when(col("Data.ProcessBreakdownUp1SummaryList").isNotNull && col("Data.ProcessBreakdownUp1SummaryList") =!= "",
          regexp_extract($"Data.ProcessBreakdownUp1SummaryList", "^.*?(?= - *[a-zA-Z])", 0)).otherwise(""))
      .withColumn("Facility",
        when(col("Data.ProcessBreakdownUp2Summary").isNotNull && col("Data.ProcessBreakdownUp2Summary") =!= "",
          regexp_extract($"Data.ProcessBreakdownUp2Summary", "^.*?(?= - *[a-zA-Z])", 0)).otherwise(""))
      .withColumn("Area",
        when(col("Data.ProcessBreakdownUp3Summary").isNotNull && col("Data.ProcessBreakdownUp3Summary") =!= "",
          regexp_extract($"Data.ProcessBreakdownUp3Summary", "^.*?(?= - *[a-zA-Z])", 0)).otherwise(""))
      .select("Data.ID",
              "Data.CertificateID",
              "Data.CertificateTag",
              "Data.CertificateDescription",
              "Data.WorkBreakdownUp1Summary",
              "Data.ProcessBreakdownSummaryList",
              "Data.ProcessBreakdownUp1SummaryList",
              "Data.ProcessBreakdownUp2Summary",
              "Data.ProcessBreakdownUp3Summary",
              "Data.ActualStartDate",
              "Data.ActualEndDate",
              "Data.ApprovedDate",
              "Data.CurrentState",
              "DataType",
              "PullDate",
              "PullTime",
              "Stage",
              "System",
              "SubSystem",
              "Facility",
              "Area")
      .filter(col("Stage").isNotNull && length(col("Stage")) > 0)
      .filter((col("SubSystem").isNotNull && length(col("SubSystem")) > 0) ||
              (col("System").isNotNull && length(col("System")) > 0) ||
              (col("Facility").isNotNull && length(col("Facility")) > 0) ||
              (col("Area").isNotNull && length(col("Area")) > 0))
This dataframe, fullCertificateSourceDf, contains the following data (shown as a screenshot in the original post; some columns are hidden for brevity).

I want the data to look like this (also shown as a screenshot in the original post):

We are splitting on two columns: ProcessBreakdownSummaryList and ProcessBreakdownUp1SummaryList. Both are comma-separated lists.

Please note that when the values correspond between ProcessBreakdownSummaryList (CS10-100-22-10 - Mine Intake Fan Heater System, CS10-100-81-10 - Mine Services Switchgear) and ProcessBreakdownUp1SummaryList (CS10-100-22 - Service Shaft Ventilation, CS10-100-81 - Service Shaft Electrical), the row should only be split once.

However, when they differ, as between ProcessBreakdownSummaryList (CS10-100-22-10 - Mine Intake Fan Heater System, CS10-100-81-10 - Mine Services Switchgear) and ProcessBreakdownUp1SummaryList (CS10-100-22 - Service Shaft Ventilation, CS10-100-34 - Service Shaft Electrical), the row should be split again into a third row.


Thank you in advance for your help with this.
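For readers without the screenshots, here is a minimal sketch of the row duplication being asked about, using split and explode on a single comma-separated column (hypothetical toy data, not the poster's schema; the real requirement pairs two such columns, which the answer below addresses):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    // One certificate with a comma-separated list of systems...
    val toy = Seq(
      ("C1", "CS10-100-22 - Service Shaft Ventilation, CS10-100-81 - Service Shaft Electrical")
    ).toDF("CertificateID", "ProcessBreakdownUp1SummaryList")

    // ...becomes one output row per list entry (splitting on comma plus optional whitespace).
    toy.withColumn("System", explode(split(col("ProcessBreakdownUp1SummaryList"), ",\\s*")))
       .show(false)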

You can solve this in multiple ways; for complex processing like this I think the simplest is to use Scala. Read all the columns, including "ProcessBreakdownSummaryList" and "ProcessBreakdownUp1SummaryList", compare their values for equality, and emit multiple rows for a single input row. Then flatMap the output to get a dataframe containing all the required rows:

    val fullCertificateSourceDf = // your code

    fullCertificateSourceDf.map { row =>
      // note: after the select above, the columns no longer carry the "Data." prefix
      val id = row.getAs[String]("ID")
      // ... read all the other columns the same way

      val processBreakdownSummaryList = row.getAs[String]("ProcessBreakdownSummaryList")
      val processBreakdownUp1SummaryList = row.getAs[String]("ProcessBreakdownUp1SummaryList")

      // split processBreakdownSummaryList on ","
      // split processBreakdownUp1SummaryList on ","
      // compare them for equality
      // let's say you end up with 4 rows

      // return those 4 rows as a List of tuples of strings, e.g.
      // List((id, certificateId, certificateTag, ..distinct values of processBreakdownUp1SummaryList..), (...), ...)
      // all columns (id, certificateId, certificateTag, etc.) are repeated for each distinct
      // value of processBreakdownUp1SummaryList and processBreakdownSummaryList

    }.flatMap(identity(_)).toDF("column1", "column2", ...)
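To make that sketch concrete, here is a minimal runnable version under stated assumptions: a flattened three-column schema standing in for the real (nested) dataframe, and positional pairing of the two de-duplicated lists (the question's exact matching rule may differ):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical flat schema standing in for the real dataframe.
    val certs = Seq(
      ("CERT1",
       "CS10-100-22-10 - Mine Intake Fan Heater System, CS10-100-81-10 - Mine Services Switchgear",
       "CS10-100-22 - Service Shaft Ventilation, CS10-100-81 - Service Shaft Electrical")
    ).toDF("CertificateID", "ProcessBreakdownSummaryList", "ProcessBreakdownUp1SummaryList")

    val exploded = certs.flatMap { row =>
      val id         = row.getAs[String]("CertificateID")
      // Split both comma-separated lists and de-duplicate within each.
      val subSystems = row.getAs[String]("ProcessBreakdownSummaryList").split(",").map(_.trim).distinct
      val systems    = row.getAs[String]("ProcessBreakdownUp1SummaryList").split(",").map(_.trim).distinct
      // Assumption: pair the lists positionally; pad the shorter one with "".
      val n = math.max(subSystems.length, systems.length)
      (0 until n).map { i =>
        (id, subSystems.lift(i).getOrElse(""), systems.lift(i).getOrElse(""))
      }
    }.toDF("CertificateID", "SubSystem", "System")

    exploded.show(false)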

Here is an example of splitting one row into multiple rows:

    import spark.implicits._   // encoders for the tuples produced inside map/flatMap

    val employees = spark.createDataFrame(Seq(
      ("E1", 100.0, "a,b"),
      ("E2", 200.0, "e,f"),
      ("E3", 300.0, "c,d")
    )).toDF("employee", "salary", "clubs")

    employees.map { r =>
      // split the comma-separated clubs column...
      val clubs = r.getAs[String]("clubs").split(",")
      // ...and emit one (employee, salary, club) tuple per club
      for {
        c <- clubs
      } yield (r.getAs[String]("employee"), r.getAs[Double]("salary"), c)
    }.flatMap(identity(_)).toDF("employee", "salary", "clubs").show(false)

This prints:

    +--------+------+-----+
    |employee|salary|clubs|
    +--------+------+-----+
    |E1      |100.0 |a    |
    |E1      |100.0 |b    |
    |E2      |200.0 |e    |
    |E2      |200.0 |f    |
    |E3      |300.0 |c    |
    |E3      |300.0 |d    |
    +--------+------+-----+

Comments:

Thank you, Salim. An example would be much appreciated.

Please see the example provided.

This is great, @Salim. How would you handle this (with a null)? `val employees = spark.createDataFrame(Seq(("E1",100.0,"a,b"), ("E2",200.0,"e,f"), ("E3",300.0,"c,d"), ("E4",300.0,null))).toDF("employee","salary","clubs")`

Please avoid null in numeric columns; Scala doesn't recognize null for Long and similar primitive types. Use Option or a default value for numerics instead. For strings you can use null or a default value. I can write up an example if you ask another question or upvote this one.
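Finally, a sketch of the null-safe variant raised in the comments, using Option as the answer suggests (the skip-the-row behaviour here is an assumption, not part of the original answer):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    val employeesWithNull = spark.createDataFrame(Seq(
      ("E1", 100.0, "a,b"),
      ("E4", 300.0, null.asInstanceOf[String])   // nullable clubs column
    )).toDF("employee", "salary", "clubs")

    employeesWithNull.flatMap { r =>
      // Option(...) turns a null clubs value into None, so the E4 row emits nothing;
      // use Option(...).getOrElse("") instead if a single empty-club row should be kept.
      val clubs = Option(r.getAs[String]("clubs")).toSeq.flatMap(_.split(","))
      clubs.map(c => (r.getAs[String]("employee"), r.getAs[Double]("salary"), c))
    }.toDF("employee", "salary", "clubs").show(false)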