Scala: join two DataFrames in order and remove duplicate rows, handling delete actions


I have two DataFrames created from two text files. Here is df1:

+--------------------+----------+------+---------+--------+------+
|EntryDate           |OgId      |ItemId|segmentId|Sequence|Action|
+--------------------+----------+------+---------+--------+------+
|2017-06-07T09:04:…  |4295877341|136   |4        |1       |I     |
|2017-06-07T09:04:…  |4295877346|136   |4        |1       |I     |
|2017-06-07T09:04:…  |4295877341|138   |2        |1       |I     |
|2017-06-07T09:04:…  |4295877341|141   |4        |1       |I     |
|2017-06-07T09:04:…  |4295877341|143   |2        |1       |I     |
|2017-06-07T09:04:…  |4295877341|145   |14       |1       |I     |
|2017-06-07T09:04:…  |123456789 |145   |14       |1       |I     |
+--------------------+----------+------+---------+--------+------+
And here is df2:

+--------------------+----------+------+---------+--------+------+
|EntryDate           |OgId      |ItemId|segmentId|Sequence|Action|
+--------------------+----------+------+---------+--------+------+
|2017-06-07T09:04:…  |4295877341|136   |4        |1       |I     |
|2017-06-07T09:05:…  |4295877341|136   |5        |2       |I     |
|2017-06-07T09:06:…  |4295877341|138   |4        |5       |I     |
|2017-06-07T09:07:…  |4295877341|141   |9        |1       |I     |
|2017-06-07T09:08:…  |4295877341|143   |null     |2       |I     |
|2017-06-07T09:09:…  |4295877343|149   |14       |2       |I     |
|2017-06-07T09:10:…  |123456789 |145   |14       |1       |D     |
+--------------------+----------+------+---------+--------+------+
Now I have to join these two DataFrames in such a way that the final DataFrame contains only unique records.

Also, if df2 has a null value in any column, the corresponding value from df1 should appear in the final output.

Here the Action flag 'U' means update and 'D' means delete.

My final output should look like this:

+----------+------+---------+--------+------+
|OgId      |ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341|136   |5        |2       |I     |
|4295877346|136   |4        |1       |I     |
|4295877341|138   |4        |5       |I     |
|4295877341|141   |9        |1       |I     |
|4295877341|143   |2        |2       |I     |
|4295877341|145   |14       |1       |I     |
|4295877343|149   |14       |2       |I     |
+----------+------+---------+--------+------+
The primary key for both DataFrames is OgId + ItemId.

Below is what I got from one of the answers:

val tempdf = df2.select("OgId").withColumnRenamed("OgId", "OgId_1")

df1 = df1.join(tempdf, df1("OgId") === tempdf("OgId_1"), "left")
df1 = df1.filter("OgId_1 is null").drop("OgId_1")
df1 = df1.unionAll(df2).distinct()
df1.show()
But I want to update df1 with df2 in the order of EntryDate.

For example, 4295877341 | 136 has two updates, so it should be updated from df2 in the same order in which the data appears in df2.

This matters because some rows are first updated and later deleted. If the delete were applied first, the update would fail because it would not find the row to update.

Finally, if the Action is 'D', the row should be deleted from df1, and again this should happen in the correct order.

I hope my question is clear.

Updating with the suggested answer's code:

package sparkSql

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.{ Row, SQLContext }
import org.apache.spark.sql.types.{ StructType, StructField, StringType }
import org.apache.spark.sql.functions._

object PcfpDiff {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]") // Creating spark configuration
    // val conf = new SparkConf().setAppName("WordCount")
    conf.set("spark.shuffle.blockTransferService", "nio")
    val sc = new SparkContext(conf) // Creating spark context
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val schema = StructType(Array(
      StructField("OrgId", StringType),
      StructField("ItemId", StringType),
      StructField("segmentId", StringType),
      StructField("Sequence", StringType),
      StructField("Action", StringType)))

    val textRdd1 = sc.textFile("/home/cloudera/TRF/pcfp/Text1.txt")
    val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split("\\|\\^\\|", -1)))
    var df1 = sqlContext.createDataFrame(rowRdd1, schema)

    val textRdd2 = sc.textFile("/home/cloudera/TRF/pcfp/Text2.txt")
    val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split("\\|\\^\\|", -1)))
    var df2 = sqlContext.createDataFrame(rowRdd2, schema)

    val tempdf2 = df2.withColumnRenamed("segmentId", "segmentId_1")
      .withColumnRenamed("Sequence", "Sequence_1")
      .withColumnRenamed("Action", "Action_1")

    df1.join(tempdf2, Seq("OrgId", "ItemId"), "outer")
      .select($"OrgId", $"ItemId",
        when($"segmentId_1".isNotNull, $"segmentId_1").otherwise($"segmentId").as("segmentId"),
        when($"Sequence_1".isNotNull, $"Sequence_1").otherwise($"Sequence").as("Sequence"),
        when($"Action_1".isNotNull, $"Action_1").otherwise($"Action").as("Action"))

    df1.show()

  }
}
And I am getting the output below... segmentId and Sequence are not updated.

+----------+------+---------+--------+------+
|OrgId     |ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341|136   |4        |1       |I     |
|4295877346|136   |4        |1       |I     |
|4295877341|138   |2        |1       |I     |
|4295877341|141   |4        |1       |I     |
|4295877341|143   |2        |1       |I     |
|4295877341|145   |14       |1       |I     |
|123456789 |145   |14       |1       |I     |
+----------+------+---------+--------+------+
Dataset 1:

4295877341|^|136|^|4|^|1|^|I
4295877346|^|136|^|4|^|1|^|I
4295877341|^|138|^|2|^|1|^|I
4295877341|^|141|^|4|^|1|^|I
4295877341|^|143|^|2|^|1|^|I
4295877341|^|145|^|14|^|1|^|I
123456789|^|145|^|14|^|1|^|I
Dataset 2:

4295877341|^|136|^|4|^|1|^|I
4295877341|^|136|^|5|^|2|^|I
4295877341|^|138|^|4|^|5|^|I
4295877341|^|141|^|9|^|1|^|I
4295877341|^|143|^|null|^|2|^|I
4295877343|^|149|^|14|^|2|^|I
123456789|^|145|^|14|^|1|^|D
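
For reference, here is a quick check of how one of these lines is parsed, assuming the fields are separated by the |^| delimiter that the code above splits on (the sample line is illustrative):

// Splitting one sample line on the |^| delimiter, regex-escaped exactly as
// in the program above; it yields five values, one per schema field.
val sample = "4295877341|^|136|^|4|^|1|^|I"
val fields = sample.split("\\|\\^\\|", -1)
// fields: Array(4295877341, 136, 4, 1, I)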

Here is the working solution I came up with for you:

// Rows of df2 that are not already present in df1, with the non-key columns
// renamed so they can sit next to the df1 columns after the join.
val tempdf2 = df2.except(df1).withColumnRenamed("segmentId", "segmentId_1")
  .withColumnRenamed("Sequence", "Sequence_1")
  .withColumnRenamed("Action", "Action_1")

// Outer join on the primary key (OrgId + ItemId); for each non-key column,
// take the df2 value when it is present and not the literal string "null",
// otherwise fall back to the df1 value. Finally drop the rows whose
// resulting Action is a delete ('D').
val df3 = df1.join(tempdf2, Seq("OrgId", "ItemId"), "outer")
  .select($"OrgId", $"ItemId",
    when($"segmentId_1" =!= "null", $"segmentId_1").otherwise($"segmentId").as("segmentId"),
    when($"Sequence_1" =!= "null", $"Sequence_1").otherwise($"Sequence").as("Sequence"),
    when($"Action_1" =!= "null", $"Action_1").otherwise($"Action").as("Action"))
  .filter(!$"Action".contains("D"))
df3.show()
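
A note on the null handling: the =!= "null" comparison is there because the values arrive as strings from the text files, so a missing value shows up as the literal string "null". If the columns held real SQL nulls instead, the same fallback could be written with coalesce from org.apache.spark.sql.functions, for example (df3Alt is just an illustrative name):

import org.apache.spark.sql.functions.coalesce

// Same join as above, but the fallback to the df1 value is expressed with
// coalesce -- this only behaves the same way when the missing values are
// real SQL nulls rather than the literal string "null".
val df3Alt = df1.join(tempdf2, Seq("OrgId", "ItemId"), "outer")
  .select($"OrgId", $"ItemId",
    coalesce($"segmentId_1", $"segmentId").as("segmentId"),
    coalesce($"Sequence_1", $"Sequence").as("Sequence"),
    coalesce($"Action_1", $"Action").as("Action"))
  .filter(!$"Action".contains("D"))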
I hope the answer is helpful; if not, you can take the idea and modify it according to your needs.
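
If you also need the EntryDate-ordered behaviour described in the question, here is a minimal sketch of one way to do it, assuming EntryDate is kept as a column in both DataFrames, both share the same schema, and you are on the Spark 2.x API. It keeps only the newest row per key and then drops keys whose latest action is a delete; the column-level fallback to df1 values would still need the when/otherwise (or coalesce) logic above.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{ col, row_number }

// Stack the current rows and the incoming changes, then keep only the
// newest row per (OrgId, ItemId) according to EntryDate.
val combined = df1.union(df2)

val newestFirst = Window
  .partitionBy("OrgId", "ItemId")
  .orderBy(col("EntryDate").desc)

val merged = combined
  .withColumn("rn", row_number().over(newestFirst))
  .filter(col("rn") === 1)          // latest version of each key wins
  .filter(col("Action") =!= "D")    // keys whose last action is a delete are dropped
  .drop("rn", "EntryDate")

merged.show()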