How do I split a dataset into two datasets, one with unique rows and one with duplicate rows, in Spark Scala?

Tags: scala, apache-spark, apache-spark-sql

I want to fetch the duplicate records in a Spark Scala dataframe. For example, I want to fetch the duplicate values based on 3 columns such as "id", "name", "age". The condition part does not hard-code any columns (the input is dynamic). Based on the column values, I want to fetch the duplicate records.

I have tried the code below, but only with a single attribute. I don't know what to do when there is more than one column.

My code:

 var s= "age|id|name " // Note- This is dynamic input. so it will increase or decrease
 var columnNames= s.replace('|', ',')


val findDuplicateRecordsDF = spark.sql("SELECT * FROM " + dbname + "." + tablename)
findDuplicateRecordsDF.show()

findDuplicateRecordsDF
      .withColumn("count", count("*").over(Window.partitionBy($"id"))) // how do I partition by more than one column here? (dynamic input)
      .where($"count" > 1)
      .show()
Input dataframe (findDuplicateRecordsDF.show()):

           --------------------------------------------------------
           |  id   |  name | age |  phone      | email_id          |
           |-------------------------------------------------------|
           |  3    | sam   | 23  |  9876543210 | sam@yahoo.com     |
           |  7    | ram   | 27  |  8765432190 | ram@gmail.com     |
           |  3    | sam   | 28  |  9876543210 | sam@yahoo.com     |
           |  6    | haris | 30  |  6543210777 | haris@gmail.com   |
           |  9    | ram   | 27  |  8765432130 | ram94@gmail.com   |
           |  6    | haris | 24  |  6543210777 | haris@gmail.com   |
           |  4    | karthi| 26  |  4321066666 | karthi@gmail.com  |
            --------------------------------------------------------

Here I am fetching the duplicate records based on 4 columns (id, name, phone, email). The above is only a sample dataframe; the real dataframe does not have a fixed set of columns (dynamic input).

The output dataframes should be:

  • Duplicate records output:

           --------------------------------------------------------
           |  id   |  name | age |  phone      | email_id          |
           |-------------------------------------------------------|  
           |  3    | sam   | 23  |  9876543210 | sam@yahoo.com     | 
           |  3    | sam   | 28  |  9876543210 | sam@yahoo.com     | 
           |  6    | haris | 30  |  6543210777 | haris@gmail.com   |
           |  6    | haris | 24  |  6543210777 | haris@gmail.com   | 
            --------------------------------------------------------
    
  • Unique records dataframe output:

          --------------------------------------------------------
         |  id   |  name | age |  phone      | email_id          |
         |-------------------------------------------------------|  
         |  7    | ram   | 27  |  8765432190 | ram@gmail.com     |
         |  9    | ram   | 27  |  8765432130 | ram94@gmail.com   |
         |  4    | karthi| 26  |  4321066666 | karthi@gmail.com  | 
          --------------------------------------------------------
    

Thanks in advance.

    You need to pass the column names, comma separated, to partitionBy; col1, col2, ... should be of String type:

        val window = Window.partitionBy(col1, col2, ...)

        findDuplicateRecordsDF
          .withColumn("count", count("*").over(window))
          .where($"count" > 1)
          .show()
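
    For the dynamic case in the question, here is a minimal sketch (assuming the `findDuplicateRecordsDF` from the question is in scope) that splits the pipe-delimited input into column names and expands them into a multi-column partitionBy via varargs:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count}

    // The question's dynamic pipe-delimited input, split into individual column names.
    val keyCols = "age|id|name".split('|').map(_.trim)

    // Window.partitionBy accepts Column varargs, so a dynamic list can be expanded with `: _*`.
    val window = Window.partitionBy(keyCols.map(col): _*)

    findDuplicateRecordsDF
      .withColumn("count", count("*").over(window))
      .where(col("count") > 1)
      .show()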
    

    You can use window functions. Check this out:

    scala> val df = Seq((3,"sam",23,"9876543210","sam@yahoo.com"),(7,"ram",27,"8765432190","ram@gmail.com"),(3,"sam",28,"9876543210","sam@yahoo.com"),(6,"haris",30,"6543210777","haris@gmail.com"),(9,"ram",27,"8765432130","ram94@gmail.com"),(6,"haris",24,"6543210777","haris@gmail.com"),(4,"karthi",26,"4321066666","karthi@gmail.com")).toDF("id","name","age","phone","email_id")
    df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 3 more fields]
    
    scala> val dup_cols = List("id","name","phone","email_id");
    dup_cols: List[String] = List(id, name, phone, email_id)
    
    scala> df.createOrReplaceTempView("contact")
    
    scala> val dup_cols_qry = dup_cols.mkString(" count(*) over(partition by ", "," , " ) as cnt ")
    dup_cols_qry: String = " count(*) over(partition by id,name,phone,email_id ) as cnt "
    
    scala> val df2 = spark.sql("select *,"+ dup_cols_qry + " from contact ")
    df2: org.apache.spark.sql.DataFrame = [id: int, name: string ... 4 more fields]
    
    scala> df2.show(false)
    +---+------+---+----------+----------------+---+
    |id |name  |age|phone     |email_id        |cnt|
    +---+------+---+----------+----------------+---+
    |4  |karthi|26 |4321066666|karthi@gmail.com|1  |
    |7  |ram   |27 |8765432190|ram@gmail.com   |1  |
    |9  |ram   |27 |8765432130|ram94@gmail.com |1  |
    |3  |sam   |23 |9876543210|sam@yahoo.com   |2  |
    |3  |sam   |28 |9876543210|sam@yahoo.com   |2  |
    |6  |haris |30 |6543210777|haris@gmail.com |2  |
    |6  |haris |24 |6543210777|haris@gmail.com |2  |
    +---+------+---+----------+----------------+---+
    
    
    scala> df2.createOrReplaceTempView("contact2")
    
    // duplicates

    scala>  spark.sql("select " + dup_cols.mkString(",") + " from contact2 where cnt = 2").show
    +---+-----+----------+---------------+
    | id| name|     phone|       email_id|
    +---+-----+----------+---------------+
    |  3|  sam|9876543210|  sam@yahoo.com|
    |  3|  sam|9876543210|  sam@yahoo.com|
    |  6|haris|6543210777|haris@gmail.com|
    |  6|haris|6543210777|haris@gmail.com|
    +---+-----+----------+---------------+
    
    // unique

    scala>  spark.sql("select " + dup_cols.mkString(",") + " from contact2 where cnt = 1").show
    +---+------+----------+----------------+
    | id|  name|     phone|        email_id|
    +---+------+----------+----------------+
    |  4|karthi|4321066666|karthi@gmail.com|
    |  7|   ram|8765432190|   ram@gmail.com|
    |  9|   ram|8765432130| ram94@gmail.com|
    +---+------+----------+----------------+
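
    The dup_cols list above is hard-coded; with the question's dynamic pipe-delimited input it could instead be derived before building the query string. A sketch (the value of s is the question's example input):

    val s = "age|id|name"
    val dup_cols = s.split('|').map(_.trim).toList
    val dup_cols_qry = dup_cols.mkString(" count(*) over(partition by ", ",", " ) as cnt ")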
    
    EDIT2:

    val df = Seq(
      (4,"karthi",26,"4321066666","karthi@gmail.com"),
      (6,"haris",24,"6543210777","haris@gmail.com"),
      (7,"ram",27,"8765432190","ram@gmail.com"),
      (9,"ram",27,"8765432190","ram@gmail.com"),
      (6,"haris",24,"6543210777","haris@gmail.com"),
      (3,"sam",23,"9876543210","sam@yahoo.com"),
      (3,"sam",23,"9876543210","sam@yahoo.com"),
      (3,"sam",28,"9876543210","sam@yahoo.com"),
      (6,"haris",30,"6543210777","haris@gmail.com")
      ).toDF("id","name","age","phone","email_id")
    
    val dup_cols = List("name","phone","email_id")
    val dup_cols_str = dup_cols.mkString(",")
    df.createOrReplaceTempView("contact")
    val dup_cols_count_qry = " count(*) over(partition by " + dup_cols_str + " ) as cnt "
    val dup_cols_row_num_qry = " row_number() over(partition by " + dup_cols_str + " order by " + dup_cols_str + " ) as rwn "
    val df2 = spark.sql("select *,"+ dup_cols_count_qry + "," + dup_cols_row_num_qry + " from contact ")
    df2.show(false)
    df2.createOrReplaceTempView("contact2")
    spark.sql("select id, " + dup_cols_str + " from contact2 where cnt > 1 and rwn > 1").show
    
    Result:

    +---+-----+----------+---------------+
    | id| name|     phone|       email_id|
    +---+-----+----------+---------------+
    |  6|haris|6543210777|haris@gmail.com|
    |  6|haris|6543210777|haris@gmail.com|
    |  3|  sam|9876543210|  sam@yahoo.com|
    |  3|  sam|9876543210|  sam@yahoo.com|
    |  9|  ram|8765432190|  ram@gmail.com|
    +---+-----+----------+---------------+
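
    Note that the query above returns only the surplus copies (rwn > 1). Depending on what "unique" should mean here, the complementary dataset can be pulled from the same contact2 view, for example:

    // keys that occur exactly once
    spark.sql("select id, " + dup_cols_str + " from contact2 where cnt = 1").show
    // or: one representative row per key (a deduplicated view)
    spark.sql("select id, " + dup_cols_str + " from contact2 where rwn = 1").show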
    
    
    EDIT3: null check on the key columns

    val df = Seq(
      (4,"karthi",26,"4321066666","karthi@gmail.com"),
      (6,"haris",30,"6543210777","haris@gmail.com"),
      (6,"haris",30,null,"haris@gmail.com"),
      (7,"ram",27,"8765432190","ram@gmail.com"),
      (9,"ram",27,"8765432190","ram@gmail.com"),
      (6,"haris",24,"6543210777","haris@gmail.com"),
      (6,null,24,"6543210777",null),
      (3,"sam",23,"9876543210","sam@yahoo.com"),
      (3,"sam",23,"9876543210","sam@yahoo.com"),
      (3,"sam",28,"9876543210","sam@yahoo.com"),
      (6,"haris",24,"6543210777","haris@gmail.com")
    ).toDF("id","name","age","phone","email_id")
    
    val all_cols = df.columns
    val dup_cols = List("name","phone","email_id")
    val rem_cols = all_cols.diff(dup_cols)
    val dup_cols_str = dup_cols.mkString(",")
    val rem_cols_str = rem_cols.mkString(",")
    val dup_cols_length = dup_cols.length
    val df_null_col = dup_cols.map( x => when(col(x).isNull,0).otherwise(1)).reduce( _ + _ )
    val df_null = df.withColumn("null_count", df_null_col)
    df_null.createOrReplaceTempView("contact")
    df_null.show(false)
    
    val dup_cols_count_qry = " count(*) over(partition by " + dup_cols_str + " ) as cnt "
    val dup_cols_row_num_qry = " row_number() over(partition by " + dup_cols_str + " order by " + dup_cols_str + " ) as rwn "
    val df2 = spark.sql("select *,"+ dup_cols_count_qry + "," + dup_cols_row_num_qry + " from contact " + " where null_count  = " + dup_cols_length )
    df2.show(false)
    df2.createOrReplaceTempView("contact2")
    val df3 = spark.sql("select " +  dup_cols_str +  ", " + rem_cols_str + " from contact2 where cnt > 1 and rwn > 1")
    df3.show(false)
    
    Result:

    
    +---+------+---+----------+----------------+----------+
    |id |name  |age|phone     |email_id        |null_count|
    +---+------+---+----------+----------------+----------+
    |4  |karthi|26 |4321066666|karthi@gmail.com|3         |
    |6  |haris |30 |6543210777|haris@gmail.com |3         |
    |6  |haris |30 |null      |haris@gmail.com |2         |
    |7  |ram   |27 |8765432190|ram@gmail.com   |3         |
    |9  |ram   |27 |8765432190|ram@gmail.com   |3         |
    |6  |haris |24 |6543210777|haris@gmail.com |3         |
    |6  |null  |24 |6543210777|null            |1         |
    |3  |sam   |23 |9876543210|sam@yahoo.com   |3         |
    |3  |sam   |23 |9876543210|sam@yahoo.com   |3         |
    |3  |sam   |28 |9876543210|sam@yahoo.com   |3         |
    |6  |haris |24 |6543210777|haris@gmail.com |3         |
    +---+------+---+----------+----------------+----------+
    
    
    +---+------+---+----------+----------------+----------+---+---+
    |id |name  |age|phone     |email_id        |null_count|cnt|rwn|
    +---+------+---+----------+----------------+----------+---+---+
    |6  |haris |30 |6543210777|haris@gmail.com |3         |3  |1  |
    |6  |haris |24 |6543210777|haris@gmail.com |3         |3  |2  |
    |6  |haris |24 |6543210777|haris@gmail.com |3         |3  |3  |
    |3  |sam   |23 |9876543210|sam@yahoo.com   |3         |3  |1  |
    |3  |sam   |23 |9876543210|sam@yahoo.com   |3         |3  |2  |
    |3  |sam   |28 |9876543210|sam@yahoo.com   |3         |3  |3  |
    |7  |ram   |27 |8765432190|ram@gmail.com   |3         |2  |1  |
    |9  |ram   |27 |8765432190|ram@gmail.com   |3         |2  |2  |
    |4  |karthi|26 |4321066666|karthi@gmail.com|3         |1  |1  |
    +---+------+---+----------+----------------+----------+---+---+
    
    +-----+----------+---------------+---+---+
    |name |phone     |email_id       |id |age|
    +-----+----------+---------------+---+---+
    |haris|6543210777|haris@gmail.com|6  |24 |
    |haris|6543210777|haris@gmail.com|6  |24 |
    |sam  |9876543210|sam@yahoo.com  |3  |23 |
    |sam  |9876543210|sam@yahoo.com  |3  |28 |
    |ram  |8765432190|ram@gmail.com  |9  |27 |
    +-----+----------+---------------+---+---+
    
    Blank check (empty or whitespace-only values):

    val df_null_col = dup_cols.map(x => when(col(x).isNull or regexp_replace(col(x), """^\s*$""", "") === lit(""), 0).otherwise(1)).reduce(_ + _)
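
    As a quick self-contained check of that expression in spark-shell (the two sample rows are illustrative):

    import spark.implicits._
    import org.apache.spark.sql.functions.{col, lit, when, regexp_replace}

    val dup_cols = List("name", "phone", "email_id")
    // Sum of per-column 0/1 flags: 0 when the column is null or whitespace-only, else 1.
    val df_null_col = dup_cols
      .map(x => when(col(x).isNull or regexp_replace(col(x), """^\s*$""", "") === lit(""), 0).otherwise(1))
      .reduce(_ + _)

    Seq(("sam", " ", null: String), ("ram", "123", "ram@gmail.com"))
      .toDF("name", "phone", "email_id")
      .withColumn("null_count", df_null_col)
      .show()   // expect null_count = 1 for the first row and 3 for the second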
    
    Filter a row out only when all 3 key columns are blank or null:

    val df = Seq(
      (4,"karthi",26,"4321066666","karthi@gmail.com"),
      (6,"haris",30,"6543210777","haris@gmail.com"),
      (6,null,30,null,null),
      (7,"ram",27,"8765432190","ram@gmail.com"),
      (9,"",27,"",""),
      (7,"ram",27,"8765432190","ram@gmail.com"),
      (6,"haris",24,"6543210777","haris@gmail.com"),
      (6,null,24,"6543210777",null),
      (3,"sam",23,"9876543210","sam@yahoo.com"),
      (3,null,23,"9876543210","sam@yahoo.com"),
      (3,null,28,"9876543213",null),
      (6,"haris",24,null,"haris@gmail.com")
    ).toDF("id","name","age","phone","email_id")
    
    val all_cols = df.columns
    val dup_cols = List("name","phone","email_id")
    val rem_cols = all_cols.diff(dup_cols)
    val dup_cols_str = dup_cols.mkString(",")
    val rem_cols_str = rem_cols.mkString(",")
    val dup_cols_length = dup_cols.length
    //val df_null_col = dup_cols.map( x => when(col(x).isNull,0).otherwise(1)).reduce( _ + _ )
    val df_null_col = dup_cols.map( x => when(col(x).isNull or regexp_replace(col(x),lit("""^\s*$"""),lit("")) === lit(""),0).otherwise(1)).reduce( _ + _ )
    val df_null = df.withColumn("null_count", df_null_col)
    df_null.createOrReplaceTempView("contact")
    df_null.show(false)
    
    val dup_cols_count_qry = " count(*) over(partition by " + dup_cols_str + " ) as cnt "
    val dup_cols_row_num_qry = " row_number() over(partition by " + dup_cols_str + " order by " + dup_cols_str + " ) as rwn "
    //val df2 = spark.sql("select *,"+ dup_cols_count_qry + "," + dup_cols_row_num_qry + " from contact " + " where null_count  = " + dup_cols_length )
    val df2 = spark.sql("select *,"+ dup_cols_count_qry + "," + dup_cols_row_num_qry + " from contact " + " where null_count  !=  0 ")
    df2.show(false)
    df2.createOrReplaceTempView("contact2")
    val df3 = spark.sql("select " +  dup_cols_str +  ", " + rem_cols_str + " from contact2 where cnt > 1 and rwn > 1")
    df3.show(false)
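
    Note the difference between the two df2 filters: the commented-out `null_count = dup_cols_length` keeps only rows where every key column is non-blank, whereas `null_count != 0` keeps any row with at least one non-blank key column; pick whichever strictness the duplicate check should have.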
    

    You can specify a comma-separated list of columns in partitionBy().

    The input contains n columns, and their values come in dynamically at spark-submit time. When I run `select *, count(*) over(partition by [condition:string]) as cnt from contact` I get a ParseException at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114).

    It looks like dup_cols_qry is an empty string; check it again.

    Yes, now it works. Thanks, stack0114106.
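
    Pulling the pieces together, a hedged end-to-end sketch (the helper name is illustrative, not from the answers above) that returns both datasets for a dynamic column list:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count}

    // Illustrative helper: split df into (duplicate rows, unique rows) on the given key columns.
    // Assumes df has no existing "cnt" column.
    def splitByDuplicates(df: DataFrame, keyCols: Seq[String]): (DataFrame, DataFrame) = {
      val w = Window.partitionBy(keyCols.map(col): _*)
      val counted = df.withColumn("cnt", count("*").over(w))
      (counted.filter(col("cnt") > 1).drop("cnt"),
       counted.filter(col("cnt") === 1).drop("cnt"))
    }

    // Usage with the question's pipe-delimited dynamic input:
    val (dups, uniques) = splitByDuplicates(findDuplicateRecordsDF, "age|id|name".split('|').toSeq)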