Scala: remove duplicate values within columns
Can I eliminate the repeated values in Column_3 and Column_4?
+--------+--------+--------+--------+
|Column_1|Column_2|Column_3|Column_4|
+--------+--------+--------+--------+
| 1| x| abc| www|
| 1| x| abc| sdf|
| 1| x| abc| xyz|
| 1| x| def| www|
| 1| x| def| sdf|
| 1| x| def| xyz|
+--------+--------+--------+--------+
Expected output:
+--------+--------+--------+--------+
|Column_1|Column_2|Column_3|Column_4|
+--------+--------+--------+--------+
| 1| x| abc| www|
| 1| x| def| sdf|
| 1| x| null| xyz|
+--------+--------+--------+--------+
Use df.dropDuplicates("Column_3", "Column_4")
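A minimal sketch of what that call does on the sample data above (the object name and local master are assumptions for a runnable example). Note that every (Column_3, Column_4) pair in the sample is already distinct, so deduplicating on the pair keeps all six rows; it does not by itself produce the expected three-row output.

```scala
import org.apache.spark.sql.SparkSession

object DropDupsSketch extends App {
  val spark = SparkSession.builder()
    .appName("drop-duplicates-sketch")
    .master("local[*]") // assumption: local run for illustration
    .getOrCreate()
  import spark.implicits._

  val df = Seq(
    (1, "x", "abc", "www"),
    (1, "x", "abc", "sdf"),
    (1, "x", "abc", "xyz"),
    (1, "x", "def", "www"),
    (1, "x", "def", "sdf"),
    (1, "x", "def", "xyz")
  ).toDF("Column_1", "Column_2", "Column_3", "Column_4")

  // Deduplicate on the *pair* (Column_3, Column_4): all six pairs are
  // distinct here, so all six rows survive.
  df.dropDuplicates("Column_3", "Column_4").show()

  // Deduplicate on Column_3 alone: keeps one arbitrary row per value,
  // i.e. one "abc" row and one "def" row.
  df.dropDuplicates("Column_3").show()

  spark.stop()
}
```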
Also, reproduced with:

val df1 = Seq((1, "x", "abc"), (1, "x", "def")).toDF("Column_1", "Column_2", "Column_3")
val df2 = Seq((1, "x", "xyz"), (1, "x", "sdf")).toDF("Column_1", "Column_2", "Column_4")
val df3 = df1.join(df2, Seq("Column_1", "Column_2"), "outer")
df3.show
df3.dropDuplicates("Column_3", "Column_4")
res68: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Column_1: int, Column_2: string ... 2 more fields]
res68.show

Try df.dropDuplicates(Array("Column_3"))
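The expected output in the question pairs the distinct Column_3 values with the distinct Column_4 values positionally, padding the shorter list with null. A sketch of that idea, assuming the data is small enough to collect to the driver and that Column_1/Column_2 are the constants shown above (the sort is only there to make the ordering deterministic; the original question gives no rule for which values end up paired, so the exact pairing may differ from the sample output):

```scala
import org.apache.spark.sql.SparkSession

object ZipDistinctSketch extends App {
  val spark = SparkSession.builder()
    .appName("zip-distinct-sketch")
    .master("local[*]") // assumption: local run for illustration
    .getOrCreate()
  import spark.implicits._

  val df = Seq(
    (1, "x", "abc", "www"), (1, "x", "abc", "sdf"), (1, "x", "abc", "xyz"),
    (1, "x", "def", "www"), (1, "x", "def", "sdf"), (1, "x", "def", "xyz")
  ).toDF("Column_1", "Column_2", "Column_3", "Column_4")

  // Collect the distinct values of each column to the driver.
  // Fine for small data; this does not scale to very wide columns.
  val c3 = df.select("Column_3").distinct.as[String].collect.sorted
  val c4 = df.select("Column_4").distinct.as[String].collect.sorted

  // Pair them positionally; zipAll pads the shorter side with None,
  // which becomes null in the resulting DataFrame.
  val zipped = c3.map(Option(_)).zipAll(c4.map(Option(_)), None, None)

  val result = zipped.toSeq
    .map { case (a, b) => (1, "x", a.orNull, b.orNull) } // assumed constants
    .toDF("Column_1", "Column_2", "Column_3", "Column_4")

  result.show()
  spark.stop()
}
```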
Hi @Uservxn, welcome to SO :) Two questions: 1. Which Spark version are you on? 2. Is there any rule for which Column_3/Column_4 combination to keep, i.e. how do you decide between keeping 'abc|www' and 'abc|sdf'?