Moving duplicate records to a separate DataFrame in Spark Scala
One DataFrame has 3 million records. I am trying to move only the duplicate records into a separate DataFrame. I am using Spark 1.6 with Scala.

New DataFrame (expected output, containing only the duplicate records):
IM,A-15ACWSSC,CP
IM,A-15ACWSSC,CP
The code I have used:
var df = Seq(
("IM", "A-15ACWSSC", "ASSY 1.5V2", "CP"),
("IM", "A-15ACWSSC", "ASSY 1.5V2", "CP"),
("IN", "A-15ACWSSC", "ASSY 1.6V2", "CP1"),
("IN", "A-15ACWSSC", "ASSY 1.7V2", "CP2")
).toDF("COL1", "COL2", "COL3", "COL4")
df.show()
// +----+----------+----------+----+
// |COL1| COL2| COL3|COL4|
// +----+----------+----------+----+
// | IM|A-15ACWSSC|ASSY 1.5V2| CP|
// | IM|A-15ACWSSC|ASSY 1.5V2| CP|
// | IN|A-15ACWSSC|ASSY 1.6V2| CP1|
// | IN|A-15ACWSSC|ASSY 1.7V2| CP2|
// +----+----------+----------+----+
df.registerTempTable("CLEANFRAME")
val CleanData = sqlContext.sql(
"""select COL1,COL2,COL3,COL4
from
(SELECT COL1,COL2,COL3,COL4, count(1) over (partition by COL1,COL2,COL3,COL4) as Uniqueid
FROM CLEANFRAME)
where Uniqueid > 1
""").cache()
CleanData.show
But this does not return any result. Please help if I am missing something.

Your query should be modified as below; every column has to be included in the grouping (a sketch of that approach follows).
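A minimal, hypothetical sketch of that grouping approach (this code is not from the original answer; it only reuses the column names from the question):

// Group by every column; keys with more than one row are the duplicated ones.
// Note that groupBy collapses each duplicated key to a single row.
val duplicateKeys = df
  .groupBy("COL1", "COL2", "COL3", "COL4")
  .count()
  .filter($"count" > 1)
  .drop("count")
duplicateKeys.show()

Edit: grouping collapses the duplicates to one row per key; using a window instead keeps the duplicate rows themselves: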
var df = Seq(
("IM","A-15ACWSSC","ASSY 1.5V2","CP"),
("IM","A-15ACWSSC","ASSY 1.5V2","CP"),
("IN","A-15ACWSSC","ASSY 1.6V2","CP1"),
("IN","A-15ACWSSC","ASSY 1.7V2","CP2")
).toDF("COL1", "COL2","COL3","COL4")
df.show()
// +----+----------+----------+----+
// |COL1| COL2| COL3|COL4|
// +----+----------+----------+----+
// | IM|A-15ACWSSC|ASSY 1.5V2| CP|
// | IM|A-15ACWSSC|ASSY 1.5V2| CP|
// | IN|A-15ACWSSC|ASSY 1.6V2| CP1|
// | IN|A-15ACWSSC|ASSY 1.7V2| CP2|
// +----+----------+----------+----+
df.createOrReplaceTempView("CLEANFRAME")
val CleanData= sqlContext.sql("""select COL1,COL2,COL3,COL4
from
(SELECT COL1,COL2,COL3,COL4, count(1) over (partition by COL1,COL2,COL3,COL4) as Uniqueid
FROM CLEANFRAME)
where Uniqueid > 1
""" ).cache()
Error:
Exception in thread "main" java.lang.RuntimeException: [3.79] failure: ``)'' expected but `(' found
(SELECT COL1,COL2,COL3,COL4, count(1) over (partition by COL1,COL2,COL3,COL4) as Uniqueid
Comments on the answer:

Thanks a lot, but it throws the exception above (failure: ``)'' expected but `(' found) on the line SELECT COL1,COL2,COL3,COL4, count(1) over (partition by COL1,COL2,COL3,COL4) as Uniqueid. Please help, thanks a lot.

Hi Manish, I have shared the problematic code itself. Please suggest.

You can try the following in Spark SQL: select COL1,COL2,COL3,COL4 from (SELECT COL1,COL2,COL3,COL4, count(1) over (partition by COL1,COL2,COL3,COL4) as Uniqueid FROM CLEANFRAME) a where Uniqueid > 1. Note: add an alias after the nested query.

It now throws "Could not resolve window function 'count'. Note that, using window functions currently requires a HiveContext"; any idea how to enable Hive support in Spark 1.6?
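A minimal sketch of that suggestion, assuming Spark 1.6 in spark-shell with Hive support on the classpath (sc is the shell's SparkContext): the query runs on a HiveContext, because the plain SQLContext parser does not accept window functions, and the nested query is given an alias as suggested in the comments.

import org.apache.spark.sql.hive.HiveContext

// Window functions in Spark 1.6 SQL require a HiveContext
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = Seq(
  ("IM", "A-15ACWSSC", "ASSY 1.5V2", "CP"),
  ("IM", "A-15ACWSSC", "ASSY 1.5V2", "CP"),
  ("IN", "A-15ACWSSC", "ASSY 1.6V2", "CP1"),
  ("IN", "A-15ACWSSC", "ASSY 1.7V2", "CP2")
).toDF("COL1", "COL2", "COL3", "COL4")

df.registerTempTable("CLEANFRAME")

// The subquery is aliased as "a", which avoids the parse error
val duplicates = hiveContext.sql("""
  SELECT COL1, COL2, COL3, COL4
  FROM (SELECT COL1, COL2, COL3, COL4,
               count(1) OVER (PARTITION BY COL1, COL2, COL3, COL4) AS Uniqueid
        FROM CLEANFRAME) a
  WHERE Uniqueid > 1""")
duplicates.show()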
You can try this:
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
scala> var win = Window.partitionBy("a","b","c","d").orderBy("a")
scala> var dff = Seq(("IM","A-15ACWSSC","ASSY 1.5V2","CP"), ("IM","A-15ACWSSC","ASSY 1.5V2","CP"), ("IM","AK11-130BA","13MM BLK RUBBER CAB FOOT","ap")).toDF("a","b","c","d")
scala> dff.show
+---+----------+--------------------+---+
| a| b| c| d|
+---+----------+--------------------+---+
| IM|A-15ACWSSC| ASSY 1.5V2| CP|
| IM|A-15ACWSSC| ASSY 1.5V2| CP|
| IM|AK11-130BA|13MM BLK RUBBER C...| ap|
+---+----------+--------------------+---+
To find the duplicates, count the rows per group with the window and keep those whose count is >= 2:
scala> var dff_dup = dff.withColumn("test",count("*").over(win)).filter($"test">=2)
scala> dff_dup.show
+---+----------+----------+---+----+
| a| b| c| d|test|
+---+----------+----------+---+----+
| IM|A-15ACWSSC|ASSY 1.5V2| CP| 2|
| IM|A-15ACWSSC|ASSY 1.5V2| CP| 2|
+---+----------+----------+---+----+
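A possible follow-up, not part of the answer above, reusing dff, win, and dff_dup from the snippet: drop the helper column, and, if the non-duplicated rows are also wanted, filter the other way.

// Duplicate rows only, without the helper count column
val duplicatesOnly = dff_dup.drop("test")

// Rows that appear exactly once, as a second DataFrame if needed
val uniquesOnly = dff.withColumn("test", count("*").over(win))
  .filter($"test" === 1)
  .drop("test")

duplicatesOnly.show
uniquesOnly.show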