
How to find duplicate column values in a PySpark dataframe


I am trying to find duplicate column values in a dataframe in PySpark.

For example, I have a dataframe with a column "A" whose values are as follows:

==
A
==
1
1
2
3
4
5
5
I want the output to look like this (I only need the duplicated values):

==
A
==
1
5

Could you try this and see if it works?

df = sqlContext.createDataFrame([(1,),(1,),(2,),(3,),(4,),(5,),(5,)], ('A',))
df.createOrReplaceTempView("df_tbl")
spark.sql("select A, count(*) as COUNT from df_tbl group by A having COUNT > 1").show()

+---+-----+
|  A|COUNT|
+---+-----+
|  5|    2|
|  1|    2|
+---+-----+
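
If only the duplicated values themselves are needed, as the question asks, the COUNT column can be left out of the select list. A minimal variation of the query above, assuming the same df_tbl temp view:

# Select only the value column; the having clause still filters on the count
spark.sql("select A from df_tbl group by A having count(*) > 1").show()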

Same answer as @Yuva's, but using built-in functions:

df = sqlContext.createDataFrame([(1,),(1,),(2,),(3,),(4,),(5,),(5,)], ('A',))
df.groupBy("A").count().where("count > 1").drop("count").show()
+---+
|  A|
+---+
|  5|
|  1|
+---+
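
For completeness, here is a self-contained sketch of both approaches using the modern SparkSession entry point rather than the older sqlContext (an assumption on my part; the answers above appear to target a Spark shell where spark and sqlContext are predefined):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("find-duplicates").getOrCreate()

# Build the example column of values from the question
df = spark.createDataFrame([(1,), (1,), (2,), (3,), (4,), (5,), (5,)], ('A',))

# SQL approach: group by the value and keep groups seen more than once
df.createOrReplaceTempView("df_tbl")
spark.sql("select A from df_tbl group by A having count(*) > 1").show()

# DataFrame API approach: the same grouping with built-in functions
df.groupBy("A").count().where(F.col("count") > 1).drop("count").show()

spark.stop()

Both approaches express the same plan: group by the column, count the rows in each group, and keep only groups with more than one row, which is the standard way to find duplicates in Spark.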