Removing duplicates from a DataFrame in PySpark

python apache-spark pyspark


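I'm working with DataFrames in PySpark 1.4 locally and am having trouble getting the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

I'm not sure why, since I appear to be following the syntax correctly.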

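This is not an import problem. You are simply calling .dropDuplicates() on the wrong object. The class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, but once you apply .collect() the result is a plain Python list, and lists do not provide a dropDuplicates method. What you want is something like this: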

 # Deduplicate while the data is still a DataFrame, before collecting
 df1 = (sqlContext
     .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
     .dropDuplicates())

 df1.collect()
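
If the rows have already been collected to the driver, they are an ordinary list and can only be deduplicated with plain Python tooling. The following is a minimal sketch of that fallback, not part of the original answer: plain tuples stand in for the Row objects that .collect() returns (Row is a tuple subclass), and the sample values are made up. It assumes every field is hashable. For anything but small results, deduplicating in Spark before collecting is likely the better choice.

```python
# Plain tuples stand in for the Row objects that .collect() would return;
# the values are made up for illustration.
collected = [
    ("a", 1, "x", 10),
    ("b", 2, "y", 20),
    ("a", 1, "x", 10),  # exact duplicate of the first row
]

# The collected result is an ordinary list, which is why calling
# .dropDuplicates() on it raises AttributeError.
assert isinstance(collected, list)
assert not hasattr(collected, "dropDuplicates")

# dict.fromkeys keeps the first occurrence of each row and preserves order.
deduped = list(dict.fromkeys(collected))
print(deduped)
```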

If you have a DataFrame and want to remove all duplicates with respect to a specific column (called "colName"):

Count before deduplication:

df.count()

Perform the deduplication (casting the column you are deduplicating on to string type):

from pyspark.sql.functions import col
df = df.withColumn('colName', col('colName').cast('string'))

# Assign the result; drop_duplicates returns a new DataFrame
df = df.drop_duplicates(subset=['colName'])
df.count()

A sorted groupBy can be used to check that the duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
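
The sorted groupBy check counts how often each value of colName occurs and puts the largest count first; if that largest count is 1, no duplicates remain. The same reasoning can be sketched in plain Python. This is only an illustration of the logic, not Spark code: the value lists are hypothetical and max_count is a helper name introduced here.

```python
from collections import Counter

# Hypothetical colName values before and after deduplication.
before = ["1", "2", "2", "3", "3", "3"]
after = list(dict.fromkeys(before))  # one entry per distinct value


def max_count(values):
    """Return the highest number of occurrences of any single value."""
    counts = Counter(values)
    return max(counts.values(), default=0)


# A maximum count above 1 means duplicates are still present.
print(max_count(before))
print(max_count(after))
```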