PySpark: drop columns with fewer than 10 non-null values


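I'm new to PySpark.

I have read a Parquet file, and I want to keep only the columns that have at least 10 non-null values.

I have used describe() to get the count of non-null records for each column.

How do I now extract the names of the columns with fewer than 10 values, and then drop those columns before writing to a new file?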

df = spark.read.parquet(file)

col_count = df.describe().filter('summary == "count"')

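You can convert it to a dictionary and then pick out the keys (column names) whose count is below 10. The count is a StringType, so it has to be converted to int in the Python code: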

# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')

# exclude the 1st column(`summary`) from the dataframe and save it to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()

# find column names (k) with int(v) < 10
bad_cols = [ k for k,v in colCountDict.items() if int(v) < 10 ]

# drop bad columns
df_new = df.drop(*bad_cols)
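
The int() conversion above matters because describe() returns its counts as strings. As a minimal sketch, assuming a hypothetical dictionary shaped like the result of .first().asDict() (the column names and counts here are made up for illustration), comparing the raw strings would give the wrong set of columns:

```python
# Hypothetical dictionary shaped like the "count" row of df.describe()
# after .first().asDict(): every value is a string, not a number.
col_count_dict = {"id": "1000", "name": "9", "score": "25", "note": "0"}

# Comparing the raw strings against "10" is lexicographic, not numeric,
# so "9" is treated as greater than "10" and the column is missed.
wrong = [k for k, v in col_count_dict.items() if v < "10"]

# Converting to int first gives the intended numeric comparison.
bad_cols = [k for k, v in col_count_dict.items() if int(v) < 10]

print(wrong)     # ['note']
print(bad_cols)  # ['name', 'note']
```

Comparing a string directly with the integer 10 would not help either, since Python 3 raises a TypeError for that comparison.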

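Some notes:

If the information cannot be retrieved directly from df.describe() or df.summary(), use @pault's method.

You need to drop columns rather than select them, because describe()/summary() only cover numeric and string columns. Selecting columns from the list processed by df.describe() would lose columns of TimestampType, ArrayType, and so on.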

Possible duplicate. Admittedly it is not an exact duplicate, but the solution is basically the same.

n/p, just a reminder: if you also want to check and drop date or timestamp columns, this will not help, because df.describe() and df.summary() do not compute counts for those columns. Have a nice weekend :)

Thanks @jxc. The Parquet file has its attributes as Row objects of structs in a list, so I was able to use groupby and count to find the bad columns and then filter them out with a where clause. This solution also handles all data types. The examples provided helped in arriving at a generic solution.