Fill null values in a PySpark DataFrame based on column data type

Suppose I have a sample DataFrame like this:

+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|  cat|  10| 1.5|
|  dog|  20| 9.0|
| null|  30|null|
|mouse|null|15.3|
+-----+----+----+

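I want to fill the null values according to each column's data type. For example, string columns should be filled with "N/A" and integer columns with 0. Similarly, float columns should be filled with 0.0.

I tried df.fillna(), but then realized there could be N columns, so I would like a dynamic solution.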

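df.dtypes gives you a list of (column name, data type) tuples. You can use it to build lists of the string, integer, and float column names in df, then pass each list as the subset argument to fillna() with the matching fill value.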

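from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data: Python ints are inferred as bigint, floats as double
df = spark.createDataFrame(
    [('cat', 10, 1.5), ('dog', 20, 9.0),
     (None, 30, None), ('mouse', None, 15.3)],
    ['col1', 'col2', 'col3'])

# Group column names by the type string reported in df.dtypes
string_col = [item[0] for item in df.dtypes if item[1].startswith('string')]
big_int_col = [item[0] for item in df.dtypes if item[1].startswith('bigint')]
double_col = [item[0] for item in df.dtypes if item[1].startswith('double')]

# Fill each group with the default value for its type
df.fillna('N/A', subset=string_col)\
        .fillna(0, subset=big_int_col)\
        .fillna(0.0, subset=double_col)\
        .show()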

Output:

+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|  cat|  10| 1.5|
|  dog|  20| 9.0|
|  N/A|  30| 0.0|
|mouse|   0|15.3|
+-----+----+----+
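
If the DataFrame may also contain other types such as int or float, one possible variation is to build a single column-to-value mapping and hand it to fillna(), which accepts a dict. The sketch below is only an illustration: the names DEFAULTS_BY_TYPE and build_fill_map are hypothetical helpers, not part of PySpark, and the defaults chosen for int and float are assumptions that extend the ones in the question. The helper is plain Python, so it runs without a Spark session.

```python
# Hypothetical defaults, keyed by the type strings that df.dtypes reports.
# The 'int' and 'float' entries are assumptions; adjust them to your data.
DEFAULTS_BY_TYPE = {
    'string': 'N/A',
    'int': 0,
    'bigint': 0,
    'float': 0.0,
    'double': 0.0,
}


def build_fill_map(dtypes, defaults=DEFAULTS_BY_TYPE):
    """Return {column name: fill value} for columns whose type has a default.

    `dtypes` is a list of (column name, type string) pairs, the same shape
    as df.dtypes. Columns with an unlisted type are left out of the mapping.
    """
    return {name: defaults[dtype] for name, dtype in dtypes if dtype in defaults}


# The (name, type) pairs that df.dtypes reports for the sample DataFrame
sample_dtypes = [('col1', 'string'), ('col2', 'bigint'), ('col3', 'double')]
fill_map = build_fill_map(sample_dtypes)
print(fill_map)

# With a live DataFrame, the mapping can be passed straight to fillna():
# df.fillna(build_fill_map(df.dtypes)).show()
```

Because the lookup is an exact match on the type string, parameterized types such as decimal(10,2) are skipped unless they are listed explicitly; the startswith() check used above is the alternative if those should be matched by prefix.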
