Python: Replacing 0 with null in PySpark

Some values in my PySpark dataframe show up as NaN, and I found that I can convert them to null values. I then impute those nulls with other values. While doing this, I noticed that it also turned the 0s in many of my columns into nulls. Why does this happen, and how can I convert NaN to null without affecting the 0s?

from pyspark.sql.types import StructType, StructField, LongType

cSchema = StructType([StructField("col", LongType())])
vals = [[0] for i in range(20)]
test_df = spark.createDataFrame(vals,schema=cSchema)

test_df.show(20)

+---+
|col|
+---+
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
|  0|
+---+

test_df = test_df.replace(float('nan'), None)

test_df.show(20)

+----+
| col|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+
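
For reference, the workflow described in the question is roughly the sketch below; the schema, column name and fill value are placeholders for illustration, not taken from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical float column containing a NaN (schema and values are illustrative).
schema = StructType([StructField("val", DoubleType())])
df = spark.createDataFrame([[0.0], [float('nan')], [1.5]], schema=schema)

df = df.replace(float('nan'), None)  # intended step 1: NaN -> null
df = df.fillna(0.0)                  # intended step 2: impute the nulls
df.show()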

The schema in your example doesn't match what you're trying to do: you are searching for a float value in a (long) integer column. I'm surprised that replace doesn't simply ignore that column altogether...
Here is what happens when you try to create such a DataFrame directly:

>>> cSchema = StructType([StructField("col1", LongType()),StructField("col2", LongType())])
... vals = [[0, float('nan')] for i in range(20)]
... test_df = spark.createDataFrame(vals,schema=cSchema)
...
... test_df.show(20)
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\session.py", line 413, in _createFromLocal
    data = list(data)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\session.py", line 730, in prepare
    verify_func(obj)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1389, in verify
    verify_value(obj)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1370, in verify_struct
    verifier(v)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1389, in verify
    verify_value(obj)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1383, in verify_default
    verify_acceptable_types(obj)
  File "D:\Spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\types.py", line 1278, in verify_acceptable_types
    % (dataType, obj, type(obj))))
TypeError: field col2: LongType can not accept object nan in type <class 'float'>

field col2: LongType can not accept object nan in type <class 'float'>
So you can either cast everything to float/double up front (if NaN values are mixed into integer columns), or use the subset parameter of replace to specify only the float columns to search; a sketch of the subset approach follows the example below.

>>> cSchema = StructType([StructField("col1", DoubleType()),StructField("col2", DoubleType())])
... vals = [[0., float('nan')] for i in range(20)]
... test_df = spark.createDataFrame(vals,schema=cSchema)
...
... test_df.show(3)
+----+----+
|col1|col2|
+----+----+
| 0.0| NaN|
| 0.0| NaN|
| 0.0| NaN|
+----+----+
only showing top 3 rows

>>> test_df.replace(float('nan'), None).show(3)
+----+----+
|col1|col2|
+----+----+
| 0.0|null|
| 0.0|null|
| 0.0|null|
+----+----+
only showing top 3 rows
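
As a rough sketch of the second option: DataFrame.replace accepts a subset parameter, which can restrict the NaN -> null replacement to the float columns so that integer columns (and their 0 values) are never considered. The mixed LongType/DoubleType schema and the column names below are illustrative, not from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Illustrative mixed schema: an integer column next to a float column with NaNs.
cSchema = StructType([StructField("col1", LongType()),
                      StructField("col2", DoubleType())])
vals = [[0, float('nan')] for i in range(20)]
df = spark.createDataFrame(vals, schema=cSchema)

# Only search col2, so the LongType column col1 is left untouched.
df = df.replace(float('nan'), None, subset=['col2'])
df.show(3)

If the goal is then to impute the resulting nulls, as described in the question, a follow-up fillna call on the same columns would be the natural next step.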