Creating a new column in a PySpark DataFrame by taking the ratio of existing columns

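I have two columns in a PySpark DataFrame, and I want to take the ratio of these two columns after filling their null values (not in place). Currently, my DataFrame looks like this: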

+----+----+---+----+----+----+----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|
+----+----+---+----+----+----+----+
|   B|  10|200|null|null|  20|null|
|   C|1000|100|  10|null|null|null|
|   A| 100|200| 200| 200| 300|  10|
+----+----+---+----+----+----+----+

My expected output looks like this:

+------+------+-----+------+------+------+------+-------+
| Acct |  M1D | M1C |  M2D |  M2C |  M3D |  M3C | Ratio |
+------+------+-----+------+------+------+------+-------+
|    B |   10 | 200 | null | null | 20   | null |     0 |
|    C | 1000 | 100 | 10   | null | null | null |    10 |
|    A |  100 | 200 | 200  | 200  | 300  | 10   |    20 |
+------+------+-----+------+------+------+------+-------+
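I want to take the ratio of M2D to M3C to create a new column, Ratio. Before computing the ratio, I want to fill the nulls in M2D with 0 and the nulls in M3C with 1. This should happen on the fly, so that the ratio contains no nulls and the original column values are not replaced.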

I tried to achieve this with the following code:

df = df.withColumn('Ratio', col('M2D').fillna(0, subset=['M2D']) / col('M3C').fillna(1, subset=['M3C']))
The code above gave me the following error:

TypeError: 'Column' object is not callable
To avoid the TypeError above, I tried the following line, which operates on DataFrames instead of Columns:

df = df.withColumn('Ratio', df.select('M2D').fillna(0, subset=['M2D']) / df.select('M3C').fillna(1, subset=['M3C']))
That code resulted in the following error:

TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'

How can I achieve the desired output?

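fillna is a DataFrame method, not a Column method, so fill the null values on the DataFrame before computing the ratio. Note that this also replaces the nulls in M2D and M3C in the resulting DataFrame:

from pyspark.sql.functions import col

df = df.fillna(0, subset=['M2D'])\
       .fillna(1, subset=['M3C'])\
       .withColumn('Ratio', col('M2D') / col('M3C'))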
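Or, more simply, if you only want to avoid nulls in the calculation and leave M2D and M3C unchanged, use coalesce as follows:

from pyspark.sql.functions import coalesce, col, lit

df = df.withColumn('Ratio', coalesce(col('M2D'), lit(0)) / coalesce(col('M3C'), lit(1)))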
