Apache Spark / PySpark: drop rows with nulls in a subset, save them, then add them back later
Here's the thing, basically I have something like this:
C1 C2 C3 C4
a 0 1 null 4
b 0 1 3 4
c 0 1 4 4
d 0 null 5 4
So far I've been doing it like this, and it works fine:
sub=['C2','C3']
df = df.na.drop(subset=sub)
C1 C2 C3 C4
b 0 1 3 4
c 0 1 4 4
But now I want to save those rows with nulls in another dataframe, so I can add them back later with some function.
Dataframe_of_nulls:
C1 C2 C3 C4
a 0 1 null 4
d 0 null 5 4
Feel free to ignore the indexes; they are only there to keep the explanation less cluttered.

You can filter on each condition:
from pyspark.sql.functions import col, lit
from operator import or_
from functools import reduce

def split_on_null(df, subset):
    any_null = reduce(or_, (col(c).isNull() for c in subset), lit(False))
    return df.where(any_null), df.where(~any_null)
Usage:
sub = ["c2", "c3"]

df = spark.createDataFrame([
    (0, 1, None, 4), (0, 1, 3, 4), (0, 1, 4, 4), (0, None, 5, 4),
    (0, 1, 3, 4), (0, None, 5, 4)]
).toDF("c1", "c2", "c3", "c4")

with_nulls, without_nulls = split_on_null(df, sub)
with_nulls.show()
+---+----+----+---+
| c1|  c2|  c3| c4|
+---+----+----+---+
|  0|   1|null|  4|
|  0|null|   5|  4|
|  0|null|   5|  4|
+---+----+----+---+
without_nulls.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  0|  1|  3|  4|
|  0|  1|  4|  4|
|  0|  1|  3|  4|
+---+---+---+---+
Another solution is to subtract:
without_nulls_ = df.na.drop(subset=sub)
with_nulls_ = df.subtract(without_nulls_ )
but it is significantly more expensive and it doesn't preserve duplicates:
without_nulls_.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  0|  1|  3|  4|
|  0|  1|  4|  4|
|  0|  1|  3|  4|
+---+---+---+---+
with_nulls_.show()
+---+----+----+---+
| c1|  c2|  c3| c4|
+---+----+----+---+
|  0|null|   5|  4|
|  0|   1|null|  4|
+---+----+----+---+