PySpark: remove rows with nulls in a subset of columns, save them, then add them back (apache-spark, pyspark)


Basically, I have something like this:

   C1   C2   C3    C4
a   0    1    null  4
b   0    1    3     4
c   0    1    4     4
d   0    null 5     4
So far I have been doing this, and it works fine:

sub=['C2','C3']
df = df.na.drop(subset=sub)

   C1   C2   C3   C4
b   0    1    3    4
c   0    1    4    4
But now I want to save the rows that contain nulls in another DataFrame, so that I can add them back later using some function:

Dataframe_of_nulls:
   C1   C2   C3   C4
a   0    1    null 4
d   0    null 5    4

Feel free to ignore the row labels (a, b, c, d); they are only there to make the explanation less confusing.

You can filter on each condition:

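from pyspark.sql.functions import col, lit
from operator import or_
from functools import reduce

def split_on_null(df, subset):
    # OR together one isNull() test per column; lit(False) is the starting value
    any_null = reduce(or_, (col(c).isNull() for c in subset), lit(False))
    # First DataFrame: rows with a null in any subset column; second: the rest
    return df.where(any_null), df.where(~any_null)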
Usage:

df = spark.createDataFrame([
    (0, 1, None, 4), (0, 1, 3, 4), (0, 1, 4, 4), (0, None, 5, 4),
    (0, 1, 3, 4), (0, None, 5, 4)]
).toDF("c1", "c2", "c3", "c4")

with_nulls, without_nulls = split_on_null(df, sub)

with_nulls.show()
+---+----+----+---+
| c1|  c2|  c3| c4|
+---+----+----+---+
|  0|   1|null|  4|
|  0|null|   5|  4|
|  0|null|   5|  4|
+---+----+----+---+

without_nulls.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  0|  1|  3|  4|
|  0|  1|  4|  4|
|  0|  1|  3|  4|
+---+---+---+---+
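
To make the fold inside split_on_null easier to picture, here is a minimal plain-Python sketch of the same idea, with no Spark involved. The list of dict rows and the helper names has_null and split_rows_on_null are hypothetical stand-ins for the DataFrame and its column expressions, not part of the answer above.

```python
from functools import reduce
from operator import or_

# Hypothetical stand-in for the DataFrame: a list of dict rows.
rows = [
    {"c1": 0, "c2": 1, "c3": None, "c4": 4},
    {"c1": 0, "c2": 1, "c3": 3, "c4": 4},
    {"c1": 0, "c2": 1, "c3": 4, "c4": 4},
    {"c1": 0, "c2": None, "c3": 5, "c4": 4},
]

def has_null(row, subset):
    # OR together one "is None" test per column, starting from False
    # (the identity for OR), mirroring reduce(or_, ..., lit(False)).
    return reduce(or_, (row[c] is None for c in subset), False)

def split_rows_on_null(rows, subset):
    # Partition the rows into (has a null in the subset, has none).
    with_nulls = [r for r in rows if has_null(r, subset)]
    without_nulls = [r for r in rows if not has_null(r, subset)]
    return with_nulls, without_nulls

with_nulls, without_nulls = split_rows_on_null(rows, ["c2", "c3"])

# Adding the saved rows back is plain concatenation in this model.
restored = without_nulls + with_nulls
```

In Spark, the analogous way to add the saved rows back would presumably be a union (or unionByName) of the two DataFrames. It appends rows without deduplicating, but the original row order is not guaranteed.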
Another solution is to use subtract:

without_nulls_ = df.na.drop(subset=sub)
with_nulls_ = df.subtract(without_nulls_)

but it is considerably more expensive and does not preserve duplicates:

without_nulls_.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  0|  1|  3|  4|
|  0|  1|  4|  4|
|  0|  1|  3|  4|
+---+---+---+---+

with_nulls_.show()
+---+----+----+---+
| c1|  c2|  c3| c4|
+---+----+----+---+
|  0|null|   5|  4|
|  0|   1|null|  4|
+---+----+----+---+
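
The lost duplicate follows from subtract behaving like a set difference (EXCEPT DISTINCT). The plain-Python sketch below models that on the same six rows and contrasts it with a multiset difference, which is how EXCEPT ALL behaves; in Spark 2.4 and later that corresponds to DataFrame.exceptAll. The tuples and variable names are illustrative assumptions, and no Spark is involved.

```python
from collections import Counter

# The six example rows as plain tuples (c1, c2, c3, c4).
all_rows = [
    (0, 1, None, 4), (0, 1, 3, 4), (0, 1, 4, 4), (0, None, 5, 4),
    (0, 1, 3, 4), (0, None, 5, 4),
]

# Rows with no null in c2 or c3, modelling df.na.drop(subset=sub).
kept = [r for r in all_rows if r[1] is not None and r[2] is not None]

# Set difference (models subtract / EXCEPT DISTINCT): duplicates collapse.
distinct_diff = set(all_rows) - set(kept)

# Multiset difference (models EXCEPT ALL): every copy is counted.
multiset_diff = list((Counter(all_rows) - Counter(kept)).elements())

print(len(distinct_diff))   # 2 rows: the duplicate (0, None, 5, 4) is lost
print(len(multiset_diff))   # 3 rows: both copies of (0, None, 5, 4) survive
```

Even with a duplicate-preserving difference, the filter-based split is likely the cheaper option, since it does not have to compare whole rows of one DataFrame against the other.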