Python PySpark: the opposite of .dropna()?


I am trying to find out which shops had an "empty" day, i.e. a day on which no customers came.

My table has the following structure:

+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| shop     | 2020-10-15  | 2020-10-16  | 2020-10-17  | 2020-10-18  | 2020-10-19  | 2020-10-20  | 2020-10-21 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| Paris    | 215         | 213         | 128         | 102         | 195         | 180         |        110 |
| London   | 145         | 106         | 102         | 83          | 127         | 111         |         56 |
| Beijing  | 179         | 245         | 134         | 136         | 207         | 183         |        136 |
| Sydney   | 0           | 0           | 0           | 0           | 0           | 6           |         36 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
With pandas I can do something like
customers[customers == 0].dropna(how="all")
which keeps only the rows that contain a 0, and I get the following result:

+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| shop     | 2020-10-15  | 2020-10-16  | 2020-10-17  | 2020-10-18  | 2020-10-19  | 2020-10-20  | 2020-10-21 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| Sydney   | 0           | 0           | 0           | 0           | 0           | NaN         |         NaN|
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
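For reference, a minimal pandas sketch of this step; the customers DataFrame below is a reduced, hypothetical reconstruction (fewer date columns than the table above), not the original data.

import pandas as pd

# Hypothetical reconstruction of the customers DataFrame
customers = pd.DataFrame(
    {
        "2020-10-15": [215, 145, 179, 0],
        "2020-10-16": [213, 106, 245, 0],
        "2020-10-17": [128, 102, 134, 0],
    },
    index=["Paris", "London", "Beijing", "Sydney"],
)

# Mask every non-zero value as NaN, then drop rows that are entirely NaN,
# so only shops with at least one empty day remain
empty_days = customers[customers == 0].dropna(how="all")
print(empty_days)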

I believe PySpark has something similar, but I want to do the opposite and keep only the rows with the NA/0 values. How can this be done?

Creating a sample dataset:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f

df_list = [
    {"shop": "Paris",   "2020-10-15": 215, "2020-10-16": 213, "2020-10-17": 128, "2020-10-18": 195, "2020-10-19": 195},
    {"shop": "London",  "2020-10-15": 145, "2020-10-16": 106, "2020-10-17": 102, "2020-10-18": 127, "2020-10-19": 127},
    {"shop": "Beijing", "2020-10-15": 179, "2020-10-16": 245, "2020-10-17": 136, "2020-10-18": 207, "2020-10-19": 207},
    {"shop": "Sydney",  "2020-10-15": 0,   "2020-10-16": 0,   "2020-10-17": 0,   "2020-10-18": 0,   "2020-10-19": 0},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(Row(**x) for x in df_list)
df.show()
--

You can apply the filter function:

df.filter(f.greatest(*[f.col(i).isin(0) for i in df.columns])).show()
Result:

+------+----------+----------+----------+----------+----------+
|  shop|2020-10-15|2020-10-16|2020-10-17|2020-10-18|2020-10-19|
+------+----------+----------+----------+----------+----------+
|Sydney|         0|         0|         0|         0|         0|
+------+----------+----------+----------+----------+----------+
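A slight variant of the same filter, sketched under the assumption that only the date columns need testing (the string column shop is skipped) and that a plain equality check replaces isin(0):

# Assumption: every column except "shop" holds a customer count
value_cols = [c for c in df.columns if c != "shop"]

# Keep a row as soon as any date column equals 0
df.filter(f.greatest(*[f.col(c) == 0 for c in value_cols])).show()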


You could create a new DataFrame with dropna() from the first DataFrame, and then use a left anti join between the two DataFrames. See this page about left anti joins.

Hi, thanks for your answer. Unfortunately I'd rather not iterate over all the columns; I want a solution that works with any number of columns.

I have modified the code, I hope this matches what you want.

This is great, I think I can use it! Just a small note: we don't need isin(0), since there is only one value to test. Using f.col(i) == 0 should work too, right?
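A rough sketch of the left anti join idea mentioned in the first comment, assuming the zero counts are first re-expressed as nulls so that dropna() can act on them; the intermediate names here are illustrative, not the answerer's actual code:

value_cols = [c for c in df.columns if c != "shop"]

# Re-express 0 as null so dropna() can see it (when() without otherwise() yields null)
zero_to_null = df.select(
    "shop",
    *[f.when(f.col(c) != 0, f.col(c)).alias(c) for c in value_cols],
)

# Shops that never had an empty day (no nulls in any date column)
non_empty = zero_to_null.dropna(how="any")

# The left anti join keeps only the rows of df with no match in non_empty,
# i.e. the shops that had at least one empty day
df.join(non_empty, on="shop", how="left_anti").show()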