Python PySpark: the opposite of .dropna()?
I am trying to find out which shops had an "empty" day, i.e. a day on which no customers came. My table has the following structure:
+----------+------------+------------+------------+------------+------------+------------+------------+
| shop     | 2020-10-15 | 2020-10-16 | 2020-10-17 | 2020-10-18 | 2020-10-19 | 2020-10-20 | 2020-10-21 |
+----------+------------+------------+------------+------------+------------+------------+------------+
| Paris    |        215 |        213 |        128 |        102 |        195 |        180 |        110 |
| London   |        145 |        106 |        102 |         83 |        127 |        111 |         56 |
| Beijing  |        179 |        245 |        134 |        136 |        207 |        183 |        136 |
| Sydney   |          0 |          0 |          0 |          0 |          0 |          6 |         36 |
+----------+------------+------------+------------+------------+------------+------------+------------+
Using pandas, I can do something like customers[customers == 0].dropna(how="all"), which keeps only the rows containing a 0, and I get the following result:
+----------+------------+------------+------------+------------+------------+------------+------------+
| shop     | 2020-10-15 | 2020-10-16 | 2020-10-17 | 2020-10-18 | 2020-10-19 | 2020-10-20 | 2020-10-21 |
+----------+------------+------------+------------+------------+------------+------------+------------+
| Sydney   |          0 |          0 |          0 |          0 |          0 |        NaN |        NaN |
+----------+------------+------------+------------+------------+------------+------------+------------+
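For reference, here is that masking trick as a runnable pandas sketch (my own reconstruction, assuming the table lives in a DataFrame called customers with the shop names as its index):

import pandas as pd

# Reconstruction of the question's table, shop names as the index
customers = pd.DataFrame(
    {
        "2020-10-15": [215, 145, 179, 0],
        "2020-10-16": [213, 106, 245, 0],
        "2020-10-17": [128, 102, 134, 0],
        "2020-10-18": [102, 83, 136, 0],
        "2020-10-19": [195, 127, 207, 0],
        "2020-10-20": [180, 111, 183, 6],
        "2020-10-21": [110, 56, 136, 36],
    },
    index=["Paris", "London", "Beijing", "Sydney"],
)

# customers == 0 is a boolean mask; indexing with it keeps the 0s and
# turns every other value into NaN, so dropna(how="all") removes the
# rows that contained no 0 at all.
print(customers[customers == 0].dropna(how="all"))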
In PySpark, I believe .dropna() does something similar, but I want to do the opposite and keep the NA/0 rows. How can this be done?

Create a sample dataset:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f

# Sample data, mirroring the first five days of the question's table
df_list = [
    {"shop": "Paris",   "2020-10-15": 215, "2020-10-16": 213, "2020-10-17": 128, "2020-10-18": 102, "2020-10-19": 195},
    {"shop": "London",  "2020-10-15": 145, "2020-10-16": 106, "2020-10-17": 102, "2020-10-18":  83, "2020-10-19": 127},
    {"shop": "Beijing", "2020-10-15": 179, "2020-10-16": 245, "2020-10-17": 134, "2020-10-18": 136, "2020-10-19": 207},
    {"shop": "Sydney",  "2020-10-15":   0, "2020-10-16":   0, "2020-10-17":   0, "2020-10-18":   0, "2020-10-19":   0},
]

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(Row(**x) for x in df_list)
df.show()
--
You can apply a filter. f.greatest returns the largest value across its argument columns, skipping nulls; on boolean expressions it acts as a row-wise OR (True sorts above False), so the filter keeps every row in which at least one column equals 0 (the string shop column compares to 0 as null, which greatest ignores):
df.filter(f.greatest(*[f.col(i).isin(0) for i in df.columns])).show()
Result:
+------+----------+----------+----------+----------+----------+
| shop|2020-10-15|2020-10-16|2020-10-17|2020-10-18|2020-10-19|
+------+----------+----------+----------+----------+----------+
|Sydney| 0| 0| 0| 0| 0|
+------+----------+----------+----------+----------+----------+
Comment: You could create a new dataframe with dropna() from the first dataframe, then use a left anti join between the two dataframes. Have a look at this page on left anti joins.

Comment (asker): Hi, thanks for your answer. Unfortunately I'd rather not iterate over all the columns; I want a solution that works with any number of columns.

Comment (answerer): I have modified the code; hopefully this meets your requirement.

Comment (asker): This is great, I think I can use it! Just a quick note: we don't need isin(0) here, since there is only one value to test. Using f.col(i) == 0 should work too, right?
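For what it's worth, a minimal sketch of that variant (my own illustration, not part of the original answer), which also restricts the test to the date columns instead of relying on the shop column comparing to 0 as null:

# Equality test instead of isin(0), applied only to the date columns
value_cols = [c for c in df.columns if c != "shop"]
df.filter(f.greatest(*[f.col(c) == 0 for c in value_cols])).show()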
+------+----------+----------+----------+----------+----------+
| shop|2020-10-15|2020-10-16|2020-10-17|2020-10-18|2020-10-19|
+------+----------+----------+----------+----------+----------+
|Sydney| 0| 0| 0| 0| 0|
+------+----------+----------+----------+----------+----------+
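And a rough sketch of the left anti join idea from the first comment (my own construction, assuming we first turn the 0s into nulls so that dropna() has something to drop):

# Replace 0 with null in every date column, drop the rows that now
# contain a null (i.e. the shops with at least one empty day), then
# anti-join back to recover exactly those shops.
value_cols = [c for c in df.columns if c != "shop"]
zeros_as_null = df.select(
    "shop",
    *[f.when(f.col(c) != 0, f.col(c)).alias(c) for c in value_cols]
)
non_empty = zeros_as_null.dropna(how="any")
df.join(non_empty, on="shop", how="left_anti").show()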