Python: get the non-null data of each field, partitioned by ID
I have a DataFrame like this:
id city province status date
---- -------- ---------- -------- ----------
1 Cainta Rizal failed 22/07/2020
1 nan nan success 22/07/2020
1 nan nan success 22/07/2020
2 Pasig Manila success 22/07/2020
2 nan nan failed 22/07/2020
2 nan nan failed 22/07/2020
3 Marikina Manila failed 22/07/2020
3 nan nan success 22/07/2020
3 nan nan success 22/07/2020
What I want is to transform the DataFrame above into the following one:
id city province status date
---- -------- ---------- -------- ----------
1 Cainta Rizal success 22/07/2020
2 Pasig Manila success 22/07/2020
3 Marikina Manila success 22/07/2020
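So the criterion is: for each id that has a row with status 'success', take the non-null values of city and province. I can achieve this in SQL with the following code, and I would like to replicate it in pandas: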
SELECT ID,
MAX(CITY) AS CITY,
MAX(PROVINCE) AS PROVINCE,
'SUCCESS' AS STATUS,
MAX(CASE WHEN STATUS = 'SUCCESS' THEN DATE END) AS "DATE"
FROM TABLE
GROUP BY ID
I hope my example is clear. Thank you very much!
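Edit: I will be running this on a one-million-row DataFrame, if possible.
If all missing values for each id should be filled in, the best approach is to forward-fill them within each id group with ffill, then filter on the status column with query, and finally keep the first row for each id with drop_duplicates.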
I'm not sure this SQL query only handles each id whose status is 'success'.
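This solution works well when the missing value is explicitly nan, but if it is a blank or whitespace-only string, that blank gets propagated through the whole field instead. My workaround was to convert blanks and whitespace to NaN with the replace() method. The actual code is df = df.replace(r'^\s*$', np.nan, regex=True). Thanks, by the way!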
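cols = ['city', 'province']
# Forward-fill the missing city/province values within each id group.
df[cols] = df.groupby('id')[cols].ffill()
# Keep only the 'success' rows, then the first such row for each id.
df = df.query('status == "success"').drop_duplicates('id')
print(df)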
id city province status date
1 1 Cainta Rizal success 22/07/2020
3 2 Pasig Manila success 22/07/2020
7 3 Marikina Manila success 22/07/2020
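For anyone who wants to run the whole thing end to end, here is a minimal self-contained sketch that combines the answer with the whitespace workaround from the comments. The sample frame is rebuilt from the question, with a few missing values deliberately written as blank or whitespace-only strings to exercise the normalization step. The helper name first_success_per_id is hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question. Some missing values are written as blank or
# whitespace-only strings on purpose (an assumption made for this sketch).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "city": ["Cainta", "", " ", "Pasig", np.nan, np.nan, "Marikina", np.nan, ""],
    "province": ["Rizal", " ", "", "Manila", np.nan, np.nan, "Manila", "", np.nan],
    "status": ["failed", "success", "success",
               "success", "failed", "failed",
               "failed", "success", "success"],
    "date": ["22/07/2020"] * 9,
})


def first_success_per_id(frame):
    """Hypothetical helper: first 'success' row per id with city/province filled."""
    cols = ["city", "province"]
    out = frame.copy()
    # Turn empty and whitespace-only strings into real NaN so ffill skips them.
    out[cols] = out[cols].replace(r"^\s*$", np.nan, regex=True)
    # Forward-fill within each id group; the result keeps the original index.
    out[cols] = out.groupby("id")[cols].ffill()
    # Keep the 'success' rows, then the first such row for each id.
    return out.query('status == "success"').drop_duplicates("id")


result = first_success_per_id(df)
print(result)
```

With this sample the surviving rows are the ones at index 1, 3 and 7, the same rows as in the output above. Note that ffill only propagates values downward, so this relies on the non-null city and province appearing before the first 'success' row of each id, as they do in the question's data.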