Pandas:在多个列中查找具有匹配值的行的python方法(分层条件)
对不起,标题有点不清楚。我无法用语言简明扼要地描述这个问题。希望我下面的描述能帮助澄清。欢迎对标题进行任何澄清性编辑 我正在尝试从一个数据帧创建一个networkx流程图。数据框记录订单如何流经多家公司。dataframe中的大多数行都是连接的,连接显示在多个列中。样本数据如下:Pandas:在多个列中查找具有匹配值的行的python方法(分层条件),python,pandas,networkx,Python,Pandas,Networkx,对不起,标题有点不清楚。我无法用语言简明扼要地描述这个问题。希望我下面的描述能帮助澄清。欢迎对标题进行任何澄清性编辑 我正在尝试从一个数据帧创建一个networkx流程图。数据框记录订单如何流经多家公司。dataframe中的大多数行都是连接的,连接显示在多个列中。样本数据如下: df = pd.DataFrame({'Company': ['A', 'A', 'B', 'B', 'B', 'C', 'C'], 'event_type':['new', 'route'
df = pd.DataFrame({'Company': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'event_type':['new', 'route', 'receive', 'execute', 'route', 'receive', 'execute'],
'event_id': ['110', '120', '200', '210', '220', '300', '310'],
'prior_event_id': [np.nan, '110', np.nan, '120', '210', np.nan, '300'],
'route_id': [np.nan, 'foo', 'foo', np.nan, 'bar', 'bar', np.nan]}
)
数据框如下所示:
Company event_type event_id prior_event_id route_id
0 A new 110 NaN NaN
1 A route 120 110 foo
2 B receive 200 NaN foo
3 B execute 210 120 NaN
4 B route 220 210 bar
5 C receive 300 NaN bar
6 C execute 310 300 NaN
订单经过3家公司:A、B、C。在每个公司内,后一个事件可以通过event\u id
-previor\u event\u id
对链接到其源事件。但这种方法不适用于属于不同公司的记录。例如,第1行和第2行将仅通过一列route\u id
进行匹配。因此,我尝试重新创建的链接机制是一种层次结构,因为如果event\u id
-previor\u event\u id
列对没有产生任何结果,我将只使用列route\u id
进行匹配
下图可能有助于说明链接机制:
我的解决方案相当笨拙:
# Make every event unique so as to not confound the linking
df['event_sub'] = df.groupby(df.event_type).cumcount()+1
df['event'] = df.event_type + ' ' + df.event_sub.astype(str)
# Find the match based on first matching criterion
replace_dict_event = dict(df[['event_id', 'event']].values)
df['source'] = df['prior_event_id'].apply(lambda x: replace_dict_event.get(x) if replace_dict_event.get(x) else np.nan )
df['target'] = df['event_id'].apply(lambda x: replace_dict_event.get(x) if replace_dict_event.get(x) else np.nan )
# From last step, find the match based on second matching criterion for the unmatched rows
replace_dict_rtd = dict(df[df.event_type == 'route'][['route_id', 'event']].values)
df.loc[df.event_type == 'receive', 'source'] = df[df.event_type == 'receive']['route_id'].apply(lambda x: replace_dict_rtd.get(x))
df
我基本上使用了两次apply
来逐步获得匹配。我想知道是否有一种更干净、更像蟒蛇的方法来做这件事
我的结果如下:
以及我由此创建的networkx图:
您有两种不同类型的链接:a)通过匹配
先前的事件id
和事件id
定义的链接,以及b)通过路由id
定义的链接。使用两组不同的命令来提取两种不同类型的关系是pythonic(或者仅仅是良好的编码实践)
也就是说,由于您处理的是表格数据,因此最好使用合并(特别是内部联接)来提取链接,而不是使用apply进行字典查找。表格数据的数据库针对这类查询进行了优化,而大型数据集的查询速度会慢得多
#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
if __name__ == '__main__':
df = pd.DataFrame({'Company': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'event_type':['new', 'route', 'receive', 'execute', 'route', 'receive', 'execute'],
'event_id': ['110', '120', '200', '210', '220', '300', '310'],
'prior_event_id': [np.nan, '110', np.nan, '120', '210', np.nan, '300'],
'route_id': [np.nan, 'foo', 'foo', np.nan, 'bar', 'bar', np.nan]}
)
# --------------------------------------------------------------------------------
# a) links established by matching event_id with prior_event_id
df2 = pd.merge(df, df, left_on='event_id', right_on='prior_event_id', how='inner')
# Company_x event_type_x event_id_x prior_event_id_x route_id_x Company_y event_type_y event_id_y prior_event_id_y route_id_y
# 0 A new 110 NaN NaN A route 120 110 foo
# 1 A route 120 110 foo B execute 210 120 NaN
# 2 B execute 210 120 NaN B route 220 210 bar
# 3 C receive 300 NaN bar C execute 310 300 NaN
# --------------------------------------------------------------------------------
# b) links established by matching route_id
# remove events without route ids
valid = df['route_id'].notna()
df3 = df['valid']
# Company event_type event_id prior_event_id route_id
# 1 A route 120 110 foo
# 2 B receive 200 NaN foo
# 4 B route 220 210 bar
# 5 C receive 300 NaN bar
# join on route_id
df4 = pd.merge(df3, df3, on='route_id', how='inner')
# Company_x event_type_x event_id_x prior_event_id_x route_id Company_y event_type_y event_id_y prior_event_id_y
# 0 A route 120 110 foo A route 120 110
# 1 A route 120 110 foo B receive 200 NaN
# 2 B receive 200 NaN foo A route 120 110
# 3 B receive 200 NaN foo B receive 200 NaN
# 4 B route 220 210 bar B route 220 210
# 5 B route 220 210 bar C receive 300 NaN
# 6 C receive 300 NaN bar B route 220 210
# 7 C receive 300 NaN bar C receive 300 NaN
# remove cases where a company was matched to itself
valid = df4['Company_x'] != df4['Company_y']
df5 = df4[valid]
# Company_x event_type_x event_id_x prior_event_id_x route_id Company_y event_type_y event_id_y prior_event_id_y
# 1 A route 120 110 foo B receive 200 NaN
# 2 B receive 200 NaN foo A route 120 110
# 5 B route 220 210 bar C receive 300 NaN
# 6 C receive 300 NaN bar B route 220 210