Python: merge only when a condition is true
I have two DataFrames that I need to merge, and I need to add a column that indicates whether each record was accepted. I have this:
dfa[dfa.CONTROL.isin([334030860978638])]
Out[107]:
CONTROL A B DATE_HOUR
1629136 334030860978638 525562414612 52447860015000 2015-08-02 16:32:00
1629137 334030860978638 525562414612 52447860015000 2015-08-02 16:42:32
1629138 334030860978638 525562414612 52447860015000 2015-08-02 18:33:12
1629139 334030860978638 525562414612 52447860015000 2015-08-03 19:40:19
dfb[dfb.control.isin([334030860978638])]
Out[108]:
control a b date_hour
id
299366338 334030860978638 525562414612 447860015000 2015-08-02 16:33:08
299392621 334030860978638 525562414612 447860015000 2015-08-02 16:43:40
299665465 334030860978638 525562414612 447860015000 2015-08-02 18:34:21
view = dfa.merge(dfb, left_on=['CONTROL', 'A', 'B'],
right_on=['control', 'a', 'b'], how='outer')
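I need to compare DATE_HOUR with date_hour and check whether a record falls within a time window, for example 3600 seconds. If several records fall within the window, I need to take the closest one and flag it: in a new column, accepted, I set True for it, otherwise False.
My expected output: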
CONTROL A B DATE_HOUR control a b date_hour accepted
334030860978638 525562414612 52447860015000 2015-08-02 16:32:00 334030860978638 525562414612 52447860015000 2015-08-02 16:32:08 True
334030860978638 525562414612 52447860015000 2015-08-02 16:42:32 334030860978638 525562414612 52447860015000 2015-08-02 16:43:40 True
334030860978638 525562414612 52447860015000 2015-08-02 18:33:12 334030860978638 525562414612 52447860015000 2015-08-02 18:34:21 True
334030860978638 525562414612 52447860015000 2015-08-03 19:40:19 NaN NaN Nan NaT False
Can I use the apply method for this task? Could someone help me do this the right way with pandas?
This helped me solve my problem:
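import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def nearest(group, match, groupname, lname, rname, name_field_diff='diff_minutes'):
    # Keep only the rows of `match` that belong to the current group key.
    match = match[match[groupname] == group.name]
    if match.empty:
        return group  # nothing to match against for this key
    group = group.copy()
    # Fit a 1-nearest-neighbour model on the right-hand timestamps,
    # converted to integer nanoseconds so scikit-learn sees plain numbers.
    nbrs = NearestNeighbors(n_neighbors=1).fit(match[rname].values.astype('int64')[:, None])
    dist, ind = nbrs.kneighbors(group[lname].values.astype('int64')[:, None])
    # Attach the closest right-hand timestamp and the absolute gap in minutes.
    group[rname] = match[rname].values[ind.ravel()]
    time_diff = (group[rname] - group[lname]) / np.timedelta64(1, 'm')
    group[name_field_diff] = time_diff.abs()
    return group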
d1 = [{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:32:00'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:42:32'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 18:33:12'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 19:40:19'}]
d2 = [{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:33:08'},
{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:43:40'},
{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 18:34:21'}]
df1 = pd.DataFrame(d1)
df1.DATE_HOUR = pd.to_datetime(df1.DATE_HOUR, format='%Y-%m-%d %H:%M:%S')
df2 = pd.DataFrame(d2)
df2.date_hour = pd.to_datetime(df2.date_hour, format='%Y-%m-%d %H:%M:%S')
view = df1.groupby('CONTROL', group_keys=False).apply(nearest, df2, 'control', 'DATE_HOUR', 'date_hour')
view
A B CONTROL DATE_HOUR date_hour diff_minutes
0 525562414612 52447860015000 334030860978638 2015-08-02 16:32:00 2015-08-02 16:33:08 1.133333
1 525562414612 52447860015000 334030860978638 2015-08-02 16:42:32 2015-08-02 16:43:40 1.133333
2 525562414612 52447860015000 334030860978638 2015-08-02 18:33:12 2015-08-02 18:34:21 1.150000
3 525562414612 52447860015000 334030860978638 2015-08-02 19:40:19 2015-08-02 18:34:21 65.966667
Now I filter on my gap to find the records that do not fit:
df1[df1.index.isin(view[(view.diff_minutes >= 60)].index)]
A B CONTROL DATE_HOUR
3 525562414612 52447860015000 334030860978638 2015-08-02 19:40:19
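The question also asks for an accepted column rather than only a filter. A minimal sketch of that last step, assuming a frame shaped like view above (the two-row frame and the name max_gap_minutes are hypothetical stand-ins for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the `view` frame: left timestamp plus its nearest right timestamp.
view = pd.DataFrame({
    'DATE_HOUR': pd.to_datetime(['2015-08-02 16:32:00', '2015-08-02 19:40:19']),
    'date_hour': pd.to_datetime(['2015-08-02 16:33:08', '2015-08-02 18:34:21']),
})
# Absolute gap between the two timestamps, expressed in minutes.
view['diff_minutes'] = (view['date_hour'] - view['DATE_HOUR']).abs() / pd.Timedelta(minutes=1)

max_gap_minutes = 60  # 3600 seconds, the window mentioned in the question
# A row is accepted when its nearest match lies inside the window.
view['accepted'] = view['diff_minutes'] < max_gap_minutes
# Blank out the right-hand timestamp for rejected rows, as in the expected output.
view.loc[~view['accepted'], 'date_hour'] = pd.NaT
```

This mirrors the filter above: rows with diff_minutes >= 60 are the ones marked False.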
This is a very interesting problem. Imagine that your dfa already had the columns from dfb, only with the values missing. It then becomes a missing-data problem, and you basically want to solve a nearest-neighbour problem for each row of dfa. First group on CONTROL and control, then sort on DATE_HOUR and date_hour. Next, you would have to look up a nearest-neighbour algorithm and adapt it to your needs.
I took your suggestion, and thank you for it. I was really confused.
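The group, sort, and nearest-match steps described in the comment can also be sketched with pandas alone. This is not the approach used in the answer above, only an alternative under the assumption that pd.merge_asof (with its direction and tolerance parameters) is available in your pandas version. The sample data is cut down to the key and timestamp columns.

```python
import pandas as pd

df1 = pd.DataFrame({
    'CONTROL': [334030860978638] * 4,
    'DATE_HOUR': pd.to_datetime(['2015-08-02 16:32:00', '2015-08-02 16:42:32',
                                 '2015-08-02 18:33:12', '2015-08-02 19:40:19']),
})
df2 = pd.DataFrame({
    'control': [334030860978638] * 3,
    'date_hour': pd.to_datetime(['2015-08-02 16:33:08', '2015-08-02 16:43:40',
                                 '2015-08-02 18:34:21']),
})

# merge_asof requires both frames to be sorted on their time key.
left = df1.sort_values('DATE_HOUR')
right = df2.sort_values('date_hour')

merged = pd.merge_asof(
    left, right,
    left_on='DATE_HOUR', right_on='date_hour',
    left_by='CONTROL', right_by='control',   # match only within the same key
    direction='nearest',                      # closest timestamp, before or after
    tolerance=pd.Timedelta(seconds=3600),     # reject matches outside the window
)
# Rows with no match inside the window get NaT on the right-hand side.
merged['accepted'] = merged['date_hour'].notna()
```

One caveat: merge_asof picks the nearest right-hand row for each left-hand row independently, so the same dfb record can be matched to more than one dfa record. If each record may be used only once, an extra de-duplication step would be needed.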