Python: merging on timestamps that do not match exactly

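What methods are available for merging on columns whose timestamps do not match exactly?

DF1:

DF2:

I can join on ['date', 'employee_id', 'session_id'], but sometimes the same employee has several identical sessions on the same date, which produces duplicates. I could drop the rows where this happens, but then I would lose valid sessions.

If the timestamp in DF1 is close to, but not identical to, the corresponding timestamp in DF2, is there an efficient way to join?

Consider the following small version of the problem: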

from io import StringIO
from pandas import read_csv, to_datetime

# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5

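# datetime column (combination of date + start_time)
# note: combining columns through a nested parse_dates list is deprecated
# since pandas 2.2; on newer versions, build the column with to_datetime instead
dtc = [['date', 'start_time']]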

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)

This gives:

>>> df1
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:03:00      7261824   871631183
2 2016-01-01 11:01:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
>>> df2
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:10:00      7261824   871631185

When merging, you want to treat df2[0:3] as duplicates of df1[0:3] (because each pair is less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.
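To make the rule concrete, here is a small sketch that computes the gap between each pair of rows and compares it with the threshold. It rebuilds the timestamps by hand instead of reusing df1 and df2, and the names t1, t2, gaps, and is_duplicate are illustrative only.

```python
import pandas as pd

# timestamps copied from the example tables above
t1 = pd.to_datetime(["2016-01-01 02:03", "2016-01-01 06:03",
                     "2016-01-01 11:01", "2016-01-01 14:01"])
t2 = pd.to_datetime(["2016-01-01 02:03", "2016-01-01 06:05",
                     "2016-01-01 11:04", "2016-01-01 14:10"])

# absolute gap between row i of df1 and row i of df2
gaps = pd.Series(t2 - t1).abs()

# a pair counts as the same session if the gap is within 5 minutes
is_duplicate = gaps <= pd.Timedelta(minutes=5)

print(gaps.dt.total_seconds() / 60)   # gaps in minutes: 0, 2, 3, 9
print(is_duplicate.tolist())          # [True, True, True, False]
```

Only the last pair, 9 minutes apart, exceeds the threshold, so it is the only one that should survive as two separate sessions.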

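Solution 1: interval matching

This is basically what you proposed in your edit. You want to map the timestamps in both tables to 5-minute-wide intervals, each centered on a timestamp rounded to the nearest 5 minutes.

Each interval is uniquely identified by its midpoint, so you can merge the dataframes on the timestamps rounded to the nearest 5 minutes. For example: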

import numpy as np

# threshold (interval width) in nanoseconds
threshold_ns = threshold * 60 * 1e9

# compute "interval" to which each session belongs
df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)

# join
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

which prints:

             interval  employee_id  session_id
0 2016-01-01 02:05:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:00:00      7261824   871631184
3 2016-01-01 14:00:00      7261824   871631185
4 2016-01-01 11:05:00      7261824   871631184
5 2016-01-01 14:10:00      7261824   871631185

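Note that this is not entirely correct. Sessions df1[2] and df2[2] are not treated as duplicates, even though they are only 3 minutes apart. This is because they fall on opposite sides of an interval boundary.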
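The boundary effect is easy to reproduce in isolation. The sketch below uses `Timestamp.round` with a "5min" frequency as a stand-in for the nanosecond arithmetic above (an assumption made for brevity, not the code from the answer); the names a, b, bucket_a, and bucket_b are illustrative.

```python
import pandas as pd

# the two timestamps for session 871631184
a = pd.Timestamp("2016-01-01 11:01:00")   # from df1
b = pd.Timestamp("2016-01-01 11:04:00")   # from df2

# round each one to the nearest multiple of 5 minutes
bucket_a = a.round("5min")
bucket_b = b.round("5min")

print(b - a)                  # 0 days 00:03:00, well within the threshold
print(bucket_a, bucket_b)     # 11:00:00 and 11:05:00
print(bucket_a == bucket_b)   # False, so the merge keeps both rows
```

The boundary between the two buckets sits at 11:02:30, so any pair that straddles it is split regardless of how close the two timestamps are.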

Solution 2: one-to-one matching

Here is another approach, which relies on each session in df1 having either zero or one duplicate in df2.

We replace each timestamp in df1 with the closest timestamp in df2 that matches on employee_id and session_id and is less than 5 minutes away.

from datetime import timedelta

# get closest match from "df2" to row from "df1" (as long as it's within the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]

    # use the absolute gap so that earlier and later candidates are treated alike
    deltas = (matches.date_start_time - row.date_start_time).abs()
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]

    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items
        return row

# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)

# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

which prints:

      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
4 2016-01-01 14:10:00      7261824   871631185

This approach is much slower, because you have to search all of df2 for every row in df1. What I have written can probably be optimized further, but it will still take a long time on large datasets.

I would try pandas.merge_asof for this. The parameters you will be interested in are direction, tolerance, left_on, and right_on.

Building on @Igor's answer:

import pandas as pd

# df1 and df2 are built with read_csv exactly as in the setup at the top

df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])

# convert the timestamps to the index, so that the date_start_time columns
# are preserved and you can validate the merge logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']

# the magic happens below; check the direction and tolerance arguments
tol = pd.Timedelta('5 minutes')
pd.merge_asof(left=df1, right=df2, right_index=True, left_index=True, direction='nearest', tolerance=tol)

which gives:

                      date_start_time_x  employee_id_x  session_id_x   date_start_time_y  employee_id_y  session_id_y
date_start_time
2016-01-01 02:03:00 2016-01-01 02:03:00        7261824     871631182 2016-01-01 02:03:00      7261824.0   871631182.0
2016-01-01 06:03:00 2016-01-01 06:03:00        7261824     871631183 2016-01-01 06:05:00      7261824.0   871631183.0
2016-01-01 11:01:00 2016-01-01 11:01:00        7261824     871631184 2016-01-01 11:04:00      7261824.0   871631184.0
2016-01-01 14:01:00 2016-01-01 14:01:00        7261824     871631185                 NaT            NaN           NaN

I suggest using the built-in pandas Series.dt.round function to round both dataframes to a common time, for example to every 5 minutes. The times will then always have the form 01:00:00, then 01:05:00, and so on. That way both dataframes have similar time indexes on which to perform the merge. See the pandas documentation for Series.dt.round for details and examples.

Interesting question. The simplest solution is to round the timestamps to the nearest 5 minutes and merge, but sessions that happen to fall on opposite sides of a 5-minute mark will be left as separate rows. You could apply the procedure iteratively with random offsets, up to some number of iterations, which would give better results. The most robust solution is a clustering algorithm, but that is harder to implement.

Ideally you want a SQL-style WHERE clause on the join that constrains one of the dates to lie between two bounds derived from the other date. If doing this directly in a database is feasible, or in an in-memory database such as SQLite, I would recommend that. The hacks you would need in pandas are going to be ugly, and if you go the database route you can still pull the results into pandas afterwards for interactive work or anything else.

@Lance Is it guaranteed that each of the two dataframes contains truly unique sessions? That is, does the deduplication apply only when merging them? Or might the "same" session appear within one dataframe as two rows with slightly different timestamps?

Sorry, I still don't understand. Within a single dataframe, do you need to deduplicate sessions (allowing for small differences in the timestamps)?

This is a great start for me. Regarding your first solution, could we include a plus-or-minus range around the interval to prevent events from landing on the wrong side of it? The interval would be a string, as in the example I typed. I am not sure the logic is 100% correct, but I got it working in Excel on test data.

I think yours would run into the same problem. Consider that you are mapping a continuous time range onto discrete intervals. That means you can always think of a pair of timestamps that are close enough in the continuous range but fall into different intervals.
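One thing the merge_asof call above does not do is restrict matches to rows with the same employee_id and session_id. A minimal sketch of how that could look, assuming the `by` parameter of `pandas.merge_asof` is acceptable here; the data is rebuilt by hand, and the names date_start_time_df2, matched, and unmatched are illustrative, not from the original answers.

```python
import pandas as pd

ids = {"employee_id": [7261824] * 4,
       "session_id": [871631182, 871631183, 871631184, 871631185]}
df1 = pd.DataFrame({"date_start_time": pd.to_datetime(
    ["2016-01-01 02:03", "2016-01-01 06:03",
     "2016-01-01 11:01", "2016-01-01 14:01"]), **ids})
df2 = pd.DataFrame({"date_start_time": pd.to_datetime(
    ["2016-01-01 02:03", "2016-01-01 06:05",
     "2016-01-01 11:04", "2016-01-01 14:10"]), **ids})

# rename the right-hand key so both timestamps survive the merge
right = df2.rename(columns={"date_start_time": "date_start_time_df2"})

# both inputs must be sorted on their time keys for merge_asof
matched = pd.merge_asof(
    df1.sort_values("date_start_time"),
    right.sort_values("date_start_time_df2"),
    left_on="date_start_time",
    right_on="date_start_time_df2",
    by=["employee_id", "session_id"],   # only match the same employee and session
    direction="nearest",
    tolerance=pd.Timedelta(minutes=5),
)
print(matched)

# df2 rows that no df1 row claimed; this assumes df2 timestamps are unique
unmatched = df2[~df2["date_start_time"].isin(matched["date_start_time_df2"])]
print(unmatched)
```

Rows of df1 with no partner within 5 minutes get NaT in date_start_time_df2, and the leftover df2 rows in unmatched can be appended to obtain the full set of distinct sessions.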