Python: Is there a faster way to iterate through a DataFrame?

python pandas loops dataframe

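I am going through a DataFrame of time slots, comparing each slot with the other slots on the same day in order to find double bookings.

The script takes a while to run. Is there a faster way?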

df_temp = pd.DataFrame()
for date in df_cal["date"].unique():
    df_date = df_cal[df_cal["date"]==date]
    for current in range(len(df_date)):
        for comp in range(current+1,df_date[df_date["Start"]<df_date.iloc[current]["End"]]["Start"].idxmax()+1):
            df_date.loc[comp,"Double booked"] = True
            df_date.loc[current,"Double booked"] = True
            df_date.loc[comp,"Time_removed"] = max(df_date.loc[comp,"Time_removed"],pd.Timedelta(min(df_date.iloc[current]["End"] - df_date.iloc[comp]["Start"],\
                                                           df_date.iloc[comp]["End"] - df_date.iloc[comp]["Start"])))

    df_temp = pd.concat([df_temp,df_date])

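This produces something like the result below: double-booked meetings are flagged as double booked, and the overlapping time is removed from one of the meetings (here, from the second one). The columns are [["Meeting ID", "Start", "End", "Time_removed", "Double booked"]].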

Edit, new data, 9 July 2018:

    Start               End                 Time_removed  Double booked
77  2018-07-02 00:00:00 2018-07-02 10:00:00 00:00:00      True
78  2018-07-02 03:00:00 2018-07-02 08:00:00 05:00:00      True
79  2018-07-02 03:00:00 2018-07-02 08:00:00 05:00:00      True
80  2018-07-02 04:30:00 2018-07-02 09:30:00 03:30:00      True
81  2018-07-02 05:00:00 2018-07-02 10:00:00 04:30:00      True
82  2018-07-02 05:00:00 2018-07-02 10:00:00 05:00:00      True

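Row 80 should have 5 hours removed, but only 3:30 is removed, because it is compared with the row just before it. The Time_removed between rows 77 and 80 must have been computed earlier, but it was then overwritten by the smaller time difference.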
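
The arithmetic behind that expectation can be checked by hand. The following is only a minimal sketch: the helper name `overlap` is made up for illustration, and the timestamps are copied from rows 77, 79 and 80 of the table above.

```python
import pandas as pd

def overlap(start_a, end_a, start_b, end_b):
    """Return the time shared by two intervals (zero if they are disjoint)."""
    shared = min(end_a, end_b) - max(start_a, start_b)
    return max(shared, pd.Timedelta(0))

ts = pd.Timestamp
row_77 = (ts("2018-07-02 00:00"), ts("2018-07-02 10:00"))
row_79 = (ts("2018-07-02 03:00"), ts("2018-07-02 08:00"))
row_80 = (ts("2018-07-02 04:30"), ts("2018-07-02 09:30"))

overlap_77_80 = overlap(*row_77, *row_80)  # against the first meeting of the day
overlap_79_80 = overlap(*row_79, *row_80)  # against the row just before
print(overlap_77_80, overlap_79_80)
```

The first value is the 5 hours that row 80 should lose; the second is the 3:30 that ends up in the table.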

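This looks like a job for groupby. You can also use NumPy's ufunc.outer (for example np.less.outer) to eliminate the inner double for loop: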

import numpy as np
import pandas as pd

def process_data(df):
    # Positions (i, j) where meeting i starts before meeting j ends.
    pos = np.argwhere(np.less.outer(df['Start'], df['End']))
    indices = df.index[pos]
    unique = indices.ravel().unique()
    date_diff = np.subtract.outer(df['End'], df['Start']).max(axis=0)
    return pd.DataFrame(
        data=np.asarray([
            [True]*len(indices),
            np.where(
                np.isin(unique, indices[:, 1]),
                date_diff,
                np.nan
            )
        ]).T,
        columns=['Double booked', 'Time_removed'],
        index=unique
    )

df_cal.groupby('date').apply(process_data)

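In any case, this is based only on the OP's snippet. Without an example DataFrame and example output (i.e. some kind of unit test), it is hard to say whether it actually solves the problem.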

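Also, you have to make sure not to confuse index and position. In your question you seem to mix .loc and .iloc with the use of range. I am not sure this produces the result you want.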
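
To illustrate the index-versus-position point, here is a small sketch on a made-up frame (the data and the variable names are hypothetical, not taken from the question) whose index labels do not start at 0:

```python
import pandas as pd

# Hypothetical frame whose index labels (77, 78, 79) differ from positions (0, 1, 2).
df = pd.DataFrame({"Start": ["00:00", "03:00", "04:30"]}, index=[77, 78, 79])

by_position = df.iloc[0]["Start"]  # first row, whatever its label is
by_label = df.loc[77, "Start"]     # row whose index label is 77

# range(len(df)) yields positions 0, 1, 2; none of them is a label in this frame.
labels_hit = [i for i in range(len(df)) if i in df.index]
print(by_position, by_label, labels_hit)
```

With such an index, an assignment like df.loc[0, "Start"] = ... would add a new row labelled 0 instead of updating the first row, which is one way a loop over range combined with .loc can go wrong.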
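Edit

From the data added to the OP, it appears that the 'Date' variable actually depends on the 'Start' variable (i.e. it is just the date of the 'Start' datetime value). With that in mind, we can skip the groupby and apply the outer product directly to get the overlapping items: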
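import numpy as np

# Work on plain NumPy arrays so the outer comparisons and the 2-D indexing below
# do not depend on how pandas handles them.
start = df['Start'].to_numpy()
end = df['End'].to_numpy()
labels = df.index.to_numpy()

# overlapping[i, j]: meeting i starts no later than meeting j and ends after j starts.
overlapping = np.less_equal.outer(start, start) & np.greater.outer(end, start)
overlapping &= ~np.identity(len(df), dtype=bool)  # Meetings are overlapping with themselves; need to remove.
overlapping_indices = labels[np.argwhere(overlapping)]  # rows of (earlier label, later label)

df.loc[
    np.unique(overlapping_indices.ravel()),
    'Double booked'
] = True

# Use .to_numpy() so the two selections are combined by position, not aligned by label.
df.loc[
    overlapping_indices[:, 1],
    'Time_removed'
] = (
    np.minimum(df.loc[overlapping_indices[:, 0], 'End'].to_numpy(), df.loc[overlapping_indices[:, 1], 'End'].to_numpy())
    - np.maximum(df.loc[overlapping_indices[:, 0], 'Start'].to_numpy(), df.loc[overlapping_indices[:, 1], 'Start'].to_numpy())
)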
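
To make the outer comparison concrete, here is a small sketch on three made-up meetings, using plain integer hours instead of datetimes (the arrays `start` and `end` are hypothetical):

```python
import numpy as np

# Three hypothetical meetings as (start, end) hours: 0-10, 3-8 and 9-12.
start = np.array([0, 3, 9])
end = np.array([10, 8, 12])

# overlapping[i, j]: meeting i starts no later than meeting j and ends after j starts.
overlapping = np.less_equal.outer(start, start) & np.greater.outer(end, start)
overlapping &= ~np.identity(len(start), dtype=bool)  # drop self-comparisons

pairs = np.argwhere(overlapping)  # each row is (earlier meeting, later meeting)
print(pairs.tolist())
```

Meeting 0 overlaps both of the others, while meetings 1 and 2 do not overlap each other, so only the pairs (0, 1) and (0, 2) remain.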

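However, from the example data it is not clear how overlapping meetings should be flagged as double booked. For the 12:30:00-13:00:00 meetings you flagged only the second meeting, whereas for the 13:00:00-16:00:00 and 14:30:00-15:30:00 meetings you flagged both as double booked.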
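Edit 2

To account for multiple (three or more) overlapping meetings, we need to compute the overlap time for every pair of meetings and then, among the pairs with a (positive) overlap, take the maximum overlap. The following solution requires the data to be sorted by start time: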
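# This requires the data frame to be sorted by `Start` time.
import numpy as np
import pandas as pd

start_times = np.tile(df['Start'].values, (len(df), 1))
end_times = np.tile(df['End'].values, (len(df), 1))
# overlap_times[i, j] is the time shared by meetings i and j (negative if they are disjoint);
# np.triu keeps only the pairs in which column j is the later meeting.
overlap_times = np.triu(np.minimum(end_times, end_times.T) - np.maximum(start_times, start_times.T))
overlap_times[np.diag_indices(len(overlap_times))] = np.timedelta64(0, 'ns')
# Map (row, column) positions to index labels through a plain NumPy array,
# because recent pandas versions reject 2-D indexing on an Index.
overlap_indices = df.index.to_numpy()[np.argwhere(overlap_times > np.timedelta64(0, 'ns'))]
overlaps_others_indices = np.unique(overlap_indices[:, 1])

df.loc[
    np.unique(overlap_indices.ravel()),
    'Double booked'
] = True

# For each later meeting, remove the largest overlap with any earlier meeting.
df.loc[
    overlaps_others_indices,
    'Time_removed'
] = pd.Series(overlap_times.max(axis=0), index=df.index).loc[overlaps_others_indices]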
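
As a rough check, the same idea can be run on the six rows from the question. This is a sketch of a variant, not the exact code above: the function name `flag_double_bookings` is made up, broadcasting replaces np.tile, and rows without an earlier overlap get a Time_removed of zero instead of being left unchanged. It still assumes the frame is sorted by Start.

```python
import numpy as np
import pandas as pd

def flag_double_bookings(df):
    """Sketch only; assumes df is sorted by 'Start'."""
    df = df.copy()
    start = df["Start"].to_numpy()
    end = df["End"].to_numpy()
    # Pairwise shared time; negative entries mean the two meetings are disjoint.
    shared = np.minimum(end[:, None], end[None, :]) - np.maximum(start[:, None], start[None, :])
    # Keep only (earlier row, later column) pairs; k=1 also zeroes the diagonal.
    shared = np.triu(shared, k=1)
    has_overlap = shared > np.timedelta64(0, "ns")
    df["Double booked"] = has_overlap.any(axis=0) | has_overlap.any(axis=1)
    # Largest overlap with any earlier meeting; zero when there is none.
    df["Time_removed"] = shared.max(axis=0)
    return df

# The six rows from the question's edited sample data.
df = pd.DataFrame(
    {
        "Start": ["2018-07-02 00:00", "2018-07-02 03:00", "2018-07-02 03:00",
                  "2018-07-02 04:30", "2018-07-02 05:00", "2018-07-02 05:00"],
        "End": ["2018-07-02 10:00", "2018-07-02 08:00", "2018-07-02 08:00",
                "2018-07-02 09:30", "2018-07-02 10:00", "2018-07-02 10:00"],
    },
    index=[77, 78, 79, 80, 81, 82],
)
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])

result = flag_double_bookings(df)
print(result)
```

On this sample, row 80 is compared against row 77 as well as its immediate neighbours, so the largest overlap (5 hours) is the one that is kept.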

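Use df.iterrows() to iterate over the rows of a DataFrame.

Can you add a data sample?

Start and End are datetimes, Double booked is a boolean, and Time_removed is a timedelta.

Thank you very much! I think groupby is the right choice, but it still doesn't quite work. I have added some input and output data above to show how it should work.

Thank you very much! I want to flag both meetings as double booked, but only remove time from the meeting that starts latest. That was a problem when both are 12:30-13:00, because it removed 0:30 from both, but I fixed it as shown: df_cal.loc[np.max(overlapping_indices[:, 0], overlapping_indices[:, 1]), "Time_removed"] = (

Thanks again for your help. Do you have any online resources you could recommend to help me improve?

@GustavFjorder Using loc (with the index) to pick out the later meeting requires the meetings to be ordered in the data frame. A safer approach is to consider only the upper triangle of the overlapping matrix (later meetings are stored as columns). So, before extracting the indices, you can do overlapping = np.

This sets the lower triangle to zero (so only later meetings are considered). As for resources, I suggest getting familiar with numpy, since it can improve performance dramatically (just search online for a numpy tutorial).

Awesome :) I just noticed that too little time is removed when we have triple bookings. I updated the data above with an example in which all meetings fall between 00:00 and 10:00, so only the first meeting should keep its time and all the others should be removed entirely, but that is not what happens.

@GustavFjorder See my updated answer. It now requires the data frame to be sorted by Start time, however (so that the upper triangle of the overlaps really does represent "later" meetings). It can probably also be done without sorting first, but that is best asked as a separate question.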
