Improving the performance of a Python for loop with a DataFrame


python performance pandas

Consider the following DataFrame df:

timestamp    id condition
           1234         A
           2323         B
           3843         B
           1234         C
           8574         A
           9483         A

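Based on the value in the condition column, I have to define a new column in this DataFrame that counts the number of ids currently in that condition. Note, however, that because the DataFrame is sorted by the timestamp column, the same id can appear in several rows, so a simple .cumsum() is not a viable option.

I have come up with the following code, which works correctly but is very slow: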

import numpy as np

# Start by defining empty arrays, one per condition
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)

# Initialize the new column
df['count'] = 0

# Use a for loop to do the task, but this is sooo slow!
for r in range(df.shape[0]):
    if df.condition[r] == 'A':
        ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
    elif df.condition[r] == 'B':
        ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
        ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
    elif df.condition[r] == 'C':
        ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])

    # df.count is a DataFrame method, so write to the column through .loc
    df.loc[r, 'count'] = ids_with_condition_a.size

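Keeping these NumPy arrays is very useful to me, because they give the list of ids in a particular condition. I would also like to be able to put these arrays into the corresponding cells of the df DataFrame.

Can you suggest a better solution in terms of performance?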

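You need to use groupby on the 'condition' column and cumcount to count how many ids are in each condition up to the current row (which seems to be what your code does):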
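A minimal sketch of this step, assuming df holds the id and condition columns of the sample data from the question and that the count should start at 1 rather than 0:

```python
import pandas as pd

# Sample data from the question (the timestamp values are not given there)
df = pd.DataFrame({
    "id": [1234, 2323, 3843, 1234, 8574, 9483],
    "condition": ["A", "B", "B", "C", "A", "A"],
})

# cumcount() numbers the rows inside each group starting at 0,
# so add 1 to make the running count start at 1
df["count"] = df.groupby("condition").cumcount() + 1
print(df)
```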

With the sample input, you get:

     id condition  count
0  1234         A      1
1  2323         B      1
2  3843         B      2
3  1234         C      1
4  8574         A      2
5  9483         A      3

This is much faster than using a for loop.

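If you only want the rows with condition A, you can use a mask: for example, if you run print(df[df['condition'] == 'A']), you will see only the rows where condition is equal to A. So, to get an array: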

arr_A = df.loc[df['condition'] == 'A', 'id'].values
print(arr_A)
# [1234 8574 9483]

EDIT: to create two columns for each condition, you can do the following for condition A:

import numpy as np  # pd.np is no longer available in recent pandas versions

# put 1 in a column where the condition is met, NaN elsewhere
df['nb_cond_A'] = np.where(df['condition'] == 'A', 1, np.nan)
# then use cumsum to increment the number, ffill to carry the same number down
# where the condition is not met, and fillna(0) to fill any remaining missing values
df['nb_cond_A'] = df['nb_cond_A'].cumsum().ffill().fillna(0).astype(int)
# for the partial list, first create the full array
arr_A = df.loc[df['condition'] == 'A', 'id'].values
# create the column with apply (another way might exist, but this is one)
df['partial_arr_A'] = df['nb_cond_A'].apply(lambda x: arr_A[:x])

The output looks like this:

     id condition  nb_condition_A       partial_arr_A  nb_cond_A
0  1234         A               1              [1234]          1
1  2323         B               1              [1234]          1
2  3843         B               1              [1234]          1
3  1234         C               1              [1234]          1
4  8574         A               2        [1234, 8574]          2
5  9483         A               3  [1234, 8574, 9483]          3

The same thing applies to B and C. A loop such as for cond in set(df['condition']) would probably work to generalize this.
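One way the generalization could look is sketched below. It assumes the sample data from the question; the column names nb_cond_<cond> and partial_arr_<cond> simply extend the naming used for condition A, and the boolean mask is summed directly with cumsum, which gives the same running count as the where/cumsum/ffill chain.

```python
import pandas as pd

# Sample data from the question (the timestamp values are not given there)
df = pd.DataFrame({
    "id": [1234, 2323, 3843, 1234, 8574, 9483],
    "condition": ["A", "B", "B", "C", "A", "A"],
})

for cond in sorted(set(df["condition"])):
    mask = df["condition"] == cond
    # running number of rows seen so far with this condition
    counts = mask.cumsum()
    df[f"nb_cond_{cond}"] = counts
    # full array of ids for this condition, then one growing slice per row
    ids = df.loc[mask, "id"].to_numpy()
    df[f"partial_arr_{cond}"] = pd.Series(
        [ids[:n].tolist() for n in counts], index=df.index
    )

print(df)
```

Like the single-condition version, this counts every row that has had a given condition so far; it does not remove an id from a condition once that id moves to another one.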

EDIT 2: one idea is to do what you explained in the comments, though I am not sure it will improve performance:

# array of unique conditions
arr_cond = df.condition.unique()
# use apply to build, row by row, the list of ids for each condition
df[arr_cond] = (df.apply(lambda row: (df.loc[:row.name].drop_duplicates('id', keep='last')
                                          .groupby('condition').id.apply(list)), axis=1)
                  .applymap(lambda x: [] if not isinstance(x, list) else x))

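Some explanation: for each row, select the part of the DataFrame up to that row with loc[:row.name], drop the duplicated 'id' values while keeping the last one with drop_duplicates('id', keep='last') (in your example, this means that once we reach row 3, row 0 is dropped because id 1234 appears twice), then group the data by condition with groupby('condition'), and put the ids of each condition into a single list with id.apply(list). The part starting with applymap fills the missing values with empty lists (fillna([]) cannot be used here; it is not possible).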
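To make these steps concrete, here is a small sketch that applies them to a single row (row 3) of the sample data from the question. The variable names are only illustrative.

```python
import pandas as pd

# Sample data from the question (the timestamp values are not given there)
df = pd.DataFrame({
    "id": [1234, 2323, 3843, 1234, 8574, 9483],
    "condition": ["A", "B", "B", "C", "A", "A"],
})

row_label = 3
# 1. keep the rows up to and including the current one (label slicing is inclusive)
seen_so_far = df.loc[:row_label]
# 2. keep only the latest entry of each id: row 0 (1234, A) is dropped
latest = seen_so_far.drop_duplicates("id", keep="last")
# 3. collect the ids of each condition into a list
ids_by_condition = latest.groupby("condition")["id"].apply(list)
print(ids_by_condition.to_dict())
```

At row 3 no id is left in condition A, so A is absent from the grouped result; this is the missing value that the applymap step replaces with an empty list.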

For the length of each condition's list, you can do the following:

for cond in arr_cond:
    df['len_{}'.format(cond)] = df[cond].str.len().fillna(0).astype(int)

The result is as follows:

     id condition             A             B       C  len_A  len_B  len_C
0  1234         A        [1234]            []      []      1      0      0
1  2323         B        [1234]        [2323]      []      1      1      0
2  3843         B        [1234]  [2323, 3843]      []      1      2      0
3  1234         C            []  [2323, 3843]  [1234]      0      2      1
4  8574         A        [8574]  [2323, 3843]  [1234]      1      2      1
5  9483         A  [8574, 9483]  [2323, 3843]  [1234]      2      2      1

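What is the expected output?

Hi, I must admit your solution is very elegant, and thank you for improving my knowledge. I would like to understand whether it is possible to store arr_A in each row (in a dedicated column), in order to get the list of ids that meet a particular condition at each timestamp. To be honest, I am more interested in the size of this array: I need to keep track of the number of ids that change from one condition to another at each timestamp.

@espogian The timestamp part is not that clear to me, as there is none in your example (besides the column name). Do you mean that the example you gave is for a single timestamp only, and that you have other timestamps with several other rows of ids and conditions?

Each row has its own timestamp, e.g. one row per second. The purpose is to know, for example at row 3, how many ids meet condition A, how many meet condition B, etc.

@espogian So, to be sure, you would want 2 columns per condition: one with the number of rows meeting this condition up to the current row (even if that row does not have the same condition), and one with the list of ids meeting this condition up to that row?

The best thing, which I am still unable to do, would be to have one column per condition with the list of ids that satisfy that condition, and one column per condition with the number of ids that satisfy it.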