Python: Is it possible to avoid looping over DataFrame rows when indexing the previous or next row?

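I have a dataset, and every time it reaches zero I want to assign it a separate unique value.

The code I came up with seems very slow, and I suspect there must be a faster way.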

import time
import pandas as pd
import numpy as np

#--------------------------------
#     DEBUG TEST DATASET
#--------------------------------
#Create random test data
series_random = np.random.randint(low=1, high=10, size=(10000,1))

#Insert zeros at known points (this should result in six motion IDs)
series_random[[5,6,7,15,100,2000,5000]] = 0

#Create data frame from test series
df = pd.DataFrame(series_random, columns=['Speed'])
#--------------------------------

#Elapsed time counter
Elapsed_ms = time.time()

#Set Motion ID variable
Motion_ID = 0

#Create series with Motion IDs
df.loc[:,'Motion ID'] = 0

#Iterate through each row of df
for i in range(df.index.min()+1, df.index.max()+1):

    #Set Motion ID to latest value
    df.loc[i, 'Motion ID'] = Motion_ID

    #If previous speed was zero and current speed is >0, then new motion detected        
    if df.loc[i-1, 'Speed'] == 0 and df.loc[i, 'Speed'] > 0:
        Motion_ID += 1
        df.loc[i, 'Motion ID'] = Motion_ID

        #Include first zero value in new Motion ID (for plotting purposes)
        df.loc[i-1, 'Motion ID'] = Motion_ID

Elapsed_ms = int((time.time() - Elapsed_ms) * 1000)

print('Result: {} records checked, {} unique trips identified in {} ms'.format(len(df.index),df['Motion ID'].nunique(),Elapsed_ms))

The output of the above code is:

Result: 10000 records checked, 6 unique trips identified in 6879 ms


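My actual dataset is much larger, so even for this small example I am surprised that such a seemingly simple operation takes this long.

You can express the logic with boolean arrays and expressions in numpy, without any loop: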

def get_motion_id(speed):
    # Accept a Series or an array; work on the underlying NumPy values
    speed = np.asarray(speed)
    mask = np.zeros(speed.size, dtype=bool)

    # mask[i] == True if Speed[i - 1] == 0 and Speed[i] > 0
    mask[1:] = speed[:-1] == 0
    mask &= speed > 0

    # Taking the cumsum increases the motion_id by one where mask is True
    motion_id = mask.astype(int).cumsum()
    # Carry over beginning of a motion to the preceding step with Speed == 0
    motion_id[:-1] = motion_id[1:]
    return motion_id


# small demo example
df = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
df['Motion_ID'] = get_motion_id(df['Speed'])
print(df)
   Speed  Motion_ID
0      3          0
1      0          1
2      1          1
3      2          1
4      0          2
5      1          2

For your 10000-row example, I see a speedup of roughly 800x:

%time df['Motion_ID'] = get_motion_id(df['Speed'])
CPU times: user 5.26 ms, sys: 3.18 ms, total: 8.43 ms
Wall time: 8.01 ms
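
As a hedged alternative sketch, the same shift-and-cumsum idea can also be written with pandas methods only. The function name get_motion_id_pandas is hypothetical and not part of the answer above; it assumes Speed is a numeric Series.

```python
import pandas as pd


def get_motion_id_pandas(speed: pd.Series) -> pd.Series:
    """Hypothetical pandas-only variant of get_motion_id (a sketch, not the answer's code)."""
    # True where the previous speed was zero and the current speed is positive
    starts = speed.shift(1).eq(0) & speed.gt(0)

    # Each start increases the running ID by one
    motion_id = starts.astype(int).cumsum()

    # Pull every start back by one row so the preceding zero joins the new motion;
    # the last row has no successor, so it keeps the previous value (0 if only one row)
    return motion_id.shift(-1).ffill().fillna(0).astype(int)


df_demo = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
df_demo['Motion_ID'] = get_motion_id_pandas(df_demo['Speed'])
print(df_demo)
```

On the small demo frame this yields the same IDs as get_motion_id: 0, 1, 1, 1, 2, 2.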

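Another approach is to extract from df the index values where Speed is 0, and then iterate over those index values to check and assign the Motion ID values. See the following code: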

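Motion_ID = 0

#Create series with Motion IDs
df.loc[:,'Motion ID'] = 0
i = 0

#Iterate only over the rows where Speed is zero
for index_val in sorted(df[df['Speed'] == 0].index):
    #Fill the segment that ends at this zero with the current Motion ID
    df.loc[i:index_val,'Motion ID'] = Motion_ID
    i = index_val
    #A zero followed by a positive speed starts a new motion
    #(assumes the last row is not zero, otherwise index_val+1 does not exist)
    if df.loc[index_val+1, 'Speed'] > 0:
        Motion_ID += 1

#Fill the remaining rows after the last zero
#Note: Motion_ID+1 skips one ID here; Motion_ID would continue the sequence
df.loc[i:df.index.max(),'Motion ID'] = Motion_ID+1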

Output:

Result: 10000 records checked, 6 unique trips identified in 49 ms

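You can do this with a while loop that keeps going until it finds the next 0. Assign one unique number to all the values before it, and another number to the next segment. For example, starting from 0, you find a 0 at position 9. Fill the data frame with 0 up to position 9, then repeat the same procedure starting from position 9.

I like this approach, because adding more conditions is very easy and there is no loop at all. Thanks!

@RockyK Yes, I think using numpy arrays in Python is usually a very good fit for this kind of problem, because numpy is so much faster than plain Python, even when it is wasteful from an algorithmic point of view (extra memory, sometimes unnecessary passes over the data, etc.).

Although it still uses a loop, it is much faster. However, the final assignment (the one using Motion_ID+1) makes the last Motion ID jump by 2 relative to the previous one. Easy to fix.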
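
The fix mentioned in the last comment could look roughly like the sketch below. The helper name motion_id_by_zero_index is hypothetical, and the code assumes a default integer RangeIndex and a 'Speed' column. It assigns Motion_ID rather than Motion_ID+1 after the last zero, and it also guards against a zero in the final row, which the posted loop does not handle.

```python
import pandas as pd


def motion_id_by_zero_index(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: loop over zero-speed rows only (assumes a default RangeIndex)."""
    out = pd.Series(0, index=df.index)
    motion_id = 0
    start = df.index.min()
    last = df.index.max()

    # Visit only the rows where Speed is zero, in index order
    zero_rows = df.index[(df['Speed'] == 0).to_numpy()]
    for index_val in zero_rows:
        # Fill the segment ending at this zero with the current ID (label slices are inclusive)
        out.loc[start:index_val] = motion_id
        start = index_val
        # A zero followed by a positive speed starts a new motion;
        # the bounds check avoids looking past the last row
        if index_val < last and df.loc[index_val + 1, 'Speed'] > 0:
            motion_id += 1

    # Remaining rows continue with the current ID (not motion_id + 1)
    out.loc[start:last] = motion_id
    return out


df_demo = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
df_demo['Motion ID'] = motion_id_by_zero_index(df_demo)
print(df_demo)
```

On the small demo frame from the numpy answer this gives 0, 1, 1, 1, 2, 2, so the IDs stay consecutive.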