Pandas Python: looping through a DataFrame using something other than .iterrows()


Here is a simplified dataset:

   Character    x0    x1
0          T   0.0   1.0
1          h   1.1   2.1
2          i   2.2   3.2
3          s   3.3   4.3
5          i   5.5   6.5
6          s   6.6   7.6
8          a   8.8   9.8
10         s  11.0  12.0
11         a  12.1  13.1
12         m  13.2  14.2
13         p  14.3  15.3
14         l  15.4  16.4
15         e  16.5  17.5
16         .  17.6  18.6

The simplified dataset was generated with the following code:

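import pandas as pd

# Build one row per character, with start (x0) and end (x1) positions.
ch = ['T']
x0 = [0]
x1 = [1]
string = 'his is a sample.'
for s in string:
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))  # starts 0.1 after the previous end
    x1.append(round(x0[-1] + 1, 1))    # each character is 1 wide

df = pd.DataFrame(list(zip(ch, x0, x1)), columns=['Character', 'x0', 'x1'])
# Drop the spaces, which leaves larger gaps between words.
df = df.drop(df.loc[df['Character'] == ' '].index)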

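x0 and x1 are the start and end positions of each character. Assume the distance between any two adjacent characters is 0.1. In other words, if the difference between a character's x0 and the previous character's x1 is 0.1, the two characters belong to the same string. If the difference is greater than 0.1, the character starts a new string, and so on. I need to produce a DataFrame of the strings with their respective x0 and x1, which I currently do by looping over the DataFrame with .iterrows().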
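The loop itself is not shown in the question. A minimal sketch of what such an .iterrows() loop might look like, assuming a small tolerance for floating-point noise (the names tol, rows, current, and result are hypothetical, not taken from the original):

```python
import pandas as pd

# Rebuild the simplified dataset from the question.
ch, x0, x1 = ['T'], [0], [1]
for s in 'his is a sample.':
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame({'Character': ch, 'x0': x0, 'x1': x1})
df = df[df['Character'] != ' ']

tol = 1e-5      # assumed tolerance for floating-point noise
rows = []       # finished strings as [text, x0, x1]
current = None  # string currently being built

for _, row in df.iterrows():
    # A gap larger than 0.1 (plus tolerance) starts a new string.
    if current is None or row['x0'] - current[2] > 0.1 + tol:
        if current is not None:
            rows.append(current)
        current = [row['Character'], row['x0'], row['x1']]
    else:
        # Same string: append the character and extend the end position.
        current[0] += row['Character']
        current[2] = row['x1']
if current is not None:
    rows.append(current)

result = pd.DataFrame(rows, columns=['String', 'x0', 'x1'])
print(result)
```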

The result is as follows:

    String    x0    x1
0     This   0.0   4.3
1       is   5.5   7.6
2        a   8.8   9.8
3  sample.  11.0  18.6

Is there a faster way to achieve this?

You can use + (referred to as groupby+agg in the comments below):
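The answer's full code block did not survive in this copy. A minimal sketch consistent with the fragments quoted further down, assuming the missing final step is a groupby on grouper followed by agg (the aggregation dictionary and the name res are assumptions):

```python
import pandas as pd

# Rebuild the simplified dataset from the question.
ch, x0, x1 = ['T'], [0], [1]
for s in 'his is a sample.':
    ch.append(s)
    x0.append(round(x1[-1] + 0.1, 1))
    x1.append(round(x0[-1] + 1, 1))
df = pd.DataFrame({'Character': ch, 'x0': x0, 'x1': x1})
df = df[df['Character'] != ' ']

# Gap between each character's x0 and the previous character's x1
# (the first row has no predecessor, so its gap is set to 0).
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs()

# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()

# Join the characters of each group; keep the first x0 and the last x1.
res = df.groupby(grouper).agg({'Character': ''.join, 'x0': 'first', 'x1': 'last'})
print(res)
```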

Output:

  Character    x0    x1
0      This   0.0   4.3
1        is   5.5   7.6
2         a   8.8   9.8
3   sample.  11.0  18.6

The tricky part is:

# create grouper column, had to use this because of problems with floating point
grouper = ((same - 0.1) > 0.00001).cumsum()

The idea is to convert the diff column (same) into a True/False column in which every True marks the point where a new group has to start.

cumsum then takes care of assigning the same id to every row of a group.
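To see why a cumulative sum over the flags yields group ids, here is a small standalone illustration with made-up flags (the names starts and group_id are hypothetical):

```python
import pandas as pd

# True marks the first row of a new group (made-up flags).
starts = pd.Series([False, False, True, False, True, True, False])

# Each True bumps the running total by 1, so all rows up to the
# next True share the same integer id.
group_id = starts.cumsum()
print(group_id.tolist())  # [0, 0, 1, 1, 2, 3, 3]
```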

As suggested by @ShubhamSharma, you can instead do:

# flag rows whose gap to the previous x1 exceeds 0.1 (rounded to avoid floating-point noise)
same = (df['x0'] - df['x1'].shift().fillna(df['x0'])).abs().round(3).gt(.1)

# create grouper column: each True starts a new group
grouper = same.cumsum()

The rest stays the same.
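The reason some rounding or tolerance is needed at all can be shown with two gaps taken from the sample data. This is only an illustration; the names gap, naive, and rounded are hypothetical:

```python
import pandas as pd

# Two gaps from the sample data: within a word (1.1 - 1.0)
# and between two words (5.5 - 4.3).
gap = pd.Series([1.1 - 1.0, 5.5 - 4.3])
print(gap.tolist())  # the first value is slightly above 0.1 in binary floating point

# A direct comparison wrongly flags the within-word gap as a break.
naive = gap.gt(0.1)

# Rounding to a fixed precision first removes the noise.
rounded = gap.round(3).gt(0.1)

print(naive.tolist())    # [True, True]
print(rounded.tolist())  # [False, True]
```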

Nice answer. Maybe you could round the values to a fixed precision and then do something like

(df['x0'] - df['x1'].shift().fillna(df['x0'])).round(3).gt(.1)

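@ShubhamSharma I included your suggestion, thanks!

Great, thank you both. However, iterrows() still seems to run faster (1.55 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)), while groupby+agg takes 3 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each). Is there a faster option?

Hi @IvanC, I'm really surprised that iterrows is faster; it is usually very slow. What data did you test with?

Hi @DaniMesejo, you're right! When I increased the dataset to about 17,000 rows, iterrows took 2.1 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), while groupby+agg took 62.9 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each).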
