Python: How to split strings in pandas by punctuation


I have a DataFrame that looks like this:

      word    start  stop      speaker
0      but,   2.72  2.85        2
1    that's   2.85  3.09        2
2   alright   3.09  3.47        2
3     we'll   8.43  8.69        1
4      have   8.69  8.97        1
5        to   8.97  9.07        1
6     okay,   9.19 10.01        2
7     sure.  10.02 11.01        2
8     what?  11.02 12.00        1
9         i  12.01 13.00        2
10    agree, 13.01 14.00        2
11       but 14.01 15.00        2
12       i   15.01 16.00        2
13  disagree 16.01 17.00        2
14   that's  17.01 18.00        1
15    fine,  18.01 19.00        1 
16   however 19.01 20.00        1         
17       you 20.01 21.00        1
18       are 21.01 22.00        1
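I want to group the words in "word" together, starting a new group whenever the speaker changes or a punctuation mark appears (apostrophes excluded). Besides grouping the words, I also want each group to take the "start" of its first word and the "stop" of its last word. The result I want looks like this: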

       word        start  stop speaker
0                but,  2.72  2.85  2
1      that's alright  2.85  3.47  2
2       we'll have to  8.43  9.07  1
3               okay,  9.19  10.01 2
4               sure. 10.02  11.01 2
5               what? 11.02  12.00 1
6            I agree, 12.01  14.00 2
7      but i disagree 14.01  17.00 2
8        that's fine, 17.01  19.00 1
9     however you are 19.01  22.00 1

Any suggestions on how to accomplish this would be greatly appreciated.
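For anyone who wants to reproduce the problem, the sample frame can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table in the question, and the variable name df is an assumption chosen to match the answers.

```python
import pandas as pd

# Values copied from the table in the question; `df` is an assumed name.
df = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have", "to", "okay,",
             "sure.", "what?", "i", "agree,", "but", "i", "disagree",
             "that's", "fine,", "however", "you", "are"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01,
              13.01, 14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00,
             14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1],
})

print(df)
```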

You can check whether the last character is in a list of punctuation marks, and group by a reversed cumulative sum:

punctuation = list(',.?!')

s = (df['word'].str.strip().str[-1].isin(punctuation) # punctuation
     | df['speaker'].ne(df['speaker'].shift(-1))      # speaker change
    )
s = s.iloc[::-1].cumsum().iloc[::-1]

# reverse order of s
s = s.max()-s

df.groupby(s).agg({'word':' '.join, 'start':'min', 'stop':'max', 'speaker': 'min'})
Output:

              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         i agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1
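To see why the reversed cumulative sum yields group labels, it may help to trace the intermediate values on a small frame. The sketch below uses the first five rows of the question's data; the names toy, ends_group, remaining and group_id are made up for illustration.

```python
import pandas as pd

# First five rows of the question's data (names here are illustrative only).
toy = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have"],
    "speaker": [2, 2, 2, 1, 1],
})

# True on the LAST row of each group: the word ends in punctuation,
# or the next row belongs to a different speaker (or there is no next row).
ends_group = (
    toy["word"].str.strip().str[-1].isin(list(",.?!"))
    | toy["speaker"].ne(toy["speaker"].shift(-1))
)
print(ends_group.tolist())  # [True, False, True, False, True]

# Counting group ends from the bottom up gives every row in a group
# the same number, because the marker sits on the group's last row.
remaining = ends_group.iloc[::-1].cumsum().iloc[::-1]
print(remaining.tolist())   # [3, 2, 2, 1, 1]

# Flip the numbering so that the first group is 0.
group_id = remaining.max() - remaining
print(group_id.tolist())    # [0, 1, 1, 2, 2]
```

A plain forward cumsum would not work here, since the marker row would start the next group instead of closing the current one.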

Try str.extract to build the grouping keys s1 and s2, then group and aggregate with agg:

s = df.word.str.extract(r"([^\w\s'])", expand=False).notna()
s1 = s.cumsum() - s
s2 = df.speaker.diff().ne(0).cumsum()

(df.groupby([s1, s2], sort=False, as_index=False)
   .agg({'word': ' '.join, 'start': 'first', 'stop': 'last', 'speaker': 'first'}))

Out[70]:
              word  start   stop  speaker
0             but,   2.72   2.85        2
1   that's alright   2.85   3.47        2
2    we'll have to   8.43   9.07        1
3            okay,   9.19  10.01        2
4            sure.  10.02  11.01        2
5            what?  11.02  12.00        1
6         i agree,  12.01  14.00        2
7   but i disagree  14.01  17.00        2
8     that's fine,  17.01  19.00        1
9  however you are  19.01  22.00        1
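The two keys do different jobs, which is easier to see on a small frame. The sketch below uses the first five rows of the question's data with descriptive names (has_punct, punct_key, speaker_key) that are not in the answer. It also drops the group index with reset_index rather than as_index=False; that is a substitution made here to keep the example independent of how a given pandas version treats external grouping keys.

```python
import pandas as pd

# First five rows of the question's data (names here are illustrative only).
toy = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97],
    "speaker": [2, 2, 2, 1, 1],
})

# True where the word contains punctuation other than an apostrophe.
has_punct = toy["word"].str.extract(r"([^\w\s'])", expand=False).notna()

# cumsum bumps the counter ON the punctuated row; subtracting the flag
# keeps that row in the group it closes instead of opening a new one.
punct_key = has_punct.cumsum() - has_punct
print(punct_key.tolist())    # [0, 1, 1, 1, 1]

# A new id every time the speaker differs from the previous row.
speaker_key = toy["speaker"].diff().ne(0).cumsum()
print(speaker_key.tolist())  # [1, 1, 1, 2, 2]

# punct_key alone would merge "alright" with "we'll"; the speaker key splits them.
result = (
    toy.groupby([punct_key, speaker_key], sort=False)
       .agg({"word": " ".join, "start": "first", "stop": "last", "speaker": "first"})
       .reset_index(drop=True)
)
print(result["word"].tolist())  # ['but,', "that's alright", "we'll have"]
```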

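I don't see any reason why "however" and "you are" are not in the same sentence?
Thanks for catching that. That was my typo. I have edited my example to reflect this.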