Python: how to split strings in pandas by punctuation
I have a DataFrame that looks like this:
word start stop speaker
0 but, 2.72 2.85 2
1 that's 2.85 3.09 2
2 alright 3.09 3.47 2
3 we'll 8.43 8.69 1
4 have 8.69 8.97 1
5 to 8.97 9.07 1
6 okay, 9.19 10.01 2
7 sure. 10.02 11.01 2
8 what? 11.02 12.00 1
9 i 12.01 13.00 2
10 agree, 13.01 14.00 2
11 but 14.01 15.00 2
12 i 15.01 16.00 2
13 disagree 16.01 17.00 2
14 that's 17.01 18.00 1
15 fine, 18.01 19.00 1
16 however 19.01 20.00 1
17 you 20.01 21.00 1
18 are 21.01 22.00 1
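I want to group the words in "word" so that a group ends whenever the speaker changes or a punctuation mark appears (apostrophes excluded). Besides joining the words, each group should take the "start" of its first word and the "stop" of its last word. The result I want looks like this: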
word start stop speaker
0 but, 2.72 2.85 2
1 that's alright 2.85 3.47 2
2 we'll have to 8.43 9.07 1
3 okay, 9.19 10.01 2
4 sure. 10.02 11.01 2
5 what? 11.02 12.00 1
6 i agree, 12.01 14.00 2
7 but i disagree 14.01 17.00 2
8 that's fine, 17.01 19.00 1
9 however you are 19.01 22.00 1
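Any suggestions for accomplishing this would be appreciated.

You can check whether the last character of each word is in a list of punctuation marks, and then group by a reversed cumulative sum: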
punctuation = list(',.?!')
s = (df['word'].str.strip().str[-1].isin(punctuation) # punctuation
| df['speaker'].ne(df['speaker'].shift(-1)) # speaker change
)
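# cumulative sum from the bottom up: rows in the same group share a value
s = s.iloc[::-1].cumsum().iloc[::-1]
# flip the numbering so that the first group gets id 0
s = s.max()-s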
df.groupby(s).agg({'word':' '.join, 'start':'min', 'stop':'max', 'speaker': 'min'})
Output:
word start stop speaker
0 but, 2.72 2.85 2
1 that's alright 2.85 3.47 2
2 we'll have to 8.43 9.07 1
3 okay, 9.19 10.01 2
4 sure. 10.02 11.01 2
5 what? 11.02 12.00 1
6 i agree, 12.01 14.00 2
7 but i disagree 14.01 17.00 2
8 that's fine, 17.01 19.00 1
9 however you are 19.01 22.00 1
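For anyone who wants to run the answer above end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and wraps the same steps in a function; the name group_words and its punctuation parameter are made up for illustration and are not part of pandas or of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have", "to", "okay,",
             "sure.", "what?", "i", "agree,", "but", "i", "disagree",
             "that's", "fine,", "however", "you", "are"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01,
              13.01, 14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00,
             14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1],
})


def group_words(frame, punctuation=",.?!"):
    """Hypothetical helper: merge rows up to each punctuation mark or speaker change."""
    # True on the last row of every group: the word ends in punctuation,
    # or the next row belongs to a different speaker (or there is no next row).
    ends_group = (
        frame["word"].str.strip().str[-1].isin(list(punctuation))
        | frame["speaker"].ne(frame["speaker"].shift(-1))
    )
    # Counting the end markers from the bottom up gives every row of a
    # group the same number; subtracting from the max makes ids start at 0.
    key = ends_group.iloc[::-1].cumsum().iloc[::-1]
    key = key.max() - key
    return (
        frame.groupby(key)
        .agg({"word": " ".join, "start": "min", "stop": "max", "speaker": "min"})
        .reset_index(drop=True)
    )


result = group_words(df)
print(result)
```

The end markers are counted from the bottom because a punctuated word closes its own group; a plain top-down cumsum would push that word into the following group instead.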
Try str.extract, the helper keys s1 and s2, and agg:
s = df.word.str.extract(r"([^\w\s'])", expand=False).notna()
s1 = s.cumsum() - s
s2 = df.speaker.diff().ne(0).cumsum()
(df.groupby([s1, s2], sort=False, as_index=False)
.agg({'word': ' '.join, 'start': 'first', 'stop': 'last', 'speaker': 'first'}))
Out[70]:
word start stop speaker
0 but, 2.72 2.85 2
1 that's alright 2.85 3.47 2
2 we'll have to 8.43 9.07 1
3 okay, 9.19 10.01 2
4 sure. 10.02 11.01 2
5 what? 11.02 12.00 1
6 i agree, 12.01 14.00 2
7 but i disagree 14.01 17.00 2
8 that's fine, 17.01 19.00 1
9 however you are 19.01 22.00 1
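To see why the two keys in this second answer work, it can help to print them next to the words. The sketch below uses only the first eight rows of the sample data. It differs from the answer in two small ways that are my own choices: the keys get explicit names (punct_run and speaker_run, both invented here), and the result is flattened with reset_index(drop=True) instead of as_index=False.

```python
import pandas as pd

# First eight rows of the sample data from the question.
df = pd.DataFrame({
    "word": ["but,", "that's", "alright", "we'll", "have", "to", "okay,", "sure."],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2],
})

# 1 where the word contains a character that is neither a word character,
# whitespace, nor an apostrophe; 0 otherwise.
has_punct = (
    df["word"].str.extract(r"([^\w\s'])", expand=False).notna().astype(int)
)
# Subtracting the flag keeps a punctuated word in the run it closes,
# so the counter only moves on for the row after it.
s1 = (has_punct.cumsum() - has_punct).rename("punct_run")
# Increases by one every time the speaker differs from the previous row.
s2 = df["speaker"].diff().ne(0).cumsum().rename("speaker_run")

print(pd.concat([df["word"], s1, s2], axis=1))

result = (
    df.groupby([s1, s2], sort=False)
    .agg({"word": " ".join, "start": "first", "stop": "last", "speaker": "first"})
    .reset_index(drop=True)
)
print(result)
```

A new group starts whenever either key changes: "alright" and "we'll" share a punct_run value but differ in speaker_run, while "okay," and "sure." share a speaker_run value but differ in punct_run.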
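I don't see any reason why "however you are" would not be in the same sentence?
Thanks for catching that. That was my typo. I have edited my example to reflect this.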