Rolling sum on a string column in pandas

I am using Python 3 and pandas version 0.19.2.

My input looks like this:

chat_id    line
1          'Hi.'
1          'Hi, how are you?.'
1          'I'm well, thanks.'
2          'Is it going to rain?.'
2          'No, I don't think so.'

I would like to group by "chat_id" and then do something like a rolling sum on "line" to get the following:

chat_id    line                     conversation
1          'Hi.'                    'Hi.'
1          'Hi, how are you?.'      'Hi. Hi, how are you?.'
1          'I'm well, thanks.'      'Hi. Hi, how are you?. I'm well, thanks.'
2          'Is it going to rain?.'  'Is it going to rain?.'
2          'No, I don't think so.'  'Is it going to rain?. No, I don't think so.'

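I believe df.groupby('chat_id')['line'].cumsum() only works on numeric columns.

I also tried df.groupby(by=['chat_id'], as_index=False)['line'].apply(list) to get a list of all the lines in each full conversation, but I cannot figure out how to unpack that list to build the "rolling sum" style conversation column.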

For me the following works; if you need a separator, add a space:

df['new'] = df.groupby('chat_id')['line'].apply(lambda x: (x + ' ').cumsum().str.strip())
print (df)
   chat_id                   line                                          new
0        1                    Hi.                                          Hi.
1        1      Hi, how are you?.                        Hi. Hi, how are you?.
2        1      I'm well, thanks.      Hi. Hi, how are you?. I'm well, thanks.
3        2  Is it going to rain?.                        Is it going to rain?.
4        2  No, I don't think so.  Is it going to rain?. No, I don't think so.
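The answer above was written against pandas 0.19.2. On more recent releases the result of groupby(...).apply(...) can carry the group key in its index, so assigning it straight back to a column may not align. The following is a minimal sketch that sidesteps this by using transform, which keeps the original index. It assumes the sample data from the question; the helper name running_join is hypothetical, and itertools.accumulate is swapped in for cumsum so the example does not depend on how a given pandas version handles cumsum on strings.

```python
from itertools import accumulate

import pandas as pd

# Sample data from the question (quotes treated as delimiters, not content).
df = pd.DataFrame({
    "chat_id": [1, 1, 1, 2, 2],
    "line": [
        "Hi.",
        "Hi, how are you?.",
        "I'm well, thanks.",
        "Is it going to rain?.",
        "No, I don't think so.",
    ],
})


def running_join(lines, sep=" "):
    """Hypothetical helper: running concatenation of the strings in one group."""
    joined = accumulate(lines, lambda acc, cur: acc + sep + cur)
    # Re-attach the group's own index so the result lines up with the frame.
    return pd.Series(list(joined), index=lines.index)


# transform returns a Series indexed like df, so plain assignment aligns.
df["conversation"] = df.groupby("chat_id")["line"].transform(running_join)
print(df)
```

Because the separator is only placed between lines, no trailing space is produced and the str.strip() step is not needed in this variant.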

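Interesting: cumsum works when called on a Series, but raises an error when called on the groupby object.

For me this results in: ValueError: cannot reindex from a duplicate axis

What is your pandas version? print(pd.show_versions()). I ask because I cannot reproduce your error. I tested with duplicates in the values and with duplicates in the index, and everything works fine in version 0.19.2.

Sorry, you are right. I had to call reset_index() on the df, and then it worked.

If there is a NaN value in the middle of a conversation (for example at index 1), how can I exclude it from the cumsum? Thanks

@TotoLele - One idea: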

df['new'] = df.dropna(subset=['line']).groupby('chat_id')['line'].apply(lambda x: (x + ' ').cumsum().str.strip())
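The dropna idea can be sketched end to end as follows. This is an illustration under assumptions, not the answerer's exact code: the sample frame with a NaN at index 1 is made up for the example, running_join is a hypothetical helper, and transform with itertools.accumulate stands in for apply with cumsum. The reset_index(drop=True) call reflects the earlier comment that the index had to be reset; assigning the result back relies on unique index labels.

```python
from itertools import accumulate

import numpy as np
import pandas as pd

# Illustrative data: the second line of chat 1 is missing.
df = pd.DataFrame({
    "chat_id": [1, 1, 1, 2, 2],
    "line": [
        "Hi.",
        np.nan,
        "I'm well, thanks.",
        "Is it going to rain?.",
        "No, I don't think so.",
    ],
})

# Make sure index labels are unique before assigning results back by label.
df = df.reset_index(drop=True)


def running_join(lines, sep=" "):
    """Hypothetical helper: running concatenation of the strings in one group."""
    joined = accumulate(lines, lambda acc, cur: acc + sep + cur)
    return pd.Series(list(joined), index=lines.index)


# Drop the NaN rows only for the computation; the original frame keeps them.
valid = df.dropna(subset=["line"])

# Assignment aligns on the index, so the dropped row gets NaN in the new column.
df["conversation"] = valid.groupby("chat_id")["line"].transform(running_join)
print(df)
```

Rows whose line is NaN are skipped in the running text and end up with NaN in the new column, while later rows in the same chat continue from the last valid line.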

df['line'] = df['line'].str.strip("'")
df['new'] = df.groupby('chat_id')['line'].apply(lambda x: "'" + (x + ' ').cumsum().str.strip() + "'")
print (df)
   chat_id                   line  \
0        1                    Hi.   
1        1      Hi, how are you?.   
2        1      I'm well, thanks.   
3        2  Is it going to rain?.   
4        2  No, I don't think so.   

                                             new  
0                                          'Hi.'  
1                        'Hi. Hi, how are you?.'  
2      'Hi. Hi, how are you?. I'm well, thanks.'  
3                        'Is it going to rain?.'  
4  'Is it going to rain?. No, I don't think so.'