Python: aggregating text based on clusters of rows


python pandas


I have a pandas DataFrame:

I am trying to get the following result:

I can get part of the way there with this code:

pd.concat([df['text'].reset_index(drop=True), df['text'].shift(-1).reset_index(drop=True)], axis=1)

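However, this does not combine the text based on is_from_me, where the text of each group is joined with newline characters separating the original strings. This is an oversimplified example; more than two rows may be grouped into a single row.

I tried to come up with a simple way to define this grouping, but all I could manage was a complicated for loop that sort of gets the job done in a hacky way. Is there an aggregation function I can write that would do this for me?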
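To make the gap concrete, here is a minimal sketch on a made-up frame (the actual DataFrame is not shown in the question, so the rows and the next_text column name are assumptions). It shows that the shift-based attempt pairs messages row by row and never merges consecutive messages from the same sender.

```python
import pandas as pd

# Hypothetical sample data; the DataFrame from the question is not shown.
df = pd.DataFrame({
    "is_from_me": [0, 1, 1, 0],
    "text": ["Your good", "Okay haha", "Have a good one", "Yea you too"],
})

# Row-by-row pairing: each message sits next to the message that follows it.
pairs = pd.concat(
    [df["text"].reset_index(drop=True),
     df["text"].shift(-1).reset_index(drop=True)],
    axis=1,
    keys=["text", "next_text"],  # hypothetical column names
)

# Rows 1 and 2 come from the same sender, yet they are paired with each other
# instead of being merged into a single "Okay haha\nHave a good one" entry.
print(pairs)
```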

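Use:

# A new group starts whenever is_from_me differs from the previous row
input_ = df.groupby((df.is_from_me != df.is_from_me.shift()).cumsum())['text'].apply(lambda x: '\n'.join(x))
# Pair each merged message with the merged message that follows it
output = input_.shift(-1)
pd.concat([input_, output], axis=1)

Output: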

    text    text
is_from_me      
1   Happy birthday bud!!!   Thanks man!
2   Thanks man! Definitely would've come back had I thought ab...
3   Definitely would've come back had I thought ab...   Your good
4   Your good   Okay haha\nHave a good one
5   Okay haha\nHave a good one  Yea you too. What are you up to?
6   Yea you too. What are you up to?    No hw like I'm doing all day\nJust got up
7   No hw like I'm doing all day\nJust got up   Same here. I went to the football game last...
8   Same here. I went to the football game last...  I think I saw that in your story\nWin?
9   I think I saw that in your story\nWin?  Lost in last second
10  Lost in last second Aw. that sucks\nMeans it was a good game tho?
11  Aw. that sucks\nMeans it was a good game tho?   Really good game. They were on the 1/2 yard li...
12  Really good game. They were on the 1/2 yard li...   Dang
13  Dang    NaN
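The grouping key in this answer can be easier to follow when it is built one step at a time. The sketch below uses a small made-up frame (the rows and the input/output column labels are assumptions, not the asker's data) and names each intermediate result.

```python
import pandas as pd

# Hypothetical sample data; the original DataFrame is not shown in the post.
df = pd.DataFrame({
    "is_from_me": [0, 1, 1, 0, 1],
    "text": ["Your good", "Okay haha", "Have a good one",
             "Yea you too. What are you up to?", "Just got up"],
})

# True wherever the sender differs from the previous row.
# The first row is always True because it is compared with NaN.
changed = df["is_from_me"] != df["is_from_me"].shift()

# Running count of changes: consecutive rows from the same sender share an id.
group_id = changed.cumsum()

# Join the messages of every run with newlines.
merged = df.groupby(group_id)["text"].apply("\n".join)

# Pair each merged message with the merged message that follows it.
pairs = pd.concat([merged, merged.shift(-1)], axis=1, keys=["input", "output"])
print(pairs)
```

Because the key counts changes rather than grouping on the raw is_from_me value, two separate runs from the same sender stay in separate groups.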

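You can use DataFrame.groupby. The output looks ugly, but it should be what you need.

# Group consecutive rows that share the same is_from_me value and collect each group's values into a tuple
a = df.groupby([df.is_from_me.diff().ne(0).cumsum()]).agg(lambda x: tuple(x))
a['output'] = a['text']
# The previous group's text becomes the input for the current group
a['input'] = a.shift()['text']

Output: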

             input  \
is_from_me                                                      
1                                                         NaN   
2                                    (Happy birthday bud!!!,)   
3                                              (Thanks man!,)   
4           (Definitely would've come back had I thought a...   
5                                                (Your good,)   
6                                (Okay haha, Have a good one)   
7                         (Yea you too. What are you up to?,)   
8                 (No hw like I'm doing all day, Just got up)   
9           (Same here. I went to the football game last...,)   
10                   (I think I saw that in your story, Win?)   
11                                     (Lost in last second,)   
12            (Aw, that sucks, Means it was a good game tho?)   
13          (Really good game. They were on the 1/2 yard l...   

                                                       output  
is_from_me                                                     
1                                    (Happy birthday bud!!!,)  
2                                              (Thanks man!,)  
3           (Definitely would've come back had I thought a...  
4                                                (Your good,)  
5                                (Okay haha, Have a good one)  
6                         (Yea you too. What are you up to?,)  
7                 (No hw like I'm doing all day, Just got up)  
8           (Same here. I went to the football game last...,)  
9                    (I think I saw that in your story, Win?)  
10                                     (Lost in last second,)  
11            (Aw, that sucks, Means it was a good game tho?)  
12          (Really good game. They were on the 1/2 yard l...  
13                                                    (Dang,)
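As a rough check, the sketch below compares the diff-based key used here with a shift-based comparison and shows how the tuples can be turned into newline-joined strings. It assumes is_from_me is an integer 0/1 flag, which the post does not confirm, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample data with an integer 0/1 flag (an assumption about the dtype).
df = pd.DataFrame({
    "is_from_me": [1, 0, 0, 1],
    "text": ["Win?", "Lost in last second", "Dang", "Aw"],
})

# Key built with diff(): a non-zero (or NaN) difference marks the start of a new run.
key_diff = df["is_from_me"].diff().ne(0).cumsum()

# Key built with a shift comparison.
key_shift = (df["is_from_me"] != df["is_from_me"].shift()).cumsum()

# On this data both keys label the runs identically.
same_keys = key_diff.equals(key_shift)

# Collect only the text column into tuples, one tuple per run.
as_tuples = df.groupby(key_diff)["text"].apply(tuple)

# Tuples can be converted to newline-joined strings afterwards if preferred.
joined = as_tuples.map("\n".join)
print(same_keys)
print(joined)
```

Selecting the text column before aggregating also avoids collecting is_from_me itself into tuples.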

This doesn't handle the clustering logic I described in the question.

Yes, I was running a speed comparison against the other answer, and the tests took some time to run.

Why use a tuple instead of '\n'.join(x)?

Just a style preference; your approach works too :)