Python: conditional aggregation on DataFrame columns, 'n' rows at a time
I have an input dataframe that contains the following:
NAME TEXT START END
Tim Tim Wagner is a teacher. 10 20.5
Tim He is from Cleveland, Ohio. 20.5 40
Frank Frank is a musician. 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. 62 70
Frank He performed at the Carnegie Hall last year. 70 85
Frank It was fantastic listening to him. 85 90
Frank I really enjoyed 90 93
The desired output dataframe looks like this:
NAME TEXT START END
Tim Tim Wagner is a teacher. He is from Cleveland, Ohio. 10 40
Frank Frank is a musician. 40 50
Tim He like to travel with his family 50 62
Frank He is a performing artist who plays the cello. He performed at the Carnegie Hall last year. 62 85
Frank It was fantastic listening to him. I really enjoyed 85 93
My current code:
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
df.groupby(['NAME', grp], sort=False)[['TEXT','START','END']]\
.agg({'TEXT':lambda x: ' '.join(x), 'START': 'min', 'END':'max'})\
.reset_index().drop('group', axis=1)
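This merges the last 4 rows into a single row. Instead, I want to combine only 2 rows at a time (or, more generally, any n rows), even when 'NAME' has the same value across more rows than that.
Thanks for your help with this.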
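You can group by grp (called blocks below) and use cumcount to number the rows within each group; integer-dividing that count by 2 gives the relative blocks of two rows inside each group: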
blocks = df.NAME.ne(df.NAME.shift()).cumsum()
(df.groupby([blocks, df.groupby(blocks).cumcount()//2])
.agg({'NAME':'first', 'TEXT':' '.join,
'START':'min', 'END':'max'})
)
Output:
NAME TEXT START END
NAME
1 0 Tim Tim Wagner is a teacher. He is from Cleveland,... 10.0 40.0
2 0 Frank Frank is a musician. 40.0 50.0
3 0 Tim He like to travel with his family 50.0 62.0
4 0 Frank He is a performing artist who plays the cello.... 62.0 85.0
1 Frank It was fantastic listening to him. I really en... 85.0 93.0
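The question asks for an arbitrary n rather than a fixed 2, so the same idea can be wrapped in a small helper. This is only a sketch: the function name merge_consecutive and its parameter n are hypothetical, and it assumes the columns are named NAME, TEXT, START and END as in the question. It also drops the two grouping levels so the result has a flat index.

```python
import pandas as pd

def merge_consecutive(df, n=2):
    """Merge up to n consecutive rows that share the same NAME (hypothetical helper)."""
    # Run id: increases by 1 each time NAME differs from the previous row.
    blocks = df["NAME"].ne(df["NAME"].shift()).cumsum().rename("block")
    # Chunk id: position of the row inside its run, integer-divided by n.
    chunks = (df.groupby(blocks).cumcount() // n).rename("chunk")
    return (
        df.groupby([blocks, chunks], sort=False)
          .agg({"NAME": "first", "TEXT": " ".join,
                "START": "min", "END": "max"})
          .reset_index(drop=True)  # discard the block/chunk index levels
    )

# Sample data copied from the question.
df = pd.DataFrame({
    "NAME": ["Tim", "Tim", "Frank", "Tim", "Frank", "Frank", "Frank", "Frank"],
    "TEXT": [
        "Tim Wagner is a teacher.",
        "He is from Cleveland, Ohio.",
        "Frank is a musician.",
        "He like to travel with his family",
        "He is a performing artist who plays the cello.",
        "He performed at the Carnegie Hall last year.",
        "It was fantastic listening to him.",
        "I really enjoyed",
    ],
    "START": [10, 20.5, 40, 50, 62, 70, 85, 90],
    "END": [20.5, 40, 50, 62, 70, 85, 90, 93],
})

result = merge_consecutive(df, n=2)
print(result)
```

With n=2 this should give the five rows shown above. A larger n merges longer runs; for example, n=4 would collapse the final four Frank rows into one, which is what the code in the question does.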