Python pandas: add rows to each group until a condition is met
I have a time series DataFrame with the following structure:
| ID | second | speaker1 | speaker2 | company | ... |
|----|--------|----------|----------|---------|-----|
| A | 1 | 1 | 1 | name1 | |
| A | 2 | 1 | 1 | name1 | |
| A | 3 | 1 | 1 | name1 | |
| B | 1 | 1 | 1 | name2 | |
| B | 2 | 1 | 1 | name2 | |
| B | 3 | 1 | 1 | name2 | |
| B | 4 | 1 | 1 | name2 | |
| C | 1 | 1 | 1 | name3 | |
| C | 2 | 1 | 1 | name3 | |
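*Note: speaker1 and speaker2 can be 0 or 1; I set them all to 1 here for clarity.
I want to add rows to each group until every group has the same number of rows (the target row count is that of the ID with the most rows).
For each new row, I want speaker1 and speaker2 filled with 0, while the other columns keep that ID's existing values.
So the output should be: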
| ID | second | speaker1 | speaker2 | company | ... |
|:--:|:------:|:--------:|:--------:|:-------:|:---:|
| A | 1 | 1 | 1 | name1 | |
| A | 2 | 1 | 1 | name1 | |
| A | 3 | 1 | 1 | name1 | |
| A | 4 | 0 | 0 | name1 | |
| B | 1 | 1 | 1 | name2 | |
| B | 2 | 1 | 1 | name2 | |
| B | 3 | 1 | 1 | name2 | |
| B | 4 | 1 | 1 | name2 | |
| C | 1 | 1 | 1 | name3 | |
| C | 2 | 1 | 1 | name3 | |
| C | 3 | 0 | 0 | name3 | |
| C | 4 | 0 | 0 | name3 | |
So far I have tried groupby and apply, but it is very slow because this DataFrame has many rows and columns:
def add_rows_sec(w):
    'input: dataframe for grouped by ID, output: dataframe with added rows until max call length'
    while w['second'].max() < clean_data['second'].max():  # if duration is less than max duration in full data set
        last_row = w.iloc[-1]
        last_row['second'] += 1
        last_row['speaker1'] = 0
        last_row['speaker2'] = 0
        return w.append(last_row)
    return w

df.groupby('ID').apply(add_rows_sec).reset_index(drop=True)
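Two things stand out in the attempt above: the `return` inside the `while` loop exits after adding a single row, and `DataFrame.append` was removed in pandas 2.0. The following is a minimal sketch of a per-group version that pads all missing rows at once with `pd.concat`. The helper name `pad_group` and the sample data are illustrative assumptions, and it assumes each group is sorted by `second` with integer values.

```python
import pandas as pd

# Sample data mirroring the question's table (assumed integer columns).
df = pd.DataFrame({
    "ID": list("AAABBBBCC"),
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})


def pad_group(group, max_second):
    """Hypothetical helper: return `group` plus zero-filled rows up to `max_second`."""
    last_second = int(group["second"].iloc[-1])
    missing = list(range(last_second + 1, max_second + 1))
    if not missing:
        return group
    # Repeat the last row once per missing second; this keeps column dtypes.
    pad = group.iloc[[-1] * len(missing)].copy()
    pad["second"] = missing
    pad[["speaker1", "speaker2"]] = 0
    return pd.concat([group, pad])


max_second = int(df["second"].max())
out = pd.concat(
    [pad_group(g, max_second) for _, g in df.groupby("ID", sort=False)],
    ignore_index=True,
)
print(out)
```

This still loops over groups in Python, so it fixes correctness rather than speed.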
Is there a way to do this with numpy? Something like:
condition = w['second'].max() < df['second'].max()
choice = pd.Series([w.ID, w.second + 1, 0, 0, w.company...])
df = np.select(condition, choice, default = np.nan)
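Note that `np.select` chooses among values for positions that already exist, so it cannot add rows by itself. A numpy-flavoured version would have to build the padding rows explicitly, for example with `Index.repeat` and `np.repeat`. The sketch below is one way to do that, not a definitive implementation; it assumes integer `second` values and that the last row of each ID holds that ID's largest `second`. The sample data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question's table (assumed integer columns).
df = pd.DataFrame({
    "ID": list("AAABBBBCC"),
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

max_second = df["second"].max()

# Last row of every ID, and how many rows each ID is missing.
last = df.groupby("ID", sort=False).tail(1)
n_missing = (max_second - last["second"]).to_numpy()

# Repeat each last row once per missing second.
pad = last.loc[last.index.repeat(n_missing)].copy()

# Offsets 1..n within each ID, computed without a Python loop.
starts = np.repeat(np.cumsum(n_missing) - n_missing, n_missing)
offsets = np.arange(n_missing.sum()) - starts + 1

pad["second"] = pad["second"].to_numpy() + offsets
pad[["speaker1", "speaker2"]] = 0

out = (
    pd.concat([df, pad])
    .sort_values(["ID", "second"], kind="stable")
    .reset_index(drop=True)
)
print(out)
```

Every other column is copied from the last row of its ID, which matches the requirement that new rows keep that ID's values.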
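Thank you very much for your help.

A different approach with pandas: build every combination of ID and second, outer-merge it with the original DataFrame, then fill the gaps. No groupby(), no loops.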
import pandas as pd

df = pd.DataFrame({"ID":["A","A","A","B","B","B","B","C","C"],"second":["1","2","3","1","2","3","4","1","2"],"speaker1":["1","1","1","1","1","1","1","1","1"],"speaker2":["1","1","1","1","1","1","1","1","1"],"company":["name1","name1","name1","name2","name2","name2","name2","name3","name3"]})
# Cross join every ID with every second via a constant key, then outer-merge the original rows
df2 = pd.DataFrame({"ID":df["ID"].unique()}).assign(foo=1).merge(\
    pd.DataFrame({"second":df["second"].unique()}).assign(foo=1)).drop(columns="foo")\
    .merge(df, on=["ID","second"], how="outer")
# Carry each ID's company forward into its new rows
df2["company"] = df2["company"].ffill()
# The remaining gaps are speaker1 and speaker2 in the new rows
df2.fillna(0)
Output
ID second speaker1 speaker2 company
0 A 1 1 1 name1
1 A 2 1 1 name1
2 A 3 1 1 name1
3 A 4 0 0 name1
4 B 1 1 1 name2
5 B 2 1 1 name2
6 B 3 1 1 name2
7 B 4 1 1 name2
8 C 1 1 1 name3
9 C 2 1 1 name3
10 C 3 0 0 name3
11 C 4 0 0 name3
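The same idea can also be written with `MultiIndex.from_product` and `reindex` instead of two merges. This is a sketch under stated assumptions rather than part of the original answer: it assumes integer `second` values, unique (ID, second) pairs, and that every second from 1 up to the maximum should exist for each ID. It does use a `groupby` for the forward fill so that values cannot leak from one ID into the next.

```python
import pandas as pd

# Sample data mirroring the question's table (assumed integer columns).
df = pd.DataFrame({
    "ID": list("AAABBBBCC"),
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

# Every (ID, second) pair that should exist in the result.
full_index = pd.MultiIndex.from_product(
    [df["ID"].unique(), range(1, df["second"].max() + 1)],
    names=["ID", "second"],
)

# Reindexing inserts the missing pairs as rows of NaN.
out = df.set_index(["ID", "second"]).reindex(full_index)

# New rows get 0 for the speaker columns.
speakers = ["speaker1", "speaker2"]
out[speakers] = out[speakers].fillna(0).astype(int)

# Other columns inherit the value already present for that ID.
out["company"] = out.groupby(level="ID")["company"].ffill()

out = out.reset_index()
print(out)
```

With more than one descriptive column, the forward fill can be applied to all of them at once by selecting a list of columns instead of just `company`.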
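What does this mean? "I want to add rows for each unique ID until each ID has as many rows as the ID with the most rows. Basically, just add rows to each group until every group has the same number of rows (the target row count is that of the ID with the most rows)."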