Python pandas: add rows to each group until a condition is met

I have a time-series DataFrame with the following structure:

| ID | second | speaker1 | speaker2 | company | ... |
|----|--------|----------|----------|---------|-----|
|  A |    1   |     1    |     1    |  name1  |     |
|  A |    2   |     1    |     1    |  name1  |     |
|  A |    3   |     1    |     1    |  name1  |     |
|  B |    1   |     1    |     1    |  name2  |     |
|  B |    2   |     1    |     1    |  name2  |     |
|  B |    3   |     1    |     1    |  name2  |     |
|  B |    4   |     1    |     1    |  name2  |     |
|  C |    1   |     1    |     1    |  name3  |     |
|  C |    2   |     1    |     1    |  name3  |     |
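*Note: speaker1 and speaker2 can each be 0 or 1; for clarity I have set them all to 1 here.

I want to add rows to each group until every group has the same number of rows (where that number is the row count of the ID with the most rows).

For each new row, I want to fill the speaker1 and speaker2 columns with 0 while keeping that ID's values in the other columns unchanged.

So the output should be: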

| ID | second | speaker1 | speaker2 | company | ... |
|:--:|:------:|:--------:|:--------:|:-------:|:---:|
|  A |    1   |     1    |     1    |  name1  |     |
|  A |    2   |     1    |     1    |  name1  |     |
|  A |    3   |     1    |     1    |  name1  |     |
|  A |    4   |     0    |     0    |  name1  |     |
|  B |    1   |     1    |     1    |  name2  |     |
|  B |    2   |     1    |     1    |  name2  |     |
|  B |    3   |     1    |     1    |  name2  |     |
|  B |    4   |     1    |     1    |  name2  |     |
|  C |    1   |     1    |     1    |  name3  |     |
|  C |    2   |     1    |     1    |  name3  |     |
|  C |    3   |     0    |     0    |  name3  |     |
|  C |    4   |     0    |     0    |  name3  |     |
So far I have tried groupby and apply, but it is very slow because this DataFrame has many rows and columns.

def add_rows_sec(w):
    'input: dataframe for one ID group, output: dataframe with rows added up to the max call length'

    max_second = df['second'].max()  # longest duration in the full data set
    while w['second'].max() < max_second:  # keep padding until this group is as long
        last_row = w.iloc[-1].copy()  # copy so the group itself is not modified
        last_row['second'] += 1
        last_row['speaker1'] = 0
        last_row['speaker2'] = 0
        w = pd.concat([w, last_row.to_frame().T])  # DataFrame.append was removed in pandas 2.0
    return w

df.groupby('ID').apply(add_rows_sec).reset_index(drop=True)
Is there a way to do this with numpy? Something like:

condition = w['second'].max() < df['second'].max()
choice = pd.Series([w.ID, w.second + 1, 0, 0, w.company...])
df = np.select(condition, choice, default = np.nan)

Any help would be greatly appreciated.

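A different approach with pandas:

  • construct a DataFrame that is the Cartesian product of ID and second
  • outer join it back to the original DataFrame
  • fill in the missing values according to your spec

No groupby(), no loop.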

    import pandas as pd

    df = pd.DataFrame({"ID":["A","A","A","B","B","B","B","C","C"],"second":["1","2","3","1","2","3","4","1","2"],"speaker1":["1","1","1","1","1","1","1","1","1"],"speaker2":["1","1","1","1","1","1","1","1","1"],"company":["name1","name1","name1","name2","name2","name2","name2","name3","name3"]})
    
    # Cartesian product of the unique IDs and seconds, outer-joined back to the original rows
    df2 = pd.DataFrame({"ID":df["ID"].unique()}).assign(foo=1).merge(\
        pd.DataFrame({"second":df["second"].unique()}).assign(foo=1)).drop(columns="foo")\
        .merge(df, on=["ID","second"], how="outer")
    
    # new rows copy the company from the row above; the speaker columns become 0
    df2["company"] = df2["company"].ffill()
    df2 = df2.fillna(0)
    df2
    
Output

        ID  second  speaker1    speaker2    company
    0   A   1   1   1   name1
    1   A   2   1   1   name1
    2   A   3   1   1   name1
    3   A   4   0   0   name1
    4   B   1   1   1   name2
    5   B   2   1   1   name2
    6   B   3   1   1   name2
    7   B   4   1   1   name2
    8   C   1   1   1   name3
    9   C   2   1   1   name3
    10  C   3   0   0   name3
    11  C   4   0   0   name3
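
The steps above can also be sketched with `MultiIndex.from_product` and `reindex` in place of the helper `foo` column. This is only a sketch under stated assumptions: the columns hold integers, as in the question, and not strings; each (ID, second) pair occurs at most once; and the names `full_index`, `padded`, `speaker_cols` and `other_cols` are made up for this illustration. Unlike the answer above, it forward-fills per ID with `groupby("ID").ffill()` so that a value can never leak from one ID into the next.

```python
import pandas as pd

# Same sample data as above, but with integer columns (an assumption of this sketch).
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

# Every (ID, second) pair that should exist in the padded result.
full_index = pd.MultiIndex.from_product(
    [df["ID"].unique(), range(1, df["second"].max() + 1)],
    names=["ID", "second"],
)

# Reindexing creates the missing pairs as all-NaN rows.
padded = df.set_index(["ID", "second"]).reindex(full_index).reset_index()

# Speaker columns of the new rows become 0.
speaker_cols = ["speaker1", "speaker2"]
padded[speaker_cols] = padded[speaker_cols].fillna(0).astype(int)

# Every other column copies the value already known for that ID.
other_cols = [c for c in padded.columns if c not in ("ID", "second", *speaker_cols)]
padded[other_cols] = padded.groupby("ID")[other_cols].ffill()

print(padded)
```

Because `other_cols` is derived from the frame itself, any additional columns (the `...` in the question) would be forward-filled the same way as `company`.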
    
    

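What does this mean? `I want to add rows for each unique ID until each ID has as many rows as the ID with the most rows.`

Basically, just add rows to each group until every group has the same number of rows (where that number is the row count of the ID with the most rows).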
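
For the numpy variant the question asks about, one possible sketch of "pad every group to the same row count" uses `np.unique` and `np.repeat`. It is not taken from the answer above. It assumes integer columns and that `second` counts 1, 2, 3, ... without gaps inside each ID, so a group's row count equals its largest `second`. The names `extra` and `padded` are made up for this illustration. The remaining list comprehension loops once per ID, not once per row.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with integer columns (an assumption of this sketch).
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

# Position of each ID's first row and the number of rows per ID.
ids, first_pos, counts = np.unique(
    df["ID"].to_numpy(), return_index=True, return_counts=True
)
max_len = counts.max()
missing = max_len - counts  # rows to add per ID

# Use each ID's first row as a template, repeated once per missing row.
extra = df.iloc[np.repeat(first_pos, missing)].copy()

# Continue each ID's counter: count + 1, count + 2, ..., max_len.
extra["second"] = np.concatenate([np.arange(c + 1, max_len + 1) for c in counts])
extra[["speaker1", "speaker2"]] = 0

padded = pd.concat([df, extra], ignore_index=True).sort_values(
    ["ID", "second"], kind="stable", ignore_index=True
)
print(padded)
```

Copying a template row keeps every non-speaker column at that ID's value without a separate fill step, but only if those columns really are constant within an ID, as `company` is in the example.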