How do I create bins with an equal number of observations in a DataFrame? (Python, pandas)


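I'm trying to create a column in a DataFrame that describes the group, or bin, that each observation belongs to. The idea is to sort the DataFrame by some column and then build another column that indicates which bin each observation falls into. If I want deciles, I should be able to tell a function that I want 10 equal (or nearly equal) groups.

I tried one approach, but it only gave me tuples of the bins' upper and lower limits. I want 1, 2, 3, 4, ... and so on. Take the following example: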

import numpy as np
import pandas as pd

x = [1,2,3,4,5,6,7,8,5,45,64545,65,6456,564]
y = np.random.rand(len(x))

df_dict = {'x': x, 'y': y}
df = pd.DataFrame(df_dict)

This gives a df with 14 observations. How can I get 5 equal bins?

The expected result is:

        x         y  group
0       1  0.926273      1
1       2  0.678101      1
2       3  0.636875      1
3       4  0.802590      2
4       5  0.494553      2
5       6  0.874876      2
6       7  0.607902      3
7       8  0.028737      3
8       5  0.493545      3
9      45  0.498140      4
10  64545  0.938377      4
11     65  0.613015      4
12   6456  0.288266      5
13    564  0.917817      5

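You can split evenly with np.array_split, assign the groups, and then recombine with pd.concat:

bins = 5
splits = np.array_split(df, bins)
for i in range(len(splits)):
    splits[i]['group'] = i + 1
df = pd.concat(splits)

Or as a one-liner using assign:

df = pd.concat([d.assign(group=i+1) for i, d in enumerate(np.array_split(df, bins))])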
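Calling np.array_split directly on a DataFrame may emit warnings on some newer pandas versions, and assigning into the pieces can trigger a SettingWithCopyWarning. As a minimal sketch under that assumption, the same idea can be applied to the row positions instead of the DataFrame itself. The name `chunks` is only illustrative, and the random y column is omitted for brevity.

```python
import numpy as np
import pandas as pd

x = [1, 2, 3, 4, 5, 6, 7, 8, 5, 45, 64545, 65, 6456, 564]
df = pd.DataFrame({'x': x})

bins = 5
# Split the row positions 0..n-1 (not the DataFrame) into near-equal chunks.
chunks = np.array_split(np.arange(len(df)), bins)

# Fill a group array: every position in chunk i gets label i + 1.
group = np.empty(len(df), dtype=int)
for i, positions in enumerate(chunks):
    group[positions] = i + 1

df['group'] = group
print(df)
```

This keeps the original row order and index, so no pd.concat step is needed.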

Group the rows in blocks of 3, then use ngroup to number the groups:

df['group'] = df.groupby(np.arange(len(df.index)) // 3).ngroup() + 1



        x         y  group
0       1  0.548801      1
1       2  0.096620      1
2       3  0.713771      1
3       4  0.922987      2
4       5  0.283689      2
5       6  0.807755      2
6       7  0.592864      3
7       8  0.670315      3
8       5  0.034549      3
9      45  0.355274      4
10  64545  0.239373      4
11     65  0.156208      4
12   6456  0.419990      5
13    564  0.248278      5
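The block size of 3 in the groupby snippet fits 14 rows and 5 groups specifically. As a hedged sketch of one way to avoid hardcoding it, the group label can be computed from the row position and the requested number of bins with integer arithmetic. This is an illustrative alternative, not part of the answer above, and the random y column is omitted for brevity.

```python
import numpy as np
import pandas as pd

x = [1, 2, 3, 4, 5, 6, 7, 8, 5, 45, 64545, 65, 6456, 564]
df = pd.DataFrame({'x': x})

bins = 5          # requested number of groups (assumed to be <= len(df))
n = len(df)

# Row i is assigned to group floor(i * bins / n) + 1.
# Group sizes differ by at most one row.
df['group'] = np.arange(n) * bins // n + 1
print(df['group'].tolist())
```

Note that with this formula the larger groups are not guaranteed to come first, unlike np.array_split, although the sizes still differ by at most one.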

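Another option is to generate the list of group indices from near_split:

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

bins = 5
df['group'] = [i + 1 for i, v in enumerate(near_split(len(df), bins)) for _ in range(v)]
print(df)

Output: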

        x         y  group
0       1  0.313614      1
1       2  0.765079      1
2       3  0.153851      1
3       4  0.792098      2
4       5  0.123700      2
5       6  0.239107      2
6       7  0.133665      3
7       8  0.979318      3
8       5  0.781948      3
9      45  0.264344      4
10  64545  0.495561      4
11     65  0.504734      4
12   6456  0.766627      5
13    564  0.428423      5
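To make the role of near_split concrete, here is a small sketch that prints only the bin sizes it returns. The variable names `sizes_14` and `sizes_10` are illustrative.

```python
def near_split(base, num_bins):
    # quotient rows per bin, with the first `remainder` bins getting one extra
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

sizes_14 = near_split(14, 5)  # 14 rows into 5 bins
sizes_10 = near_split(10, 5)  # divides evenly
print(sizes_14)  # [3, 3, 3, 3, 2]
print(sizes_10)  # [2, 2, 2, 2, 2]
```

The list comprehension in the answer then repeats each group number as many times as the corresponding size.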

Here is a way to "manually" compute the extent of each bin, based on the requested number of bins:

bins = 5

l = len(df)
minbinlen = l // bins
remainder = l % bins
repeats = np.repeat(minbinlen, bins)
repeats[:remainder] += 1
group = np.repeat(range(bins), repeats) + 1

df['group'] = group

Result:

        x         y  group
0       1  0.205168      1
1       2  0.105466      1
2       3  0.545794      1
3       4  0.639346      2
4       5  0.758056      2
5       6  0.982090      2
6       7  0.942849      3
7       8  0.284520      3
8       5  0.491151      3
9      45  0.731265      4
10  64545  0.072668      4
11     65  0.601416      4
12   6456  0.239454      5
13    564  0.345006      5

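This appears to follow the splitting logic of np.array_split (i.e., it tries to split the bins evenly, and if that is not possible, it adds the extra rows to the earlier bins).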
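As a quick sanity check of that claim, the following sketch compares the bin sizes from the repeat approach with the section sizes produced by np.array_split for a few row counts. The helper names `repeat_sizes` and `array_split_sizes` are hypothetical and exist only for this comparison.

```python
import numpy as np

def repeat_sizes(n, bins):
    # Same arithmetic as the repeat-based answer: base size, plus one for the first bins.
    repeats = np.repeat(n // bins, bins)
    repeats[: n % bins] += 1
    return repeats.tolist()

def array_split_sizes(n, bins):
    # Lengths of the sections np.array_split produces for n positions.
    return [len(part) for part in np.array_split(np.arange(n), bins)]

for n in (5, 14, 17, 100):
    assert repeat_sizes(n, 5) == array_split_sizes(n, 5)

print(repeat_sizes(14, 5))  # [3, 3, 3, 3, 2]
```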

Although the code is less concise, it does not use any loops, so in theory it should be faster on larger amounts of data.

Just out of curiosity, I'm leaving this perfplot timing test here.

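Nice, +1, much easier! Thank you very much.

Is there any logic for choosing the number of bins, rather than telling Python how many to divide by? This could be dynamic, and I don't know how many rows I will get at any given time.

@Jordan, not as far as I know. It would be much easier if there were a distinct group at each observation or signal.

I was curious about the timings too, so thanks for the demo! And nice repeat solution, +1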

import numpy as np
import pandas as pd
import perfplot

def make_data(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    df_dict = {'x': x, 'y': y}
    df = pd.DataFrame(df_dict)

    return df

def repeat(df, bins=5):
    l = len(df)
    minbinlen = l // bins
    remainder = l % bins
    repeats = np.repeat(minbinlen, bins)
    repeats[:remainder] += 1
    group = np.repeat(range(bins), repeats) + 1

    return group

def near_split(base, num_bins):
    quotient, remainder = divmod(base, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

def array_split(df, bins=5):
    splits = np.array_split(df, bins)

    for i in range(len(splits)):
        splits[i]['group'] = i + 1

    return pd.concat(splits)

perfplot.show(
    setup = lambda n : make_data(n),
    kernels = [
        lambda df: repeat(df),
        lambda df: [i + 1 for i, v in enumerate(near_split(len(df), 5)) for _ in range(v)],
        lambda df: df.groupby(np.arange(len(df.index)) // 3).ngroup() + 1,
        lambda df: array_split(df)
        ],
    labels=['repeat', 'near_split', 'groupby', 'array_split'],
    n_range=[2 ** k for k in range(25)],
    equality_check=None)