Python: bin data and return bin midpoints (possibly using pandas.cut and qcut)


python pandas


Can I make the pandas cut/qcut functions return bin endpoints or bin midpoints instead of bin label strings?

Currently

pd.cut(pd.Series(np.arange(11)), bins = 5)

0     (-0.01, 2]
1     (-0.01, 2]
2     (-0.01, 2]
3         (2, 4]
4         (2, 4]
5         (4, 6]
6         (4, 6]
7         (6, 8]
8         (6, 8]
9        (8, 10]
10       (8, 10]
dtype: category

returns category/string values. What I want is numeric values representing the edges or midpoints of the bins.

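There is work in progress on an "IntervalIndex" that will make this kind of operation very simple.

For now, though, you can pass the retbins argument to get the bin edges back and compute the midpoints yourself: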

In [8]: s, bins = pd.cut(pd.Series(np.arange(11)), bins = 5, retbins=True)

In [11]: mid = [(a + b) /2 for a,b in zip(bins[:-1], bins[1:])]

In [13]: s.cat.rename_categories(mid)
Out[13]: 
0     0.995
1     0.995
2     0.995
3     3.000
4     3.000
5     5.000
6     5.000
7     7.000
8     7.000
9     9.000
10    9.000
dtype: category
Categories (5, float64): [0.995 < 3.000 < 5.000 < 7.000 < 9.000]
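The result above is still of dtype category. If plain floats are wanted, one possible follow-up is to cast after renaming the categories. This is only a sketch built on the same retbins approach; the variable names (values, binned, edges, midpoints, numeric_mid) are made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(11))

# retbins=True also returns the array of bin edges that cut used.
binned, edges = pd.cut(values, bins=5, retbins=True)

# Midpoint of each bin, computed from consecutive edges.
midpoints = (edges[:-1] + edges[1:]) / 2

# Relabel each category with its midpoint, then cast to float so the
# result is an ordinary numeric Series instead of a categorical one.
numeric_mid = binned.cat.rename_categories(midpoints).astype(float)
print(numeric_mid.tolist())
```

The same pattern should carry over to pd.qcut, which also accepts retbins=True.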


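I know this is an old post, but I'll take the liberty of answering anyway.

You can now use left and right to access the endpoints of the categorical intervals: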

s = pd.cut(pd.Series(np.arange(11)), bins = 5)

mid = [(a.left + a.right)/2 for a in s]
Out[34]: [0.995, 0.995, 0.995, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]

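Because the intervals are open on the left and closed on the right, the "first" interval (the one that should start at 0) actually starts at -0.01. To get the midpoint using 0 as the left value, you can do the following: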

mid_alt = [(a.left + a.right)/2 if a.left != -0.01 else a.right/2 for a in s]
Out[35]: [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]

Alternatively, you can make the intervals closed on the left and open on the right:

t = pd.cut(pd.Series(np.arange(11)), bins = 5, right=False)
Out[38]: 
0       [0.0, 2.0)
1       [0.0, 2.0)
2       [2.0, 4.0)
3       [2.0, 4.0)
4       [4.0, 6.0)
5       [4.0, 6.0)
6       [6.0, 8.0)
7       [6.0, 8.0)
8     [8.0, 10.01)
9     [8.0, 10.01)
10    [8.0, 10.01)

But, as you can see, you then run into the same problem in the last interval.
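One way to sidestep the -0.01 / 10.01 adjustment altogether is to supply the bin edges yourself and ask cut for integer bin indices with labels=False. The sketch below assumes five equal-width bins over [0, 10]; the names edges, centres, codes and mid_exact are illustrative only.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(11))

# Explicit, unadjusted edges: 0, 2, 4, 6, 8, 10.
edges = np.linspace(0, 10, 6)
centres = (edges[:-1] + edges[1:]) / 2

# labels=False returns the integer index of each value's bin;
# include_lowest=True keeps the value 0 inside the first bin.
codes = pd.cut(values, bins=edges, include_lowest=True, labels=False)

# Look up the centre of each value's bin by position.
mid_exact = pd.Series(centres[codes.to_numpy()], index=values.index)
print(mid_exact.tolist())
```

This assumes every value falls inside the supplied edges; values outside the range would come back as NaN codes and would need handling before the lookup.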

I noticed that each category (an Interval) has a mid attribute, so you can compute the midpoints with apply:
In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({"val":np.arange(11)})
   ...: df["bins"] = pd.cut(df["val"], bins = 5)
   ...: df["bin_centres"] = df["bins"].apply(lambda x: x.mid)
   ...: df
Out[1]:
    val          bins bin_centres
0     0  (-0.01, 2.0]       0.995
1     1  (-0.01, 2.0]       0.995
2     2  (-0.01, 2.0]       0.995
3     3    (2.0, 4.0]       3.000
4     4    (2.0, 4.0]       3.000
5     5    (4.0, 6.0]       5.000
6     6    (4.0, 6.0]       5.000
7     7    (6.0, 8.0]       7.000
8     8    (6.0, 8.0]       7.000
9     9   (8.0, 10.0]       9.000
10   10   (8.0, 10.0]       9.000
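For larger data, calling a lambda once per row may be slow. A possible vectorised variant reads mid from the categories (an IntervalIndex) once and looks each row up by its category code. This is a sketch, not a tested recipe for every pandas version; interval_midpoints is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def interval_midpoints(binned):
    """Float midpoints for a categorical Series produced by cut or qcut."""
    categories = binned.cat.categories            # an IntervalIndex
    mids = np.asarray(categories.mid, dtype=float)
    codes = binned.cat.codes.to_numpy()           # -1 marks missing values
    # Rows with code -1 (NaN bins) get NaN instead of a midpoint.
    out = np.where(codes >= 0, mids[codes], np.nan)
    return pd.Series(out, index=binned.index)

values = pd.Series(np.arange(11))
cut_mid = interval_midpoints(pd.cut(values, bins=5))
qcut_mid = interval_midpoints(pd.qcut(values, q=4))
print(cut_mid.tolist())
print(qcut_mid.tolist())
```

Because it only relies on the categories being intervals, the same helper works for both cut and qcut output.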