Python: How to create a personalized bucket column with the Pandas cut() command and auto-generated categories and bins?


python pandas dataframe


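I have a dataframe with several columns. One of them is year. From this column I want to create a new one with categorical values (I guess the term is buckets), where the buckets are generated automatically. The result should look like this: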

year_gr         year    other_cols
A (1909 - 1917) 1911    abc
B (1921 - 1930) 1923    def
C (1932 - 1941) 1935    ghi

I got something close to it with the following:

year_gr = pd.cut(df.year, 10, labels=[
   'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
df['year_gr'] = year_gr

df.head()

year_gr year    other_cols
A       1911    abc
B       1923    def
C       1935    ghi

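But how can I attach the ranges that pd.cut generates automatically to my year_gr variable? I saw that we can add the parameter retbins=True to the cut command to extract the bins, but I did not manage to make use of it.

Thanks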

pd.cut produces a categorical object filled with Interval objects. We use their .left and .right attributes to build the requested strings.

    import string

    import numpy as np
    import pandas as pd

    # Test data:
    df = pd.DataFrame({"year": [1911, 1923, 1935, 1911],
                       "other_cols": ["abc", "def", "ghi", "jkl"]})
    # Out:
    #    year other_cols
    # 0  1911        abc
    # 1  1923        def
    # 2  1935        ghi
    # 3  1911        jkl

    # Create the intervals (10 equal-width bins):
    cats = pd.cut(df.year, 10)

    # cats.dtypes.categories
    # Out: IntervalIndex([(1910.976, 1913.4], (1913.4, 1915.8], (1915.8, 1918.2],...

    # Char generator, consumed once per category (not once per row):
    gchar = (ch for ch in string.ascii_uppercase)
    dlbls = {iv: next(gchar) for iv in cats.dtypes.categories}  # EDIT1
    # Get the intervals and convert them to the specified strings:
    df["year_gr"] = [f"{dlbls[iv]} ({int(np.round(iv.left))} - {int(np.round(iv.right))})" for iv in cats]  # EDIT1
    # Out (letters follow the position of the bin among the 10 categories):
    #    year other_cols          year_gr
    # 0  1911        abc  A (1911 - 1913)
    # 1  1923        def  E (1921 - 1923)
    # 2  1935        ghi  J (1933 - 1935)
    # 3  1911        jkl  A (1911 - 1913)

    # Align the columns:
    df = df.reindex(["year_gr", "year", "other_cols"], axis=1)
    # Out:
    #            year_gr  year other_cols
    # 0  A (1911 - 1913)  1911        abc
    # 1  E (1921 - 1923)  1923        def
    # 2  J (1933 - 1935)  1935        ghi
    # 3  A (1911 - 1913)  1911        jkl
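Since the question mentions retbins=True, here is a minimal alternative sketch that uses it. It is not part of the answer above, only one possible way to do the same thing: a first pd.cut call returns the auto-generated bin edges, the labels are built from those edges, and a second pd.cut call reuses the same edges with those labels. The names edges and labels are chosen here for illustration, and the sketch assumes at most 26 bins, because it takes the letters from string.ascii_uppercase.

```python
import string

import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [1911, 1923, 1935, 1911],
                   "other_cols": ["abc", "def", "ghi", "jkl"]})

# First pass: only used to learn the bin edges that pd.cut picks on its own.
_, edges = pd.cut(df["year"], bins=10, retbins=True)

# One label per bin: a letter plus the rounded edges of that bin.
# zip() stops at 26 letters, so this assumes no more than 26 bins.
labels = [
    f"{letter} ({int(np.round(lo))} - {int(np.round(hi))})"
    for letter, lo, hi in zip(string.ascii_uppercase, edges[:-1], edges[1:])
]

# Second pass: reuse the same edges so every row gets the label of its bin.
df["year_gr"] = pd.cut(df["year"], bins=edges, labels=labels)
print(df[["year_gr", "year", "other_cols"]])
```

Because the labels are tied to the bins rather than to the rows, rows that fall in the same bin always get the same letter, and the result stays a categorical column ordered by bin.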

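Thanks, this almost worked, but when I have several years in the same group, the letters I get are different. (It also exhausts gchar inside the for, throwing a StopIteration: error.)

---> 8 df["year_gr"]=[ f"{next(gchar)} ({int(np.round(iv.left))} - {int(np.round(iv.right))})" for iv in cats ]

It works without gchar. I guess we need to add an if condition there that assigns the letter to one of the 10 categories. You can reproduce the category problem with:

df=pd.DataFrame({"year":[1911191219231935193619462012001719841986],"other_cols":["abc","def","ghi","asdf","asdf","asdf","asdf","asdf","asdf","asdf","asdf","asdf","asdf","asdf"]})

and, for the StopIteration error, just triple the number of rows.

@Bruno Ambrozio You are right! I edited the code above: it now creates a dictionary ('dlbls') and uses it to build the "year_gr" column. It looks too complicated, but I have no other idea.

Nice! You nailed it. Thank you! :)

df["year_gr"] count: 6168
df["year"].unique count: [H (1985 - 1995), D (1941 - 1952), B (1920 - 1931), G (1974 - 1985), A (1909 - 1920), F (1963 - 1974), E (1952 - 1963), C (1931 - 1941), J (2006 - 2017), I (1995 - 2006)]
Categories (10, object): [H (1985 - 1995), D (1941 - 1952), B (1920 - 1931), G (1974 - 1985), ..., E (1952 - 1963), C (1931 - 1941), J (2006 - 2017), I (1995 - 2006)]

By the way, do you know how to find the best number of bins? I used 10 because df["birth year"].max() - df["birth year"].min() gives 108, so I arbitrarily decided on 10 groups. I wonder whether there is a smarter way to do this, since my dataset may be unbalanced. Anyway, maybe that is a question for another post :)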
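On the closing question about choosing the number of bins, the thread gives no answer, so the following is only a hedged sketch of two common options, not a recommendation from the original posters. The sample years are made up for illustration and are not the asker's data. Option 1 lets NumPy choose equal-width edges with np.histogram_bin_edges(bins="auto"). Option 2 uses pd.qcut, which places the edges at quantiles so that each bin holds roughly the same number of rows; this may suit an unbalanced dataset better.

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately skewed sample (not the asker's data).
years = pd.Series([1909, 1950, 1960, 1970, 1975, 1980, 1982, 1985,
                   1988, 1990, 1992, 1995, 2000, 2010, 2017])

# Option 1: equal-width bins, with the number of bins chosen by NumPy's "auto" rule.
edges = np.histogram_bin_edges(years, bins="auto")
equal_width = pd.cut(years, bins=edges, include_lowest=True)

# Option 2: quantile-based bins, so each bin holds about the same number of rows.
equal_count = pd.qcut(years, q=5)

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```

Either result can be passed through the same labelling step as above, since both are categoricals of Interval objects with .left and .right attributes.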