Descriptive statistics for multi-level categorical data in Python


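Below is an example df with three columns, each containing multi-level categorical data. I would like to compute some descriptive statistics for each level of each column across the three columns, for example the number of people in each age group within each location and status, including counts, proportions, and standard deviations (I think what I actually need here is a confidence interval). But I don't know how to do this elegantly. Any suggestions would be much appreciated.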

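import random
from datetime import date

import pandas as pd

# 50 random birth years between 1900 and 2000 (inclusive)
birth_year = pd.DataFrame([random.randint(1900, 2000) for _ in range(50)], columns=['year'])

def age(df, col):
    # Age in whole years, computed from the current calendar year
    today = date.today()
    age = today.year - df[col]
    bins = [18, 30, 40, 50, 60, 70, 120]
    labs = ['-30', '30-39', '40-49', '50-59', '60-69', '70+']
    # pd.cut uses right-closed intervals by default, so age 40 lands in '30-39';
    # ages outside (18, 120] become NaN
    group = pd.cut(age, bins, labels=labs)
    return group

birth_year.loc[:, 'age_bin'] = age(birth_year, 'year')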


def Rand(start, end, num):
    # num random integers in [start, end]; must be defined before its first use
    out = []
    for x in range(num):
        out.append(random.randint(start, end))
    return out


location = pd.DataFrame(Rand(1, 6, 50), columns=['location'])

def label_loc(row):
    if row['location'] == 1:
        return 'england'
    if row['location'] == 2:
        return 'ireland'
    if row['location'] == 3:
        return 'scotland'
    if row['location'] == 4:
        return 'wales'
    if row['location'] == 5:
        return 'jersey'
    if row['location'] == 6:
        return 'gurnsey'
    return 'Other'

# apply(..., axis=1) returns a Series of labels, one per row
location = location.apply(lambda row: label_loc(row), axis=1)


status = pd.DataFrame((Rand(1, 6, 50)), columns = ['status'])

def label_stat (row):
    if row['status'] == 1 :
        return 'married'
    if row['status'] == 2 :
        return 'divorced'
    if row['status'] == 3:
        return 'single'
    if row['status']  == 4:
        return 'window'
    return 'Other'

status = status.apply(lambda row: label_stat(row), axis=1)


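# Column names follow the zipped order: age bin, status, location
df = pd.DataFrame(list(zip(birth_year["age_bin"], status, location)), columns=['age_bin', 'status', 'location'])

(For a slightly rewritten version of this setup, see the linked example.)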
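
The rewritten setup itself is not shown here. As a rough sketch only, the following builds a df with the columns the snippets below rely on (year, age_bin, location, status). The seed, the fixed reference_year, the upper bin edge of 130, and right=False are all assumptions of this sketch, not taken from the linked version; the status label 'window' is spelled as in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
n = 50

df = pd.DataFrame({
    "year": rng.integers(1900, 2001, size=n),  # birth years 1900..2000
    "location": rng.choice(["england", "ireland", "scotland", "wales", "jersey"], size=n),
    "status": rng.choice(["married", "divorced", "single", "window", "Other"], size=n),
})

# Hypothetical fixed reference year instead of date.today(), so results are stable
reference_year = 2021
bins = [18, 30, 40, 50, 60, 70, 130]
labels = ["-30", "30-39", "40-49", "50-59", "60-69", "70+"]

# right=False makes the intervals [18, 30), [30, 40), ... so they match the labels
df["age_bin"] = pd.cut(reference_year - df["year"], bins, labels=labels, right=False)

print(df.head())
```

Keeping year as a numeric column next to the categorical age_bin is what allows both the mean/std aggregation and the category counts further down.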

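Let's take your example:

the number of people in each age group within each location and status

If you have a continuous variable, such as year, you can simply tell groupby().agg() which summary statistics you want: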

print(df.groupby(['location', 'status'])['year'].agg(['mean', 'std']))

                          mean        std
location status                          
england  Other     1961.000000  16.792856
         divorced  1934.666667  30.270998
         married   1917.000000        NaN
         single    1907.000000        NaN
         window    1962.600000  34.011763
ireland  Other     1982.000000        NaN
         divorced  1949.750000  37.303932
         married   1991.000000        NaN
         single    1986.500000   2.121320
         window    1965.500000   3.535534
jersey   Other     1939.800000  26.204961
         divorced  1984.000000        NaN
         married   1986.000000        NaN
         single    1942.500000  54.447222
scotland Other     1942.666667  12.701706
         divorced  1946.000000  49.497475
         married   1914.000000        NaN
         single    1968.000000        NaN
         window    1933.500000  24.748737
wales    Other     1950.666667  39.526363
         divorced  1978.000000        NaN
         married   1959.000000  52.325902
         single    1929.000000        NaN
         window    1990.000000        NaN
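
The NaN entries in the std column are groups that contain a single row: the sample standard deviation (ddof=1, the pandas default) is undefined for one observation. A small sketch with made-up data that adds count to the aggregation makes this visible:

```python
import pandas as pd

# Toy data (made up): one group with two rows, one group with a single row
df = pd.DataFrame({
    "location": ["england", "england", "wales"],
    "status": ["single", "single", "single"],
    "year": [1950, 1960, 1990],
})

# count shows the group size next to mean and std
stats = df.groupby(["location", "status"])["year"].agg(["count", "mean", "std"])
print(stats)
```

Here the england/single group has two rows and a finite std, while wales/single has one row and a NaN std.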
For categorical values, you can count them with value_counts(), which adds an extra index level (which you can unstack):
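
A minimal sketch of this step, assuming that grouped_age_bin and counts (the names used in the later snippets) are the grouped age_bin column and its unstacked value counts. The toy data is made up, and age_bin is a plain string column here rather than a categorical:

```python
import pandas as pd

# Toy data (made up) with the same column names as the example df
df = pd.DataFrame({
    "location": ["england", "england", "england", "wales", "wales"],
    "status": ["single", "single", "married", "single", "single"],
    "age_bin": ["-30", "70+", "70+", "-30", "-30"],
})

# Group by the two outer categories and keep only the column to be counted
grouped_age_bin = df.groupby(["location", "status"])["age_bin"]

# value_counts() adds age_bin as a third index level; unstack() turns it into columns
counts = grouped_age_bin.value_counts().unstack(fill_value=0)
print(counts)
```

fill_value=0 replaces the missing combinations with zero counts instead of NaN, which keeps the later division and formatting steps simple.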

If you need the proportion of each category, you can divide by the group sizes, i.e. grouped_age_bin.size().

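Now, with the group size and the count in each category, you can compute confidence intervals. Simple string aggregation is also possible. To get both the group size and the count at once, I would use pd.DataFrame.transform + pd.Series.combine, so that you only need to write one lambda that receives the count in the category and the group total: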

print(counts.transform(pd.Series.combine, 'index', grouped_age_bin.size(), lambda num, tot: f'{100 * num / tot:.1f}% (n={num})'))

age_bin                    -30        30-39        40-49         50-59         60-69           70+
location status                                                                                   
england  Other      0.0% (n=0)   0.0% (n=0)  50.0% (n=1)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         divorced   0.0% (n=0)  50.0% (n=1)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         married   33.3% (n=1)   0.0% (n=0)  33.3% (n=1)    0.0% (n=0)    0.0% (n=0)   33.3% (n=1)
         single     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         window     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
ireland  Other      0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
         divorced   0.0% (n=0)   0.0% (n=0)   0.0% (n=0)   50.0% (n=1)    0.0% (n=0)   50.0% (n=1)
         married    0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)  100.0% (n=2)    0.0% (n=0)
         single     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         window    33.3% (n=1)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   66.7% (n=2)
jersey   Other      0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)  100.0% (n=1)    0.0% (n=0)
         married    0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         single     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)  100.0% (n=1)    0.0% (n=0)    0.0% (n=0)
scotland Other      0.0% (n=0)   0.0% (n=0)  50.0% (n=1)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         divorced   0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=3)
         married    0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
         single     0.0% (n=0)   0.0% (n=0)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=3)
         window    25.0% (n=1)   0.0% (n=0)   0.0% (n=0)   25.0% (n=1)    0.0% (n=0)   50.0% (n=2)
wales    Other      0.0% (n=0)   0.0% (n=0)   0.0% (n=0)  100.0% (n=1)    0.0% (n=0)    0.0% (n=0)
         divorced  16.7% (n=1)   0.0% (n=0)  33.3% (n=2)    0.0% (n=0)    0.0% (n=0)   50.0% (n=3)
         married    0.0% (n=0)  33.3% (n=1)   0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   66.7% (n=2)
         single     0.0% (n=0)   0.0% (n=0)  33.3% (n=1)   33.3% (n=1)    0.0% (n=0)   33.3% (n=1)
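
The confidence interval mentioned above can be sketched from the same two ingredients, the count per category and the group total. The Wilson score interval used below is a choice made for this sketch (it behaves better than the plain normal approximation for small groups and for proportions of 0 or 1), not something the answer prescribes. The function name wilson_interval and the toy counts are hypothetical.

```python
import numpy as np
import pandas as pd

def wilson_interval(counts, sizes, z=1.96):
    """Wilson score interval for each cell of counts, with one total per row.

    z=1.96 corresponds to an approximately 95% interval.
    """
    num = counts.to_numpy(dtype=float)
    tot = sizes.to_numpy(dtype=float)[:, None]  # column vector: one total per row
    p = num / tot
    denom = 1 + z**2 / tot
    centre = (p + z**2 / (2 * tot)) / denom
    half = z * np.sqrt(p * (1 - p) / tot + z**2 / (4 * tot**2)) / denom
    low = pd.DataFrame(centre - half, index=counts.index, columns=counts.columns)
    high = pd.DataFrame(centre + half, index=counts.index, columns=counts.columns)
    return low, high

# Toy counts (made up): rows are groups, columns are age bins
counts = pd.DataFrame(
    {"-30": [1, 0], "70+": [1, 3]},
    index=pd.Index(["a", "b"], name="group"),
)
sizes = counts.sum(axis=1)  # plays the role of grouped_age_bin.size()

prop = counts.div(sizes, axis="index")
low, high = wilson_interval(counts, sizes)
print(prop, low, high, sep="\n\n")
```

With groups as small as the ones in the example output, the intervals come out very wide, which is itself useful information about how little the proportions can be trusted.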

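By the way, what is Rand in this line: location=pd.DataFrame((Rand(1,6,50)),columns=['location'])?
@AnuragDabas Well spotted, sorry. It's a function I added to control the spread of the input data. Thanks for pointing it out.
Btw, instead of defining a function and then using the apply() method, you can just create a dict and map the values.
@AnuragDabas Thanks for the tip! Would you mind showing me?
Create a dictionary d={1:'england',2:'ireland',3:'scotland',4:'wales',5:'jersey',6:'gurnsey'} and finally use map() and fillna(), i.e. location['location']=location['location']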
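
The last comment appears to be cut off. A sketch of what it seems to describe, assuming the intended call chain is map(d) followed by fillna('Other'), where the 'Other' default mirrors the fallback in label_loc from the question:

```python
import pandas as pd

# Toy input (made up): codes 1..6 are known, 7 has no label
location = pd.DataFrame({"location": [1, 2, 3, 4, 5, 6, 7]})

# Labels spelled as in the question's label_loc function
d = {1: "england", 2: "ireland", 3: "scotland", 4: "wales", 5: "jersey", 6: "gurnsey"}

# map() looks each code up in the dict; codes without an entry become NaN,
# which fillna() then replaces with the fallback label
location["location"] = location["location"].map(d).fillna("Other")
print(location)
```

This replaces the chain of if statements and the row-wise apply() with a single vectorised lookup.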
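Dividing the counts by the group sizes gives the proportion of each age group within each location and status:

print(counts.div(grouped_age_bin.size(), axis='index'))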

age_bin                 -30  30-39     40-49     50-59  60-69       70+
location status                                                        
england  Other     0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.500000    0.0  0.000000  0.000000   0.00  0.500000
         single    0.000000    0.0  0.000000  0.000000   0.00  1.000000
         window    0.250000    0.0  0.000000  0.000000   0.25  0.500000
ireland  Other     0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.000000    0.0  0.000000  0.000000   0.00  1.000000
         single    0.000000    0.0  0.000000  0.000000   1.00  0.000000
         window    0.000000    0.0  0.333333  0.333333   0.00  0.333333
jersey   Other     0.000000    0.0  1.000000  0.000000   0.00  0.000000
         divorced  0.000000    0.0  1.000000  0.000000   0.00  0.000000
         married   0.000000    0.0  0.000000  0.000000   0.00  1.000000
         single    0.000000    0.0  0.200000  0.400000   0.20  0.200000
         window    0.000000    0.5  0.000000  0.000000   0.00  0.500000
scotland divorced  0.333333    0.0  0.000000  0.000000   0.00  0.666667
         married   0.000000    0.0  0.333333  0.333333   0.00  0.333333
         single    0.000000    0.5  0.000000  0.000000   0.00  0.500000
         window    0.000000    0.0  0.500000  0.000000   0.00  0.500000
wales    Other     0.000000    0.5  0.000000  0.000000   0.00  0.500000
         divorced  0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.500000    0.0  0.000000  0.000000   0.00  0.500000
         single    0.000000    0.0  0.000000  0.000000   0.00  1.000000
         window    0.500000    0.0  0.500000  0.000000   0.00  0.000000