Descriptive statistics for multi-level categorical data in Python
Tags: python, pandas, statistics, categorical-data

Below is an example df with three columns, each holding categorical data with several levels. I would like to compute some descriptive statistics for every level of one column within the levels of the other two, for example the number of people in each age group within each location and status, including the count, the proportion and the standard deviation (I think what actually belongs here is a confidence interval). But I can't see how to do this elegantly. Any advice is much appreciated, thanks.
import random
from datetime import date

import pandas as pd

def Rand(start, end, num):
    # Return `num` random integers between start and end (inclusive).
    out = []
    for x in range(num):
        out.append(random.randint(start, end))
    return out

def age(df, col):
    # Convert birth years to ages and bin them into labelled age groups.
    today = date.today()
    age = today.year - df[col]
    bins = [18, 30, 40, 50, 60, 70, 120]
    labs = ['-30', '30-39', '40-49', '50-59', '60-69', '70+']
    group = pd.cut(age, bins, labels=labs)
    return group

birth_year = pd.DataFrame([random.randint(1900, 2000) for x in range(50)], columns=['year'])
birth_year.loc[:, 'age_bin'] = age(birth_year, 'year')

def label_loc(row):
    # Map a numeric location code to its label.
    if row['location'] == 1:
        return 'england'
    if row['location'] == 2:
        return 'ireland'
    if row['location'] == 3:
        return 'scotland'
    if row['location'] == 4:
        return 'wales'
    if row['location'] == 5:
        return 'jersey'
    if row['location'] == 6:
        return 'gurnsey'
    return 'Other'

location = pd.DataFrame(Rand(1, 6, 50), columns=['location'])
location = location.apply(lambda row: label_loc(row), axis=1)

def label_stat(row):
    # Map a numeric status code to its label; codes 5 and 6 fall through to 'Other'.
    if row['status'] == 1:
        return 'married'
    if row['status'] == 2:
        return 'divorced'
    if row['status'] == 3:
        return 'single'
    if row['status'] == 4:
        return 'window'
    return 'Other'

status = pd.DataFrame(Rand(1, 6, 50), columns=['status'])
status = status.apply(lambda row: label_stat(row), axis=1)

# The original labelled these columns 'year', 'gender', 'ethnicity';
# here they are named after their contents, as the answer below expects.
df = pd.DataFrame({
    'year': birth_year['year'],
    'age_bin': birth_year['age_bin'],
    'status': status,
    'location': location,
})
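(A slightly rewritten version of the setup example was linked here.)
Let's take your example:
the number of people in each age group within each location and status.
If you had a continuous variable, such as year, you could simply tell groupby().agg() which statistics you want: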
print(df.groupby(['location', 'status'])['year'].agg(['mean', 'std']))
                          mean        std
location status
england  Other     1961.000000  16.792856
         divorced  1934.666667  30.270998
         married   1917.000000        NaN
         single    1907.000000        NaN
         window    1962.600000  34.011763
ireland  Other     1982.000000        NaN
         divorced  1949.750000  37.303932
         married   1991.000000        NaN
         single    1986.500000   2.121320
         window    1965.500000   3.535534
jersey   Other     1939.800000  26.204961
         divorced  1984.000000        NaN
         married   1986.000000        NaN
         single    1942.500000  54.447222
scotland Other     1942.666667  12.701706
         divorced  1946.000000  49.497475
         married   1914.000000        NaN
         single    1968.000000        NaN
         window    1933.500000  24.748737
wales    Other     1950.666667  39.526363
         divorced  1978.000000        NaN
         married   1959.000000  52.325902
         single    1929.000000        NaN
         window    1990.000000        NaN
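For categorical values, you can count them with value_counts(), which adds an extra index level (that you can unstack). The snippets further down call the grouped column grouped_age_bin and the unstacked counts counts.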
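The snippet for this counting step did not survive on the page. A minimal sketch of what it could look like, assuming a small made-up DataFrame (not the question's random data) and reusing the names grouped_age_bin and counts that the later snippets rely on:

```python
import pandas as pd

# Small hypothetical dataset with the same column names as the question.
df = pd.DataFrame({
    "location": ["england", "england", "england", "wales", "wales"],
    "status": ["single", "single", "married", "single", "single"],
    "age_bin": ["-30", "70+", "70+", "30-39", "30-39"],
})

# Select the categorical column within each (location, status) group.
grouped_age_bin = df.groupby(["location", "status"])["age_bin"]

# value_counts() adds age_bin as a third index level; unstack() turns that
# level into columns and fills combinations that never occur with 0.
counts = grouped_age_bin.value_counts().unstack(fill_value=0)
print(counts)
```

Each row of counts is one (location, status) group and each column one age bin, so grouped_age_bin.size() gives the matching row totals.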
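If you want the share of each category within its group, you can divide by the group size, i.e. counts.div(grouped_age_bin.size(), axis='index').
Now, with the group sizes and the counts, you can compute confidence intervals. Simple string aggregation is possible too. To get both the group size and the count at once, I would use pd.DataFrame.transform + pd.Series.combine, so that you only have to write one lambda that receives the count in a category and the group total: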
print(counts.transform(pd.Series.combine, 'index', grouped_age_bin.size(), lambda num, tot: f'{100 * num / tot:.1f}% (n={num})'))
age_bin                     -30         30-39         40-49         50-59         60-69           70+
location status
england  Other       0.0% (n=0)    0.0% (n=0)   50.0% (n=1)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         divorced    0.0% (n=0)   50.0% (n=1)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         married    33.3% (n=1)    0.0% (n=0)   33.3% (n=1)    0.0% (n=0)    0.0% (n=0)   33.3% (n=1)
         single      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         window      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
ireland  Other       0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
         divorced    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)    0.0% (n=0)   50.0% (n=1)
         married     0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)    0.0% (n=0)
         single      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         window     33.3% (n=1)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   66.7% (n=2)
jersey   Other       0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)    0.0% (n=0)
         married     0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)
         single      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)    0.0% (n=0)    0.0% (n=0)
scotland Other       0.0% (n=0)    0.0% (n=0)   50.0% (n=1)    0.0% (n=0)    0.0% (n=0)   50.0% (n=1)
         divorced    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=3)
         married     0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=2)
         single      0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=3)
         window     25.0% (n=1)    0.0% (n=0)    0.0% (n=0)   25.0% (n=1)    0.0% (n=0)   50.0% (n=2)
wales    Other       0.0% (n=0)    0.0% (n=0)    0.0% (n=0)  100.0% (n=1)    0.0% (n=0)    0.0% (n=0)
         divorced   16.7% (n=1)    0.0% (n=0)   33.3% (n=2)    0.0% (n=0)    0.0% (n=0)   50.0% (n=3)
         married     0.0% (n=0)   33.3% (n=1)    0.0% (n=0)    0.0% (n=0)    0.0% (n=0)   66.7% (n=2)
         single      0.0% (n=0)    0.0% (n=0)   33.3% (n=1)   33.3% (n=1)    0.0% (n=0)   33.3% (n=1)
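The answer says that a count and a group total are enough for a confidence interval but does not show one. A minimal sketch, assuming the Wilson score interval at roughly 95% (the answer does not name a method, so this choice is an assumption), applied to a small made-up long-form table; wilson_interval and the column names are hypothetical:

```python
import numpy as np
import pandas as pd

def wilson_interval(num, tot, z=1.96):
    """Wilson score interval for the proportion num / tot (z=1.96 is about 95%)."""
    p = num / tot
    denom = 1 + z**2 / tot
    centre = (p + z**2 / (2 * tot)) / denom
    half = z * np.sqrt(p * (1 - p) / tot + z**2 / (4 * tot**2)) / denom
    return centre - half, centre + half

# Hypothetical counts per (group, age_bin); not taken from the tables above.
summary = pd.DataFrame({
    "group": ["england/single", "england/single", "wales/divorced", "wales/divorced"],
    "age_bin": ["-30", "70+", "-30", "70+"],
    "n": [1, 3, 0, 6],
})

# Group total repeated on every row, then proportion and interval bounds.
summary["total"] = summary.groupby("group")["n"].transform("sum")
summary["prop"] = summary["n"] / summary["total"]
summary["ci_low"], summary["ci_high"] = wilson_interval(summary["n"], summary["total"])
print(summary)
```

With cells as small as the ones in the question (often n=1 or n=2), any such interval will be very wide, which is worth keeping in mind when reading the percentages.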
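By the way, what is Rand in this line: location=pd.DataFrame((Rand(1,6,50)),columns=['location'])?
@AnuragDabas Well spotted, sorry. It is a function I added to control the spread of the input data. Thanks for pointing it out.
btw, instead of defining a function and then using the apply() method, you can simply create a dict and map the values.
@AnuragDabas Thanks for the tip! Would you mind showing me?
Create a dictionary d={1:'england',2:'ireland',3:'scotland',4:'wales',5:'jersey',6:'gurnsey'} and finally use map() and fillna(), i.e. location['location']=location['location']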
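The comment's expression is cut off, so here is a minimal sketch of the dict-plus-map() idea, assuming a hand-written Series of codes (hypothetical data) and 'Other' as the fallback label used by label_loc in the question:

```python
import pandas as pd

# Hypothetical location codes; 7 has no entry in the mapping below.
codes = pd.Series([1, 2, 3, 4, 5, 6, 7], name="location")

# Code-to-label mapping, with the labels spelled as in the question's label_loc.
d = {1: "england", 2: "ireland", 3: "scotland", 4: "wales", 5: "jersey", 6: "gurnsey"}

# map() yields NaN for codes missing from the dict; fillna() supplies the default.
location = codes.map(d).fillna("Other")
print(location.tolist())
```

This replaces the row-wise apply() with a single vectorised lookup, and the same pattern would work for the status column with its own dict.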
Snippet and output for the division by group size described in the answer:
print(counts.div(grouped_age_bin.size(), axis='index'))
age_bin                 -30  30-39     40-49     50-59  60-69       70+
location status
england  Other     0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.500000    0.0  0.000000  0.000000   0.00  0.500000
         single    0.000000    0.0  0.000000  0.000000   0.00  1.000000
         window    0.250000    0.0  0.000000  0.000000   0.25  0.500000
ireland  Other     0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.000000    0.0  0.000000  0.000000   0.00  1.000000
         single    0.000000    0.0  0.000000  0.000000   1.00  0.000000
         window    0.000000    0.0  0.333333  0.333333   0.00  0.333333
jersey   Other     0.000000    0.0  1.000000  0.000000   0.00  0.000000
         divorced  0.000000    0.0  1.000000  0.000000   0.00  0.000000
         married   0.000000    0.0  0.000000  0.000000   0.00  1.000000
         single    0.000000    0.0  0.200000  0.400000   0.20  0.200000
         window    0.000000    0.5  0.000000  0.000000   0.00  0.500000
scotland divorced  0.333333    0.0  0.000000  0.000000   0.00  0.666667
         married   0.000000    0.0  0.333333  0.333333   0.00  0.333333
         single    0.000000    0.5  0.000000  0.000000   0.00  0.500000
         window    0.000000    0.0  0.500000  0.000000   0.00  0.500000
wales    Other     0.000000    0.5  0.000000  0.000000   0.00  0.500000
         divorced  0.000000    0.0  0.000000  0.000000   0.00  1.000000
         married   0.500000    0.0  0.000000  0.000000   0.00  0.500000
         single    0.000000    0.0  0.000000  0.000000   0.00  1.000000
         window    0.500000    0.0  0.500000  0.000000   0.00  0.000000