R cut2 splitting into unequal buckets
I am currently doing some data processing and have been looking for a way to create deciles with the same number of observations in each group. I came across the Hmisc package and its cut2 function, which, as I understand it, should split the data into 10 buckets with an equal number of observations in each when g=10 is specified. However, the output of this function is noticeably uneven. Am I using cut2 incorrectly? The code I am using:
library(Hmisc)
testdata <- data.frame(rating= c(8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 8, 6, 8, 8, 8, 8, 6, 8, 6, 8, 4, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 6, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 6, 8, 8, 6, 4, 8, 8, 8, 8, 8, 6, 8, 8, 8, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 2, 8, 6, 8, 8, 8, 6, 8, 8, 6, 6, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 8, 6, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 6, 8, 8, 8, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 8, 8, 8, 8, 8, 6, 8, 8, 8, 6)
,age=c(0, 0, 0, 0, 3, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 9, 10, 10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 30, 30, 30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 39, 39, 39, 40, 40, 41, 41, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 44, 44, 44, 44, 44, 44, 45, 45, 45, 45, 45, 46, 46, 46, 46, 47, 47, 47, 48, 48, 48, 54, 54, 54, 56, 56, 58, 59, 59, 59, 59, 60, 60, 60, 61, 66, 66, 70, 72))
# Ask cut2 for 10 quantile groups (g = 10), then count the observations per group
cutcutcut <- cut2(testdata$age, g = 10)
testtable <- table(cutcutcut)
The answer to your question lies in the distribution of your data:
table(testdata$age)
# 0 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# 4 1 4 6 4 3 4 2 2 16 9 7 5 10 6 7 7 13 4 2 9
# 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
# 23 10 18 17 8 5 3 2 8 2 2 5 9 5 5 3 2 8 7 3 6
# 45 46 47 48 54 56 58 59 60 61 66 70 72
# 5 4 3 3 3 2 1 4 3 1 2 1 1
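Some ages are shared by a large number of individuals (for example, 16 individuals are aged 12 and 23 are aged 24). Because the cutting algorithm has to place all individuals of exactly the same age in the same bucket, this can leave the buckets unbalanced.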
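To see why ties get in the way, here is a minimal base R sketch on made-up toy data (the vector x and the helper name left_size are illustrative, not from the question). It lists the size of the lower group for every possible value-based cut point:

```r
# Toy data: six observations share the value 1, so no value-based cut
# point can place exactly five observations in each of two groups.
x <- c(1, 1, 1, 1, 1, 1, 2, 3, 4, 5)

# Size of the lower group for every candidate split "x <= v"
distinct <- sort(unique(x))
left_size <- sapply(distinct, function(v) sum(x <= v))
names(left_size) <- distinct

# The achievable lower-group sizes are 6, 7, 8, 9 and 10; 5 is not among them
print(left_size)
```

The same thing happens with the age column: the cumulative counts jump over the ideal bucket boundaries wherever many people share an age.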
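Your data contain 309 observations in total and you are asking for 10 buckets, so ideally you would have 31 observations in each of 9 buckets and 30 in the last one. Currently the last bucket is defined as [46,72], which contains 28 elements (too few). Extending it to [45,72] would give 33 elements (too many). Because 5 elements have the value 45, there is no way to split the data so that the last bucket holds 30 or 31 observations.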
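These numbers can be checked directly. The sketch below rebuilds the age column from the frequency table (the names age_values and age_counts are introduced here for illustration) and counts the observations in the two candidate last buckets. It then shows one possible workaround, which is my own assumption and not something cut2 does: bucketing by rank with ties broken by position. This gives near-equal group sizes, but people of the same age can end up in different buckets.

```r
# Rebuild the age vector from the frequency table of testdata$age
age_values <- c(0, 3, 4, 6:48, 54, 56, 58, 59, 60, 61, 66, 70, 72)
age_counts <- c(4, 1, 4, 6, 4, 3, 4, 2, 2, 16, 9, 7, 5, 10, 6, 7, 7, 13, 4, 2, 9,
                23, 10, 18, 17, 8, 5, 3, 2, 8, 2, 2, 5, 9, 5, 5, 3, 2, 8, 7, 3, 6,
                5, 4, 3, 3, 3, 2, 1, 4, 3, 1, 2, 1, 1)
stopifnot(length(age_values) == length(age_counts))
age <- rep(age_values, times = age_counts)

n_total   <- length(age)      # 309 observations
n_from_46 <- sum(age >= 46)   # 28: size of the bucket [46,72]
n_from_45 <- sum(age >= 45)   # 33: size of the bucket [45,72]

# Hypothetical workaround (not cut2 behaviour): bucket by rank, breaking
# ties by position, so equal ages may be split across buckets.
bucket <- ceiling(rank(age, ties.method = "first") * 10 / n_total)
bucket_sizes <- table(bucket)
print(bucket_sizes)
```

Whether splitting tied ages is acceptable depends on the analysis. If each bucket has to correspond to a clean age interval, the uneven sizes from cut2 are the best that can be done with this data.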