Creating a pivot table and grouping a field in R
I am trying to build a pivot table in R for my Excel dataset. I need to group the numbers in a column named "Weight", whose values range from 70 to 100. Each weight has a price. For each weight category I need the mean, maximum, and minimum price, plus the number of products. The dataset has about 3,000 observations of 25 variables; Weight and Price are two of them.
Data snippet:
Weight Price Order No. Date_Ordered Invoiced_Date Region
85 $2300
78 $5600
100 $3490
95 $2450
90 $5890
I am looking for something like:
Weight Count Mean(Price) Min(Price) Max(Price)
70-75(including 75)
75-80
80-85
85-90
90-95
95-100
I was able to get the counts, but not the mean, minimum, and maximum for each weight category:
#Import the dataset
dataset = read.xlsx('Product_Data.xlsx')
gdataset <- group_by(dataset, Weight)
attach(gdataset)
periods <- seq(from = 70, to = 100, by = 5)
snip <- cut(Weight, breaks = periods, right = TRUE, include.lowest = TRUE)
report <- cbind(table(snip))
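Your data is a bit sparse, so I'll create my own data for this answer. I'll ignore the other columns, though their presence in your data won't affect anything.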
set.seed(2)
n <- 100
dat <- data.frame(
Weight = sample(100, size=n, replace=TRUE),
Price = sample(9999, size=n, replace=TRUE)
)
head(dat)
# Weight Price
# 1 19 2010
# 2 71 4276
# 3 58 9806
# 4 17 8289
# 5 95 2870
# 6 95 5959
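Now we just bin the weights with cut(), split the prices by bin, run a simple summary function on each group, and bind the results back into one table.
# Assign each weight to a 5-unit bin; this column must exist before the by() call
dat$WeightBin <- cut(dat$Weight, seq(0, 100, by = 5))
do.call(rbind, by(dat$Price, dat$WeightBin, function(x) {
  # Apply each summary function to the prices in this bin and name the results
  setNames(
    sapply(c(length, mean, min, max), function(f) f(x)),
    c("Count", "Mean(Price)", "Min(Price)", "Max(Price)")
  )
}))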
# Count Mean(Price) Min(Price) Max(Price)
# (0,5] 5 3919.000 1822 9536
# (5,10] 3 4287.000 1782 5690
# (10,15] 5 5402.200 2739 8989
# (15,20] 11 5192.545 1183 9192
# (20,25] 3 2868.667 137 7363
# (25,30] 6 6594.500 2855 9657
# (30,35] 5 2960.200 777 7486
# (35,40] 6 4937.000 850 9749
# (40,45] 7 5986.000 1307 9527
# (45,50] 4 5957.750 1475 9754
# (50,55] 3 3077.333 1287 4786
# (55,60] 4 4285.500 247 9806
# (60,65] 3 2633.000 450 6656
# (65,70] 4 4244.250 369 9038
# (70,75] 3 2616.333 652 4276
# (75,80] 5 7183.800 3734 8537
# (80,85] 6 4273.667 229 9788
# (85,90] 6 6659.000 1388 9637
# (90,95] 4 4301.750 2870 5959
# (95,100] 7 3967.857 872 8727
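The same binning can be pointed at the 70-100 range from the question. The following is a minimal sketch, not the code above: it uses split() and sapply() instead of by(), assumes Price is already numeric, and adds two hypothetical rows (weights 70 and 75) so the lowest bin is populated. The names orders, groups, and report are illustrative.

```r
# Rows 1-5 come from the question's snippet; rows 6-7 are hypothetical,
# added so the lowest bin has something in it.
orders <- data.frame(
  Weight = c(85, 78, 100, 95, 90, 70, 75),
  Price  = c(2300, 5600, 3490, 2450, 5890, 1000, 3000)
)

# Breaks at 70, 75, ..., 100. right = TRUE closes each bin on the right;
# include.lowest = TRUE also closes the first bin on the left, so 70 is kept.
orders$WeightBin <- cut(orders$Weight, breaks = seq(70, 100, by = 5),
                        right = TRUE, include.lowest = TRUE)

# Split prices by bin and compute the four statistics for each group.
groups <- split(orders$Price, orders$WeightBin, drop = TRUE)
report <- t(sapply(groups, function(x) {
  c(Count = length(x), Mean = mean(x), Min = min(x), Max = max(x))
}))
report
```

Without include.lowest = TRUE, a weight of exactly 70 would fall outside every bin and become NA, which is why the question's own cut() call sets it.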
dplyr
From the presence of group_by, I infer that you intend to use dplyr. Here is an alternative that gives similar results (starting from my original data):
library(dplyr)
dat %>%
group_by(Bin = cut(Weight, seq(0, 100, by=5))) %>%
summarize(
Count = n(),
Mean = mean(Price),
Min = min(Price),
Max = max(Price)
)
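One caveat when applying either approach to the spreadsheet in the question: the snippet shows prices such as "$2300". If that column is read in as text rather than numbers (an assumption, since it depends on how the Excel cells are formatted), mean(), min(), and max() will not give numeric results until it is converted. A minimal base R sketch of that conversion, with price_raw and price as illustrative names:

```r
# Assumption: Price arrives as text like "$2300" (values from the question's snippet).
price_raw <- c("$2300", "$5600", "$3490", "$2450", "$5890")

# Remove "$" and any thousands separators, then convert to numeric.
price <- as.numeric(gsub("[$,]", "", price_raw))

mean(price)
```

If the column already comes through as numeric, this step is unnecessary.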