R: filling a dataset more efficiently with sampling
Given myletters:
library(tidyverse)
myletters <- letters
myletters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
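The following approach does what I need for my simulation, but it does not scale well: it fills a data frame column with up to 400 letters in a single cell, which is both clumsy and inefficient: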
output <- crossing(drawsX = 1:100,
                   trial = 1:100) %>%
  mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(myletters, 4, replace = FALSE)))),
         all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))
output
# plot
output %>%
  group_by(drawsX) %>%
  summarise(prob_of_all_letters = mean(all_letters)) %>%
  ggplot(., aes(drawsX, prob_of_all_letters)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "Probability")
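Ideally I would like to run many more simulations, for example trial = 1:100000, but the approach above is too inefficient for that.
1) Is there a more efficient way to fill my dataset with samples (or to use a matrix instead)?
2) Also, is there an analytical way to solve this in R rather than by simulation? For example, what is the probability of getting all 26 letters from 10 draws of 4 samples each?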
Thanks.
Here is a slightly improved version. The code is more efficient and cleaner:
sample_sets = function(replicates, k, set = letters) {
  draws = vapply(1:replicates, function(z, ...) sample.int(...),
                 FUN.VALUE = integer(k),
                 n = length(set), size = k, replace = FALSE)
  all(seq_along(set) %in% draws)
}
## example use
output <- crossing(
  drawsX = 1:100,
  trial = 1:100
) %>%
  mutate(
    outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
  )
## timing
system.time({output <- crossing(
  drawsX = 1:100,
  trial = 1:100
) %>%
  mutate(
    outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
  )
})
#   user  system elapsed
#   2.79    0.04    2.95
## original way
system.time({output <- crossing(drawsX = 1:100,
                                trial = 1:100) %>%
  mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(letters, 4, replace = FALSE)))),
         all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))})
#   user  system elapsed
#   4.96    0.06    5.18
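So on this data it takes roughly 40% less time (2.95 s versus 5.18 s elapsed), and hopefully the gain keeps growing as the number of draws increases.
Drawing the samples is almost certainly not your bottleneck. Before trying to speed up individual pieces, profile the code to see which part is actually slow. If it were me, I would write a function that takes X and myletters as input and returns TRUE or FALSE. Using a matrix inside the function may be faster than using a data frame. You can then replicate (or rerun, if you prefer) that function for each trial and store only the TRUE or FALSE result, rather than storing the X draws for each of the 100000 trials, which seems like a waste of memory.
Yes, there is probably an analytical solution. You could ask for help on math.stackexchange or stats.stackexchange.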
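The two suggestions in the comments (a function that returns only TRUE or FALSE per trial, and an analytical answer) could look roughly like the sketch below. It is an illustration only, not code from the thread: covers_all, simulate_prob and exact_prob are hypothetical names. The exact formula is the standard inclusion-exclusion argument over the letters that never appear, and it assumes every draw is an independent, uniformly chosen set of k distinct letters.

```r
# Hypothetical helper (not from the thread): does one trial of `n_draws`
# draws of `k` letters, each without replacement, cover the whole set?
covers_all <- function(n_draws, k = 4, set = letters) {
  n <- length(set)
  # k x n_draws integer matrix; each column holds one draw of k distinct indices
  draws <- matrix(
    vapply(seq_len(n_draws), function(i) sample.int(n, k), integer(k)),
    nrow = k
  )
  # tabulate() counts how often each index 1..n occurs; all must occur
  all(tabulate(draws, nbins = n) > 0L)
}

# Estimate the coverage probability, keeping one logical per trial
# instead of the drawn letters themselves.
simulate_prob <- function(n_draws, trials = 1000, k = 4, set = letters) {
  mean(replicate(trials, covers_all(n_draws, k, set)))
}

# Exact probability by inclusion-exclusion (assumption: independent draws,
# each a uniform k-subset of n letters). A single draw avoids a fixed group
# of j letters with probability choose(n - j, k) / choose(n, k).
exact_prob <- function(n_draws, k = 4, n = 26) {
  j <- 0:n
  p_avoid <- choose(n - j, k) / choose(n, k)
  sum((-1)^j * choose(n, j) * p_avoid^n_draws)
}
```

With these definitions, exact_prob(10) would answer question 2 directly, and simulate_prob(10, trials = 1e5) should land close to it. Profiling is still worth doing before assuming this version is fast enough for your case.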