R: Filling a dataset with samples more efficiently


Given myletters:

library(tidyverse)
myletters <- letters
myletters
#  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
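The following approach does what I need in my simulation, but it does not scale well: it fills a data frame column with up to 400 letters in a single cell, so it is both clumsy and inefficient: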

output <- crossing(drawsX = 1:100,
                       trial = 1:100) %>%
  mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(myletters, 4, replace = F)))),
         all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))
output

#plot
output %>%
  group_by(drawsX) %>%
  summarise(prob_of_all_letters = mean(all_letters)) %>% 
  ggplot(., aes(drawsX, prob_of_all_letters)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "Probability")
Ideally I would like to run many more simulations, e.g. trial = 1:100000, but the approach above is too inefficient for that.

1) Is there a more efficient way to fill my dataset with samples (or to use a matrix instead)?

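2) Also, is there an analytical way to solve this problem in R instead of simulating? E.g. what is the probability of getting all 26 letters from 10 draws of 4 letters each?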


Thanks.

Here is a slightly improved version. The code is more efficient and cleaner:

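# TRUE if `replicates` draws of k elements each cover every element of `set`
sample_sets = function(replicates, k, set = letters) {
  # k x replicates matrix of indices; each column is one draw without replacement
  draws = vapply(1:replicates, function(z, ...) sample.int(...),
                 FUN.VALUE = integer(k), n = length(set), size = k, replace = FALSE)
  all(seq_along(set) %in% draws)
}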

## example use
output <- crossing(
    drawsX = 1:100,
    trial = 1:100
  ) %>%
  mutate(
    outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
  )

## timing
system.time({output <- crossing(
    drawsX = 1:100,
    trial = 1:100
  ) %>%
  mutate(
    outcome = map_lgl(drawsX, sample_sets, set = letters, k = 4),
  )
})
# user  system elapsed 
# 2.79    0.04    2.95 


## original way
system.time({output <- crossing(drawsX = 1:100,
                       trial = 1:100) %>%
  mutate(draws_output = map(drawsX, ~ unlist(rerun(., sample(letters, 4, replace = F)))),
         all_letters = map_lgl(draws_output, ~ n_distinct(.) == 26))})
# user  system elapsed 
# 4.96    0.06    5.18 

So it is about 40% faster on these data. Hopefully the gain will keep growing as the number of draws increases.

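Drawing the samples is almost certainly not your bottleneck. Before trying to speed up individual pieces, you should profile your code to see which part is actually slow. If it were me, I would write a function that takes X and myletters as input and returns TRUE or FALSE. Using a matrix inside the function will probably be faster than a data frame. You can then replicate that function for each trial (or rerun it, if you prefer) and store only the TRUE/FALSE result, rather than storing the X draws for every one of the 100000 trials, which seems like a waste of memory. And yes, there is probably an analytical solution. You could ask for help on math.stackexchange or stats.stackexchange.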
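A minimal base-R sketch of what this comment describes might look like the following. The names covers_all and estimate_prob are hypothetical (they do not appear anywhere in the thread), and the sketch assumes each draw takes k letters without replacement, as in the question:

```r
# Hypothetical helper: TRUE if X draws of k letters each (without replacement
# within a draw) together contain every element of `set`.
covers_all <- function(X, set = letters, k = 4) {
  # Integer indices only; for k > 1 this is a k x X matrix, one column per draw
  draws <- replicate(X, sample.int(length(set), k))
  # Every letter was seen iff the number of distinct indices equals length(set)
  length(unique(as.vector(draws))) == length(set)
}

# Hypothetical wrapper: share of trials in which all letters were seen.
# Only one TRUE/FALSE per trial is kept, never the draws themselves.
estimate_prob <- function(X, trials = 1000, set = letters, k = 4) {
  mean(replicate(trials, covers_all(X, set, k)))
}

set.seed(1)
estimate_prob(10)
```

Memory use stays flat as trials grows because the draws are thrown away once the logical result is computed. Whether this is actually faster than the data frame version is something to check with system.time or a profiler rather than assume.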
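On the analytical side of the question, one possible route (not given anywhere in the thread, so treat it as a sketch) is inclusion-exclusion. Assume each draw is a uniformly random set of k distinct letters out of N and that draws are independent. A fixed set of j letters is then missed by a single draw with probability C(N - j, k) / C(N, k), so the probability that n draws cover all N letters is the alternating sum over j of (-1)^j C(N, j) [C(N - j, k) / C(N, k)]^n. The function name prob_all_letters is hypothetical:

```r
# Hypothetical helper: exact probability that n_draws independent draws of
# k letters each (without replacement within a draw) cover all N letters.
prob_all_letters <- function(n_draws, k = 4, N = 26) {
  j <- 0:N
  # P(one draw misses a given set of j letters); choose() returns 0 when
  # fewer than k letters remain, so those terms vanish on their own
  miss_one_draw <- choose(N - j, k) / choose(N, k)
  # Inclusion-exclusion over the number j of letters that are never drawn
  sum((-1)^j * choose(N, j) * miss_one_draw^n_draws)
}

prob_all_letters(10)                           # 10 draws of 4 letters
sapply(c(10, 25, 50, 100), prob_all_letters)   # a few points on the curve
```

Because this is an alternating sum in floating point, results that should be exactly 0 (fewer than 7 draws of 4 letters cannot cover 26) may come out as tiny non-zero values. Comparing a few points against the simulated proportions is a reasonable sanity check.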