How can I simulate values in R between a lower and an upper bound, assuming a uniform distribution?
Tags: r, statistics
I have the following data:
# A tibble: 1,100 x 3
  income minimum maximum
   <dbl>   <dbl>   <dbl>
1     NA      NA      NA
2      0       0      25
3      0       0      25
4     NA      NA      NA
5      4     100     200
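I would like to simulate a value between the minimum and the maximum of each row, assuming the values are uniformly distributed between those bounds.
Do you know how to do this?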
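The simulated value should appear in a new column to the right of the income variable.
Try this approach using apply(). You can call runif() at row level, using the lowerboundary and upperboundary variables to generate the values. For rows that contain NA you will get NaN. Here is the code: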
#Code
# Draw one uniform value per row; x[1] is lowerboundary, x[2] is upperboundary
df$Salary <- apply(df[, -1], 1, function(x) runif(1, x[1], x[2]))
Some data used:
#Data
df <- structure(list(income = c(NA, 0L, 0L, NA, 4L, NA, NA, 4L, NA,
12L), lowerboundary = c(NA, 0L, 0L, NA, 425L, NA, NA, 425L, NA,
2400L), upperboundary = c(NA, 50L, 50L, NA, 600L, NA, NA, 600L,
NA, 3000L)), row.names = c(NA, -10L), class = "data.frame")
This might be what you want:
df$salary <- runif(nrow(df)) * (df$upperboundary - df$lowerboundary) + df$lowerboundary
However, the bounds can also be passed to runif() directly:
df$salary <- runif(nrow(df), df$lowerboundary, df$upperboundary)
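If the NaN values and the "NAs produced" warning for rows with missing bounds are unwanted, one possible variation is to draw values only for the complete rows. This is a minimal sketch, not part of the original answers; it uses a shortened copy of the sample data, and the helper vector ok and the call to set.seed() are my own additions for illustration.

```r
# Shortened copy of the sample data, same column names as in the answers
df <- data.frame(
  income        = c(NA, 0, 0, NA, 4),
  lowerboundary = c(NA, 0, 0, NA, 425),
  upperboundary = c(NA, 50, 50, NA, 600)
)

set.seed(42)  # only to make the draws reproducible

# TRUE for rows where both bounds are present (hypothetical helper name)
ok <- !is.na(df$lowerboundary) & !is.na(df$upperboundary)

# Start with NA everywhere, then fill in only the complete rows
df$salary <- NA_real_
df$salary[ok] <- runif(sum(ok), df$lowerboundary[ok], df$upperboundary[ok])

df
```

Rows without bounds then keep a plain NA instead of NaN, and runif() is never called with missing arguments.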
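Let's take a look at option 1, with a manually defined minimum and maximum.
By default, runif(1) is equivalent to:
runif(1, min = 0, max = 1)
So it returns a random number between 0 and 1, drawn from a uniform distribution.
To return a random number between two different limits, e.g. min = 10 and max = 20, you can write:
runif(1, min = 10, max = 20)
or
runif(1) * (20 - 10) + 10
Taking the limiting case in which runif returns 1 (with the default arguments it never returns the endpoints exactly):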
1 * (20 - 10) + 10
==> 20 - 10 + 10
==> 20
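To check that the scaled version and the direct version describe the same thing, both can be run from the same seed. A minimal sketch, assuming base R's default random number generator; the seed value and the names n, scaled and direct are arbitrary choices for illustration.

```r
n <- 5

set.seed(123)
scaled <- runif(n) * (20 - 10) + 10     # scale U(0, 1) draws to [10, 20]

set.seed(123)
direct <- runif(n, min = 10, max = 20)  # let runif() handle the bounds

all.equal(scaled, direct)  # compares the two vectors of draws
range(scaled)              # smallest and largest simulated value
```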
Here is another option, a solution using dplyr:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(salary = runif(1, lowerboundary, upperboundary)) %>%
  ungroup()
Here is a speed comparison. "maths" is the fastest:
microbenchmark::microbenchmark(
  apply  = apply(df[-1], 1, function(x) runif(1, x[1], x[2])),
  maths  = runif(nrow(df)) * (df$upperboundary - df$lowerboundary) + df$lowerboundary,
  maths2 = runif(nrow(df), df$lowerboundary, df$upperboundary),
  dplyr  = df %>% rowwise() %>% mutate(runif = runif(1, lowerboundary, upperboundary)) %>% ungroup()
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> apply 907.1 955.90 1175.188 1023.70 1280.90 4455.0 100
#> maths 16.8 26.05 32.651 31.25 38.65 75.0 100
#> maths2 117.8 128.00 156.533 136.60 175.15 336.7 100
#> dplyr 1424.2 1496.60 1821.068 1661.15 1989.20 3952.7 100
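The timings above come from the 10-row sample data, so they may be dominated by fixed call overhead rather than by the sampling itself. Below is a rough sketch for repeating the two vectorised variants on a larger input with base system.time(). The data frame big, its size n and its bounds are hypothetical and contain no NA rows; no timings are shown because they depend on the machine.

```r
set.seed(1)
n <- 1e5  # hypothetical number of rows

# Hypothetical bounds: every upper bound is above its lower bound
big <- data.frame(lowerboundary = runif(n, 0, 100))
big$upperboundary <- big$lowerboundary + runif(n, 1, 100)

# "maths": scale U(0, 1) draws by hand
t_maths <- system.time(
  s1 <- runif(n) * (big$upperboundary - big$lowerboundary) + big$lowerboundary
)

# "maths2": pass the bounds to runif() directly
t_maths2 <- system.time(
  s2 <- runif(n, big$lowerboundary, big$upperboundary)
)

# Elapsed seconds for each variant
c(maths = t_maths[["elapsed"]], maths2 = t_maths2[["elapsed"]])
```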
We can use map2_dbl from purrr:
library(purrr)
library(dplyr)
df %>%
  mutate(salary = map2_dbl(lowerboundary, upperboundary, ~ runif(1, .x, .y)))
-output
# income lowerboundary upperboundary salary
#1 NA NA NA NaN
#2 0 0 50 33.771312
#3 0 0 50 3.577857
#4 NA NA NA NaN
#5 4 425 600 514.912989
#6 NA NA NA NaN
#7 NA NA NA NaN
#8 4 425 600 516.179313
#9 NA NA NA NaN
#10 12 2400 3000 2815.442543
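Comments:
I think with(df, runif(nrow(df), lowerboundary, upperboundary)) would be enough.
Sorry, I made a mistake. You are right! I will edit my answer.
Thanks for the edit; I guess that if the size of the data increases, maths and maths2 will perform similarly. Apparently the presence of NAs slows maths2 down.
Please check that your data frame is called df. Otherwise, replace df in the solution with the actual name of your data frame.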