R: for each observation, find the corresponding percentile within the subset defined by a factor


Suppose I have a data frame like this:

df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

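With dplyr:

library(dplyr)
df %>%
  group_by(f) %>%
  # quantile(value) defaults to quartile breaks; pass probs = seq(0, 1, 0.01) for percentiles
  mutate(quant = findInterval(value, quantile(value)))
#> Source: local data frame [400 x 3]
#> Groups: f [4]
#>
#>     f       value quant
#> 1   a  0.51184061     3
#> 2   b  0.44362348     3
#> 3   c -1.04869448     1
#> 4   d -2.41772425     1
#> 5   a  0.10738332     3
#> 6   b -0.58630348     1
#> 7   c  0.34376820     3
#> 8   d  0.68322738     4
#> 9   a  1.00232314     4
#> 10  b  0.05499391     3
#> # ... with 390 more rows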
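
To make the binning step concrete, here is a small base R sketch of what `findInterval(value, quantile(value))` computes. The toy vector `x` is an assumption chosen for illustration and is not from the original post.

```r
# Toy vector, chosen only for illustration (not from the original post).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# By default quantile() returns the 0%, 25%, 50%, 75% and 100% cut points.
breaks <- quantile(x)

# findInterval() returns, for each value, how many cut points are <= it,
# so the group maximum lands in an extra fifth bin.
bins <- findInterval(x, breaks)

# rightmost.closed = TRUE keeps the maximum inside the last regular bin.
bins_closed <- findInterval(x, breaks, rightmost.closed = TRUE)

print(bins)
print(bins_closed)
```

With the default arguments the largest value in each group gets bin 5 rather than 4, so `rightmost.closed = TRUE` may be worth adding if exactly four bins are wanted.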

With data.table:

library(data.table)
dt
#>      f      value quant
#>   1: a  0.3608395     3
#>   2: b -0.1028948     2
#>   3: c -2.1903336     1
#>   4: d  0.7470262     4
#>   5: a  0.5292031     3
#>  ---
#> 396: d -1.3475332     1
#> 397: a  0.1598605     3
#> 398: b -0.4261003     2
#> 399: c  0.3951650     3
#> 400: d -1.4409000     1
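
The call that built `dt` does not appear in the text above. A minimal sketch, assuming it mirrors the dplyr version (same `findInterval()` and `quantile()` expression, grouped by `f`), might look like this. The seed and the use of `as.data.table()` are assumptions, so the values will not match the output shown.

```r
library(data.table)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

# Copy df into a data.table (assumption: the original may have used setDT()).
dt <- as.data.table(df)

# := adds the column by reference; by = f evaluates the expression per group.
dt[, quant := findInterval(value, quantile(value)), by = f]

print(dt)
```

Because `:=` modifies `dt` in place, no reassignment is needed, which is part of why this route tends to scale well to large tables.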

I think data.table is faster, but a solution that uses no packages is to define a function based on cut (or findInterval) and quantile:

cut2 <- function(x) {
  # 101 break points give 100 percentile bins, labelled 1 to 100
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.01)), include.lowest = TRUE, labels = 1:100)
}

and then apply it within each level of the factor using ave:

df$newColumn <- ave(df$value, df$f, FUN = cut2)

Since the number of rows may reach the millions, data.table will perform better.

Simple and fast. I used the data.table approach.
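
For the package-free route, an end-to-end run might look like the sketch below. It is a hedged illustration rather than the answerer's exact code: the seed, the `pct` column name, and the `as.integer()` wrapper are assumptions added here, and the `findInterval()` variant is included only as a cross-check.

```r
set.seed(42)  # assumption: seed added so the run is repeatable
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

# Percentile bin (1..100) of each value relative to its own vector.
cut2 <- function(x) {
  breaks <- quantile(x, probs = seq(0, 1, 0.01))
  # as.integer() turns the factor codes into plain bin numbers.
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE, labels = 1:100))
}

# ave() splits value by f, applies cut2 to each piece, and returns the
# results in the original row order.
df$pct <- ave(df$value, df$f, FUN = cut2)

# Cross-check with findInterval(); rightmost.closed = TRUE keeps each
# group's maximum in bin 100 instead of an extra bin 101.
pct_fi <- ave(df$value, df$f, FUN = function(x) {
  findInterval(x, quantile(x, probs = seq(0, 1, 0.01)),
               rightmost.closed = TRUE)
})

print(head(df))
```

The two versions agree on continuous data like this. They can differ when an observation falls exactly on an interior break, because `cut()` closes intervals on the right by default while `findInterval()` closes them on the left.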