
How to efficiently mutate multiple columns of a large data frame in R

Tags: r, performance, function, datatable, tidyr

I would be grateful for help in efficiently applying my function to multiple columns of my large data frame DT_large.

When I apply my function with dplyr::mutate_at() to a small data frame DT_small, it works well and is efficient. However, when applied to my relatively large dataset, it takes several hours to deliver the desired output.

This may be because of some mistake in my code that makes dplyr::mutate_at() inefficient on a relatively large dataset. Or it may be that dplyr::mutate_at() is simply not efficient for relatively large datasets like mine.

Either way, I would appreciate any help to resolve my problem, that is, a faster way to correctly apply my function to DT_large and deliver the desired output, just as it does when applied to DT_small.

# Small dataset

DT_small<-structure(list(.id = 1:10, `_E1.1` = c(0.475036902, 0.680123015, 
0.896920608, 0.329908621, 0.652288128, 0.408813318, 0.486444822, 
0.429333778, 2.643293032, 0.782194143), `_E1.2` = c(79.22653114, 
0.680123015, 4.088529776, 0.232076989, 0.652288128, 0.329908621, 
0.486444822, 0.429333778, 2.643293032, 0.963554482), `_E1.3` = c(0.466755502, 
0.680123015, 0.461887024, 1.236938197, 0.652288128, 0.408813318, 
0.486444822, 0.429333778, 2.643293032, 0.95778584), `_E1.4` = c(1.608298119, 
0.680123015, 0.578464999, 0.317125521, 0.652288128, 0.408813318, 
0.486444822, 0.429333778, 2.643293032, 2.125841957), `_E1.5` = c(0.438424932, 
0.680123015, 0.896920608, 0.366118007, 0.652288128, 1.007079029, 
0.486444822, 0.429333778, 2.643293032, 0.634134022), `_E10.1` = c(0.45697607, 
0.647681721, 1.143509029, 0.435735621, 0.49400961, 0.501421816, 
0.461123723, 0.568477247, 1.756598213, 0.67895017), `_E10.2` = c(35.30312978, 
0.647681721, 2.58357783, 0.25514789, 0.49400961, 0.435735621, 
0.461123723, 0.568477247, 1.756598213, 0.776970116), `_E10.3` = c(0.79477661, 
0.647681721, 0.672430959, 0.886991224, 0.49400961, 0.501421816, 
0.461123723, 0.568477247, 1.756598213, 1.019701072), `_E10.4` = c(1.912254794, 
0.647681721, 0.840757508, 0.414669983, 0.49400961, 0.501421816, 
0.461123723, 0.568477247, 1.756598213, 1.576577576), `_E10.5` = c(0.429335115, 
0.647681721, 1.143509029, 0.336512868, 0.49400961, 0.82434125, 
0.461123723, 0.568477247, 1.756598213, 0.639407175), `_E100.1` = c(0.567579678, 
0.780423094, 1.739967261, 0.282217304, 0.784904687, 0.319146371, 
0.585056235, 0.596494912, 3.545358563, 0.899595619)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))
# Large dataset
# 1) download DT_large.csv to your directory from https://jmp.sh/iC6WOzw
# 2) DT_large <- read_csv("DT_large.csv")

# My function applied to my small dataset
# This perfectly delivers the desired output in seconds:
DT_small %>% mutate_at(vars(matches("_E")),
                       funs(ifelse(
                         DT_small$. > quantile(DT_small$., probs = 0.80),
                         quantile(DT_small$., probs = 0.80),
                         DT_small$.)))
# My function applied to my large dataset
# This takes several hours to deliver the desired output:
DT_large %>% mutate_at(vars(matches("_E")),
                       funs(ifelse(
                         DT_large$. > quantile(DT_large$., probs = 0.80),
                         quantile(DT_large$., probs = 0.80),
                         DT_large$.)))
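
For reference, funs() is deprecated in current dplyr; the same capping can be written with across() and pmin(). A minimal sketch, assuming dplyr >= 1.0 (not part of the original question):

library(dplyr)

# .x is the current column, so each "_E" column is capped at its own
# 80th-percentile value, computed once per column
DT_small %>%
  mutate(across(matches("_E"),
                ~ pmin(.x, quantile(.x, probs = 0.80))))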

Thanks in advance for your help.

You can get a considerable speedup by (1) calculating the quantile only once and (2) applying a new, more economical function to the columns.

On my machine this approach is about 15x faster.

library(dplyr)
library(microbenchmark)

# dplyr approach: the quantile is recomputed inside ifelse() for every column
dplyr_res <- DT_small %>% mutate_at(vars(matches("_E")),
                                    funs(ifelse(
                                      DT_small$. > quantile(DT_small$., probs = 0.80),
                                      quantile(DT_small$., probs = 0.80),
                                      DT_small$.)))

# more economical column function: compute the quantile once, then cap
fun_col <- function(col) {
  q <- quantile(col, probs = 0.80)
  ifelse(col > q, q, col)
}

sapply_res <- sapply(DT_small[, 2:ncol(DT_small)], fun_col)

# both approaches agree
dplyr_res[, -1] == sapply_res
#>       _E1.1 _E1.2 _E1.3 _E1.4 _E1.5 _E10.1 _E10.2 _E10.3 _E10.4 _E10.5
#>  [1,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [2,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [3,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [4,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [5,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [6,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [7,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [8,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [9,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#> [10,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>       _E100.1
#>  [1,]    TRUE
#>  [2,]    TRUE
#>  [3,]    TRUE
#>  [4,]    TRUE
#>  [5,]    TRUE
#>  [6,]    TRUE
#>  [7,]    TRUE
#>  [8,]    TRUE
#>  [9,]    TRUE
#> [10,]    TRUE

microbenchmark(
  dplyr_res = DT_small %>% mutate_at(vars(matches("_E")),
                                     funs(ifelse(
                                       DT_small$. > quantile(DT_small$., probs = 0.50),
                                       quantile(DT_small$., probs = 0.50),
                                       DT_small$.))),
  sapply_res = sapply(DT_small[, 2:ncol(DT_small)], fun_col))
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq       max neval cld
#>   dplyr_res 12.372519 12.668833 13.577804 12.856150 13.553805 60.220232   100   b
#>  sapply_res  1.808413  1.850595  1.966174  1.874696  1.911037  3.441024   100   a
Most of the gain here likely comes from not redoing the quantile computation, which may account for the bulk of the work. I have not explicitly tested whether sapply() itself is faster than mutate_at().
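
One way to separate the two effects (not run in the original answer) would be to benchmark mutate_at() with the precomputed-quantile helper itself; a sketch assuming the fun_col() defined above:

library(dplyr)
library(microbenchmark)

# If mutate_at() with fun_col lands near the sapply() timing, the gap
# above was mostly recomputation; if it stays slow, the overhead is in
# mutate_at() itself.
microbenchmark(
  mutate_at_fun_col = DT_small %>% mutate_at(vars(matches("_E")), fun_col),
  sapply_fun_col    = sapply(DT_small[, 2:ncol(DT_small)], fun_col))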

Quick example of running it in parallel (only worth it when there are many columns):

parallel::mcmapply(fun_col, DT_small %>% select(-.id))

This depends on having the parallel package installed.
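
A slightly fuller sketch of the parallel route, assuming DT_large has been read in as above (mcmapply() forks, so mc.cores > 1 only helps on Linux/macOS):

library(parallel)
library(dplyr)

# spread the "_E" columns across cores; the result is a matrix with one
# column per input column
capped <- mcmapply(fun_col,
                   DT_large %>% select(matches("_E")),
                   mc.cores = max(1L, detectCores() - 1L))

# write the capped values back over the original columns
DT_large[, grepl("_E", names(DT_large))] <- capped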

Thanks, @gfgm. My goal is not to compute the median. Just edited the question to show what I am actually doing.

Well, if you switch to my approach and swap the median for quantile(x, .8), it is still 7-8 times faster. Will edit my answer.

@Krantz Done. I didn't look at how large your large data is. If it has many, many columns, you may want to consider applying fun_col() to the columns in parallel. Just do sapply(DT_small %>% select(matches("_E")), fun_col). You need to assign the output to the variable you want to edit, e.g. DT_small[, grepl("_E", names(DT_small))] <- sapply(DT_small %>% select(matches("_E")), fun_col).
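
Since the data is a data.table (see the structure() call above), another option, not benchmarked in this thread, is to update the columns by reference with data.table::set(), which avoids copying a large table; a minimal sketch:

library(data.table)

setDT(DT_large)  # ensure it is a data.table (read_csv() returns a tibble)

# cap each "_E" column in place; the quantile is computed once per
# column, mirroring fun_col(), and only the offending rows are touched
cols <- grep("_E", names(DT_large), value = TRUE)
for (col in cols) {
  v <- DT_large[[col]]
  q <- quantile(v, probs = 0.80)
  set(DT_large, i = which(v > q), j = col, value = q)
}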