
How to efficiently mutate multiple columns of a large data frame in R

Tags: r, performance, function, datatable, tidyr

I would be grateful for help in efficiently applying my function to multiple columns of my large data frame DT_large.

When I apply my function with dplyr::mutate_at() to a small data frame DT_small, it works well and is efficient. However, when applied to my relatively large dataset, it takes several hours to deliver the desired output.

This may be because of some mistake in my code that makes dplyr::mutate_at() inefficient on a relatively large dataset. Or it may be that dplyr::mutate_at() is simply not efficient for relatively large datasets like mine.

Either way, I would appreciate any help to resolve my problem, that is, a faster way to correctly apply my function to DT_large and deliver the desired output, just as it does when applied to DT_small.

# Small dataset

DT_small<-structure(list(.id = 1:10, `_E1.1` = c(0.475036902, 0.680123015, 
0.896920608, 0.329908621, 0.652288128, 0.408813318, 0.486444822, 
0.429333778, 2.643293032, 0.782194143), `_E1.2` = c(79.22653114, 
0.680123015, 4.088529776, 0.232076989, 0.652288128, 0.329908621, 
0.486444822, 0.429333778, 2.643293032, 0.963554482), `_E1.3` = c(0.466755502, 
0.680123015, 0.461887024, 1.236938197, 0.652288128, 0.408813318, 
0.486444822, 0.429333778, 2.643293032, 0.95778584), `_E1.4` = c(1.608298119, 
0.680123015, 0.578464999, 0.317125521, 0.652288128, 0.408813318, 
0.486444822, 0.429333778, 2.643293032, 2.125841957), `_E1.5` = c(0.438424932, 
0.680123015, 0.896920608, 0.366118007, 0.652288128, 1.007079029, 
0.486444822, 0.429333778, 2.643293032, 0.634134022), `_E10.1` = c(0.45697607, 
0.647681721, 1.143509029, 0.435735621, 0.49400961, 0.501421816, 
0.461123723, 0.568477247, 1.756598213, 0.67895017), `_E10.2` = c(35.30312978, 
0.647681721, 2.58357783, 0.25514789, 0.49400961, 0.435735621, 
0.461123723, 0.568477247, 1.756598213, 0.776970116), `_E10.3` = c(0.79477661, 
0.647681721, 0.672430959, 0.886991224, 0.49400961, 0.501421816, 
0.461123723, 0.568477247, 1.756598213, 1.019701072), `_E10.4` = c(1.912254794, 
0.647681721, 0.840757508, 0.414669983, 0.49400961, 0.501421816, 
0.461123723, 0.568477247, 1.756598213, 1.576577576), `_E10.5` = c(0.429335115, 
0.647681721, 1.143509029, 0.336512868, 0.49400961, 0.82434125, 
0.461123723, 0.568477247, 1.756598213, 0.639407175), `_E100.1` = c(0.567579678, 
0.780423094, 1.739967261, 0.282217304, 0.784904687, 0.319146371, 
0.585056235, 0.596494912, 3.545358563, 0.899595619)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))
# Large dataset
# 1) download DT_large.csv to your directory from https://jmp.sh/iC6WOzw
# 2) DT_large <- read_csv("DT_large.csv")

# My function applied to my small dataset
# This perfectly delivers the desired output in seconds:
DT_small %>% mutate_at(vars(matches("_E")),
                       funs(ifelse(
                         DT_small$. > quantile(DT_small$., probs = 0.80),
                         quantile(DT_small$., probs = 0.80),
                         DT_small$.)))
# My function applied to my large dataset
# This takes several hours to deliver the desired output:
DT_large %>% mutate_at(vars(matches("_E")),
                       funs(ifelse(
                         DT_large$. > quantile(DT_large$., probs = 0.80),
                         quantile(DT_large$., probs = 0.80),
                         DT_large$.)))
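
For reference, funs() is deprecated in current dplyr; the same capping can be written with across() and pmin(). A minimal sketch, assuming dplyr >= 1.0 (not part of the original question):

library(dplyr)

# .x is the current column, so each "_E" column is capped at its own
# 80th-percentile value, computed once per column
DT_small %>%
  mutate(across(matches("_E"),
                ~ pmin(.x, quantile(.x, probs = 0.80))))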

Thanks in advance for your help.

You can get a considerable speedup by (1) calculating the quantile only once and (2) applying a new, more economical function to the columns.

On my machine this approach is about 15x faster.

library(dplyr)
library(microbenchmark)

# dplyr approach: the quantile is recomputed inside ifelse() for every column
dplyr_res <- DT_small %>% mutate_at(vars(matches("_E")),
                                    funs(ifelse(
                                      DT_small$. > quantile(DT_small$., probs = 0.80),
                                      quantile(DT_small$., probs = 0.80),
                                      DT_small$.)))

# more economical column function: compute the quantile once, then cap
fun_col <- function(col) {
  q <- quantile(col, probs = 0.80)
  ifelse(col > q, q, col)
}

sapply_res <- sapply(DT_small[, 2:ncol(DT_small)], fun_col)

# both approaches agree
dplyr_res[, -1] == sapply_res
#>       _E1.1 _E1.2 _E1.3 _E1.4 _E1.5 _E10.1 _E10.2 _E10.3 _E10.4 _E10.5
#>  [1,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [2,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [3,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [4,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [5,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [6,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [7,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [8,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>  [9,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#> [10,]  TRUE  TRUE  TRUE  TRUE  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>       _E100.1
#>  [1,]    TRUE
#>  [2,]    TRUE
#>  [3,]    TRUE
#>  [4,]    TRUE
#>  [5,]    TRUE
#>  [6,]    TRUE
#>  [7,]    TRUE
#>  [8,]    TRUE
#>  [9,]    TRUE
#> [10,]    TRUE

microbenchmark(
  dplyr_res = DT_small %>% mutate_at(vars(matches("_E")),
                                     funs(ifelse(
                                       DT_small$. > quantile(DT_small$., probs = 0.50),
                                       quantile(DT_small$., probs = 0.50),
                                       DT_small$.))),
  sapply_res = sapply(DT_small[, 2:ncol(DT_small)], fun_col))
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq       max neval cld
#>   dplyr_res 12.372519 12.668833 13.577804 12.856150 13.553805 60.220232   100   b
#>  sapply_res  1.808413  1.850595  1.966174  1.874696  1.911037  3.441024   100   a
Most of the gain here likely comes from not redoing the quantile computation, which may account for the bulk of the work. I have not explicitly tested whether sapply() itself is faster than mutate_at().
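
One way to separate the two effects (not run in the original answer) would be to benchmark mutate_at() with the precomputed-quantile helper itself; a sketch assuming the fun_col() defined above:

library(dplyr)
library(microbenchmark)

# If mutate_at() with fun_col lands near the sapply() timing, the gap
# above was mostly recomputation; if it stays slow, the overhead is in
# mutate_at() itself.
microbenchmark(
  mutate_at_fun_col = DT_small %>% mutate_at(vars(matches("_E")), fun_col),
  sapply_fun_col    = sapply(DT_small[, 2:ncol(DT_small)], fun_col))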

Quick example of running it in parallel (only worth it when there are many columns):

parallel::mcmapply(fun_col, DT_small %>% select(-.id))

This depends on having the parallel package installed.
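
A slightly fuller sketch of the parallel route, assuming DT_large has been read in as above (mcmapply() forks, so mc.cores > 1 only helps on Linux/macOS):

library(parallel)
library(dplyr)

# spread the "_E" columns across cores; the result is a matrix with one
# column per input column
capped <- mcmapply(fun_col,
                   DT_large %>% select(matches("_E")),
                   mc.cores = max(1L, detectCores() - 1L))

# write the capped values back over the original columns
DT_large[, grepl("_E", names(DT_large))] <- capped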

Thanks, @gfgm. My goal is not to compute the median. Just edited the question to show what I am actually doing.

Well, if you switch to my approach and swap the median for quantile(x, .8), it is still 7-8 times faster. Will edit my answer.

@Krantz Done. I didn't look at how large your large data is. If it has many, many columns, you may want to consider applying fun_col() to the columns in parallel. Just do sapply(DT_small %>% select(matches("_E")), fun_col). You need to assign the output to the variable you want to edit, e.g. DT_small[, grepl("_E", names(DT_small))] <- sapply(DT_small %>% select(matches("_E")), fun_col).
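
Since the data is a data.table (see the structure() call above), another option, not benchmarked in this thread, is to update the columns by reference with data.table::set(), which avoids copying a large table; a minimal sketch:

library(data.table)

setDT(DT_large)  # ensure it is a data.table (read_csv() returns a tibble)

# cap each "_E" column in place; the quantile is computed once per
# column, mirroring fun_col(), and only the offending rows are touched
cols <- grep("_E", names(DT_large), value = TRUE)
for (col in cols) {
  v <- DT_large[[col]]
  q <- quantile(v, probs = 0.80)
  set(DT_large, i = which(v > q), j = col, value = q)
}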