How can I optimize this for loop for larger data in R?

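I have some reproducible data; my original data set contains about 2,000,000 rows. Because of that, my for loop becomes inefficient and takes a long time to run on that much data. I would like to know whether there is a more efficient way to process it. I have attached my code with reproducible data.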

#----Reproducible data example--------------------#
#Upload first data set#
words1<-c("How","did","Quebec","nationalists","see","their","province","as","a","nation","in","the","1960s")
words2<-c("Why","does","volicty","effect","time",'?',NA,NA,NA,NA,NA,NA,NA)
words3<-c("How","do","I","wash","a","car",NA,NA,NA,NA,NA,NA,NA)
library<-c("The","the","How","see","as","a","for","then","than","example")
embedding1<-c(.5,.6,.7,.8,.9,.3,.46,.48,.53,.42)
embedding2<-c(.1,.5,.4,.8,.9,.3,.98,.73,.48,.56)
df <- data.frame(words1,words2,words3)
names(df)<-c("words1","words2","words3")

#--------Upload 2nd dataset-------#
df2 <- data.frame(library,embedding1, embedding2)
names(df2)<-c("library","embedding1","embedding2")
df2$meanembedding=rowMeans(df2[c("embedding1","embedding2")],na.rm=T)
df2<-df2[,-c(2,3)]

#-----Find columns--------#
l=ncol(df)
names<-names(df)
head(names)
classes<-sapply(df[,c(1:l)],class)
head(classes)

#------Combine and match library to training data------#
require(gridExtra)
List = list()
for( name in names){
  df1<-df[,name]
  df1<-as.data.frame(df1)
  x_train2<-merge(x= df1, y = df2, 
                  by.x = "df1", by.y = 'library',all.x=T, sort=F)
  x_train2<-x_train2[,-1]
  x_train2<-as.data.frame(x_train2)
  names(x_train2) <- name
  List[[length(List)+1]] = x_train2
}

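A better approach is to use lapply.

We loop over the column names of df, subset and merge, using drop = FALSE to prevent the one-column data.frame from being simplified to a vector, and then overwrite the column name. The output is a list.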
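The code that went with this description is not shown here, so the following is only a minimal sketch reconstructed from the description, not necessarily the answerer's exact code. It rebuilds the question's data in compact form; selecting the result column by the name `meanembedding` and the intermediate names `merged` and `out` are assumptions made for illustration.

```r
# Question's example data, rebuilt compactly.
df <- data.frame(
  words1 = c("How", "did", "Quebec", "nationalists", "see", "their",
             "province", "as", "a", "nation", "in", "the", "1960s"),
  words2 = c("Why", "does", "volicty", "effect", "time", "?", rep(NA, 7)),
  words3 = c("How", "do", "I", "wash", "a", "car", rep(NA, 7)),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  library = c("The", "the", "How", "see", "as", "a", "for", "then", "than", "example"),
  meanembedding = rowMeans(cbind(
    c(.5, .6, .7, .8, .9, .3, .46, .48, .53, .42),
    c(.1, .5, .4, .8, .9, .3, .98, .73, .48, .56)
  )),
  stringsAsFactors = FALSE
)

result <- lapply(names(df), function(x) {
  # drop = FALSE keeps the single column as a data.frame so merge() can use it
  merged <- merge(df[, x, drop = FALSE], df2,
                  by.x = x, by.y = "library", all.x = TRUE, sort = FALSE)
  # keep only the looked-up value and give it the source column's name
  out <- merged[, "meanembedding", drop = FALSE]
  names(out) <- x
  out
})
```

`result` is a list with one single-column data.frame per column of `df`. Note that with `sort = FALSE`, `merge()` documents the row order of its result as unspecified, so the rows need not line up with the rows of `df`.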


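Postscript: as @RuiBarradas pointed out, if you use df[x] instead of df[, x], you technically do not need drop = FALSE. Still, I think it is helpful to know about the drop = FALSE option for cases where you need to subset rows and columns at the same time.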
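To make the difference concrete, here is a small illustration on a made-up data frame (`toy_df` and its columns are hypothetical, not part of the question's data):

```r
toy_df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Matrix-style indexing of a single column drops to a plain vector by default
class(toy_df[, "a"])                    # "integer"

# drop = FALSE keeps the data.frame structure
class(toy_df[, "a", drop = FALSE])      # "data.frame"

# List-style indexing always returns a data.frame
class(toy_df["a"])                      # "data.frame"

# Subsetting rows and columns together: here drop = FALSE is the way to stay a data.frame
class(toy_df[1:2, "a", drop = FALSE])   # "data.frame"
```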

When joining large amounts of data, try data.table.

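library(data.table)

dt  <- as.data.table(df)
dt2 <- as.data.table(df2)

# For each column of dt, look up its words in dt2$library.
# The join keeps dt's row order; words without a match give NA.
lapply(names(dt), function(x) {
  # on = c(library = "<column name>") joins dt2$library to dt[[x]]
  dt2[dt, on = setNames(x, "library")][, .(meanembedding)]
})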

# [[1]]
#     meanembedding
#  1:          0.55
#  2:            NA
#  3:            NA
#  4:            NA
#  5:          0.80
#  6:            NA
#  7:            NA
#  8:          0.90
#  9:          0.30
# 10:            NA
# 11:            NA
# 12:          0.55
# 13:            NA
# 
# [[2]]
#     meanembedding
#  1:            NA
#  2:            NA
#  3:            NA
#  4:            NA
#  5:            NA
#  6:            NA
#  7:            NA
#  8:            NA
#  9:            NA
# 10:            NA
# 11:            NA
# 12:            NA
# 13:            NA
# 
# [[3]]
#     meanembedding
#  1:          0.55
#  2:            NA
#  3:            NA
#  4:            NA
#  5:          0.30
#  6:            NA
#  7:            NA
#  8:            NA
#  9:            NA
# 10:            NA
# 11:            NA
# 12:            NA
# 13:            NA

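Your code produces an error: Error in data.frame(words1, words2, words3): arguments imply differing number of rows: 13, 6

I fixed this using NA values, thanks.

For starters, never use base functions as object names! names is a base function. If df is a data frame, then so is df[i]; there is no need for two instructions, one of which calls as.data.frame. The same goes for x_train2.

Yes, I am working on simplifying the whole loop; there is a lot to optimize. I have worked out a solution with lapply. This will simplify and speed things up considerably.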
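As a rough sketch of what the comments suggest, the loop can be written without masking base function names and without the extra as.data.frame() calls. The data and the names `words`, `lookup`, `term`, `col_names` and `results` below are hypothetical stand-ins, not the question's objects:

```r
# Toy stand-ins for the question's df and df2 (made-up values)
words <- data.frame(w1 = c("a", "b", "c"), w2 = c("c", "z", NA),
                    stringsAsFactors = FALSE)
lookup <- data.frame(term = c("a", "c"), meanembedding = c(0.3, 0.9),
                     stringsAsFactors = FALSE)

col_names <- names(words)                     # avoid calling the object `names`
results <- vector("list", length(col_names))  # preallocate instead of growing the list
names(results) <- col_names

for (nm in col_names) {
  # words[nm] is already a one-column data frame, so as.data.frame() is not needed
  merged <- merge(words[nm], lookup, by.x = nm, by.y = "term",
                  all.x = TRUE, sort = FALSE)
  # keep only the looked-up value, named after the source column
  results[[nm]] <- setNames(merged["meanembedding"], nm)
}
```

This only tidies the loop; for millions of rows the join itself is still the expensive part.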