How can I optimize this for loop for larger data in R?
I have some reproducible data; my original data set contains roughly 2,000,000 rows. Because of that, my for loop has become inefficient and takes a very long time to run over that much data. I would like to know whether there is a more efficient way to process it. My code, with the reproducible data, is attached below.
#----Reproducible data example--------------------#
#Upload first data set#
words1<-c("How","did","Quebec","nationalists","see","their","province","as","a","nation","in","the","1960s")
words2<-c("Why","does","volicty","effect","time",'?',NA,NA,NA,NA,NA,NA,NA)
words3<-c("How","do","I","wash","a","car",NA,NA,NA,NA,NA,NA,NA)
library<-c("The","the","How","see","as","a","for","then","than","example")
embedding1<-c(.5,.6,.7,.8,.9,.3,.46,.48,.53,.42)
embedding2<-c(.1,.5,.4,.8,.9,.3,.98,.73,.48,.56)
df <- data.frame(words1,words2,words3)
names(df)<-c("words1","words2","words3")
#--------Upload 2nd dataset-------#
df2 <- data.frame(library,embedding1, embedding2)
names(df2)<-c("library","embedding1","embedding2")
df2$meanembedding=rowMeans(df2[c("embedding1","embedding2")],na.rm=T)
df2<-df2[,-c(2,3)]
#-----Find columns--------#
l=ncol(df)
names<-names(df)
head(names)
classes<-sapply(df[,c(1:l)],class)
head(classes)
#------Combine and match library to training data------#
require(gridExtra)
List = list()
for( name in names){
df1<-df[,name]
df1<-as.data.frame(df1)
x_train2<-merge(x= df1, y = df2,
by.x = "df1", by.y = 'library',all.x=T, sort=F)
x_train2<-x_train2[,-1]
x_train2<-as.data.frame(x_train2)
names(x_train2) <- name
List[[length(List)+1]] = x_train2
}
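A better approach is to use lapply: we loop over the vector names(df), subset and merge, use drop = FALSE to keep the single-column data.frame from being simplified to a vector, and overwrite the column name. The output is a list.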
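A minimal sketch of the lapply approach described above, assuming the df and df2 built in the question. The variable names result, merged and out are my own and are not taken from the original answer. As in the question's loop, merge() with sort = FALSE and all.x = TRUE is used, so the row order of each result is whatever merge() returns rather than the original word order.

```r
# Reproducible data from the question (the lookup column is named "library").
df <- data.frame(
  words1 = c("How", "did", "Quebec", "nationalists", "see", "their",
             "province", "as", "a", "nation", "in", "the", "1960s"),
  words2 = c("Why", "does", "volicty", "effect", "time", "?", rep(NA, 7)),
  words3 = c("How", "do", "I", "wash", "a", "car", rep(NA, 7))
)
df2 <- data.frame(
  library    = c("The", "the", "How", "see", "as", "a", "for", "then", "than", "example"),
  embedding1 = c(.5, .6, .7, .8, .9, .3, .46, .48, .53, .42),
  embedding2 = c(.1, .5, .4, .8, .9, .3, .98, .73, .48, .56)
)
df2$meanembedding <- rowMeans(df2[c("embedding1", "embedding2")], na.rm = TRUE)
df2 <- df2[c("library", "meanembedding")]

# One left join per column of df; each list element is a one-column data.frame.
result <- lapply(names(df), function(x) {
  merged <- merge(df[, x, drop = FALSE], df2,
                  by.x = x, by.y = "library",
                  all.x = TRUE, sort = FALSE)
  out <- merged[, "meanembedding", drop = FALSE]  # keep only the embedding
  names(out) <- x                                 # overwrite the column name
  out
})
```

Each element of result can then be inspected with, for example, result[[1]].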
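P.S.: As @RuiBarradas points out, you technically don't need drop = FALSE if you use df[x] instead of df[, x]. Still, I think it is useful to know about the drop = FALSE option for cases where you need to subset rows and columns at the same time.
When joining large amounts of data, try data.table: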
library( data.table )
dt <- as.data.table( df )
dt2 <- as.data.table ( df2 )
lapply( names(dt), function(x) {
on_expr <- parse( text = paste0( "c( library = \"", x, "\")" ) )
dt2[dt, on = eval( on_expr )][,2]
})
# [[1]]
# meanembedding
# 1: 0.55
# 2: NA
# 3: NA
# 4: NA
# 5: 0.80
# 6: NA
# 7: NA
# 8: 0.90
# 9: 0.30
# 10: NA
# 11: NA
# 12: 0.55
# 13: NA
#
# [[2]]
# meanembedding
# 1: NA
# 2: NA
# 3: NA
# 4: NA
# 5: NA
# 6: NA
# 7: NA
# 8: NA
# 9: NA
# 10: NA
# 11: NA
# 12: NA
# 13: NA
#
# [[3]]
# meanembedding
# 1: 0.55
# 2: NA
# 3: NA
# 4: NA
# 5: 0.30
# 6: NA
# 7: NA
# 8: NA
# 9: NA
# 10: NA
# 11: NA
# 12: NA
# 13: NA
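Your code produces an error: Error in data.frame(words1, words2, words3): arguments imply differing number of rows: 13, 6
I fixed this using NA values, thanks.
For starters, never use base functions as object names! names is a base function, so the line that assigns to names should use a different object name.
If df is a data frame, then so is df[i]; there is no need for two statements, one of which calls as.data.frame. The same goes for x_train2.
Yes, I am working on simplifying the whole loop; there is a lot to optimize.
Worked out a solution with lapply. This should simplify and speed things up considerably.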
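To illustrate the comment about df[i], here is a small sketch on a made-up two-row data frame (the object names col_vec, col_df1 and col_df2 are hypothetical). It only shows which indexing forms keep a data.frame and which drop to a vector.

```r
df <- data.frame(words1 = c("How", "did"), words2 = c("Why", "does"))

col_vec <- df[, "words1"]                # matrix-style indexing drops to a vector
col_df1 <- df["words1"]                  # list-style indexing keeps a data.frame
col_df2 <- df[, "words1", drop = FALSE]  # drop = FALSE also keeps a data.frame

is.data.frame(col_vec)
is.data.frame(col_df1)
is.data.frame(col_df2)
```

With df["words1"] or drop = FALSE, the separate as.data.frame() call in the loop becomes unnecessary.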