R: Speeding up a double for loop
I am looking for a way to speed up my code. I am working with a dataset of about 100,000 rows and currently use a double for loop, which I suspect is what slows the code down.
Example data:
df <- structure(list(name = c("Marcus", "Tina", "Jack", "George"),
    address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str."),
    number = c(1, 1, 20, 15),
    suffix = c("a", "a", NA, "b"),
    child = c("Tina", NA, "George", NA)),
    .Names = c("name", "address", "number", "suffix", "child"),
    row.names = c(NA, -4L), class = "data.frame")
Example DataFrame:
    name      address number suffix  child
1 Marcus  Oxford Str.      1      a   Tina
2   Tina  Oxford Str.      1      a
3   Jack Waterloo Sq.     20        George
4 George  London Str.     15      b
My current code:
df$output = 0
n = NROW(df)
for(i in 1:n) {
  childID = df[i,5]
  address = df[i,2]
  number = df[i,3]
  suffix = df[i,4]
  for(j in 1:n) {
    if((childID %in% df[j,1])&(address %in% df[j,2])&(number %in% df[j,3])
       &(suffix %in% df[j,4]))
      (df[i,6] = 1)
  }
}
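I also tried writing it in C++ with Rcpp. That works too, but it is still slow. Are there any ideas for speeding this up, or should I just accept that it takes some time to run?
I would try concatenating the address fields and then using match, as follows: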
# recreate your input (I put NAs where you have blanks)
DF <-
data.frame(name=c('Marcus','Tina','Jack','George'),
address=c('Oxford Str.','Oxford Str.','Waterloo Sq.','London Str.'),
number=c(1,1,20,15),
suffix=c('a','a',NA,'b'),
child=c('Tina',NA,'George',NA))
# create a single character address by concatenating address,number and suffix
fulladdr <- paste(DF$address,DF$number,DF$suffix,sep='||')
# initialize output to 0
DF$output <- 0
# set 1 where concatenated addresses match
DF$output[fulladdr[match(DF$child,DF$name)] == fulladdr] <- 1
> DF
    name      address number suffix  child output
1 Marcus  Oxford Str.      1      a   Tina      1
2   Tina  Oxford Str.      1      a   <NA>      0
3   Jack Waterloo Sq.     20   <NA> George      0
4 George  London Str.     15      b   <NA>      0
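To see why the one-liner works, it may help to look at the intermediate values. The sketch below rebuilds the same DF (with stringsAsFactors = FALSE added as a precaution for older R versions) and splits the expression into steps. The names child_row, child_addr and same are introduced here purely for illustration.

```r
DF <- data.frame(
  name    = c("Marcus", "Tina", "Jack", "George"),
  address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str."),
  number  = c(1, 1, 20, 15),
  suffix  = c("a", "a", NA, "b"),
  child   = c("Tina", NA, "George", NA),
  stringsAsFactors = FALSE
)

# One key per row built from address, number and suffix
fulladdr <- paste(DF$address, DF$number, DF$suffix, sep = "||")

# Row index of each child in the name column: 2 NA 4 NA
child_row <- match(DF$child, DF$name)

# Address key of the child's row (NA where there is no child)
child_addr <- fulladdr[child_row]

# Compare the child's key with the row's own key: TRUE NA FALSE NA
same <- child_addr == fulladdr

# Treat NA (no child) as "no match": 1 0 0 0
output <- as.integer(same %in% TRUE)
```

Using same %in% TRUE turns the NA entries into FALSE, so rows without a child get 0 without relying on how NA is handled in the subscript assignment.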
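I have implemented a data.table solution. For this particular dataset it is slower than @digEmAll's solution, but it may still be helpful.
I have also included a small benchmark. It is not really meaningful on such a small dataset, so please test it on a larger one.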
library(data.table)
name = c("Marcus", "Tina", "Jack", "George")
address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str.")
number = c(1, 1, 20, 15)
suffix = c("a", "a", "", "b")
child = c("Tina", "", "George", "")
dt <- data.table(name, address, number, suffix, child)
dt[, FullAddr := paste0(address, " " , number, suffix)]
dt[ FullAddr[match(child,name)] == FullAddr, output := 1 ]
dt[is.na(output), output := 0]
dt
     name      address number suffix  child        FullAddr output
1: Marcus  Oxford Str.      1      a   Tina  Oxford Str. 1a      1
2:   Tina  Oxford Str.      1      a         Oxford Str. 1a      0
3:   Jack Waterloo Sq.     20        George Waterloo Sq. 20      0
4: George  London Str.     15      b        London Str. 15b      0
library(microbenchmark)
# assumes df (the question's data frame), dt (the data.table above) and
# fulladdr (from the match() answer) already exist in the workspace
microbenchmark(
  a = {dt[ FullAddr[match(child,name)] == FullAddr, output := 1 ]}
  , b = {df$output = 0
    n = NROW(df)
    for(i in 1:n) {
      childID = df[i,5]
      address = df[i,2]
      number = df[i,3]
      suffix = df[i,4]
      for(j in 1:n) {
        if((childID %in% df[j,1])&(address %in% df[j,2])&(number %in% df[j,3])
           &(suffix %in% df[j,4]))
          (df[i,6] = 1)
      }
    }}
  , c = {df$output[fulladdr[match(df$child,df$name)] == fulladdr] <- 1}
  , times = 100L
)
Unit: microseconds
 expr       min        lq        mean     median         uq        max neval cld
    a   298.842   348.347   427.59415   413.6995   489.4665    903.467   100  a
    b 15042.275 15494.461 17983.16735 15864.5405 16257.7130 162306.656   100   b
    c    39.847    46.487    58.82731    59.1655    64.7495    165.420   100  a
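Since timings on four rows say little, here is a sketch of how one might time the match approach on synthetic data of roughly the size mentioned in the question. The data frame big, the street names, the share of rows with a child and the seed are all made up for illustration, and the timing will vary by machine.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n <- 1e5       # roughly the row count mentioned in the question

# Hypothetical synthetic data with unique names
big <- data.frame(
  name    = paste0("person", seq_len(n)),
  address = sample(paste("Street", 1:500), n, replace = TRUE),
  number  = sample(1:50, n, replace = TRUE),
  suffix  = sample(c("a", "b", NA), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# About half of the rows get a child drawn from the names, the rest get NA
big$child <- sample(c(big$name, rep(NA, n)), n)

timing <- system.time({
  fulladdr <- paste(big$address, big$number, big$suffix, sep = "||")
  hit <- fulladdr[match(big$child, big$name)] == fulladdr
  big$output <- as.integer(hit %in% TRUE)   # NA (no child) counts as 0
})
print(timing)
```

The same data frame can be passed to the other approaches to compare them at a realistic size.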
Here is a solution based on the hashmap mentioned in the comments:
df <- read.csv(text = 'name,address,number,suffix,child
Marcus,Oxford Str.,1,a,Tina
Tina,Oxford Str.,1,a,
Jack,Waterloo Sq.,20,,George
George,London Str.,15,b,', stringsAsFactors = FALSE)
df
library(hashmap)
address <- paste(df$address, df$number, df$suffix)
name_address <- hashmap(df$name, address)
child_address <- name_address[[df$child]]
output <- as.integer(child_address == address)
output <- ifelse(is.na(output), '', as.character(output))
df$output <- output
df
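The lookup does not strictly need the hashmap package. As a sketch, assuming the same df and the same pasted address key, a base R environment can play the role of the hash table: one pass inserts every name, a second pass looks up each child. The variable has_child is introduced here for illustration, and unlike the code above this version writes 0 rather than an empty string for rows without a child.

```r
df <- read.csv(text = "name,address,number,suffix,child
Marcus,Oxford Str.,1,a,Tina
Tina,Oxford Str.,1,a,
Jack,Waterloo Sq.,20,,George
George,London Str.,15,b,", stringsAsFactors = FALSE)

address <- paste(df$address, df$number, df$suffix)

# First pass: build the hash table, keyed by name
name_address <- list2env(setNames(as.list(address), df$name), hash = TRUE)

# Second pass: look up only rows that actually have a child
has_child <- !is.na(df$child) & nzchar(df$child)
child_address <- rep(NA_character_, nrow(df))
child_address[has_child] <- unlist(
  mget(df$child[has_child], envir = name_address, ifnotfound = NA_character_),
  use.names = FALSE
)

# 1 where the child's address equals the row's own address, else 0
df$output <- as.integer((child_address == address) %in% TRUE)
df
```

Empty and missing child values are filtered out before the lookup because an empty string is not a valid key in an environment.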
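Could you add the data to the question so that it is easy to tinker with?
You could use a different data structure: make one pass over all rows and insert each row into a hash map. In a second loop over all rows you can then look up the child row in constant time. That should give you O(N) instead of O(N^2). See here for getting hash tables in R.
What is the point of the double loop? Do you want to check whether the child value exists in the name column of the data, or do you also want to check that address and the other variables match?
You could also simplify it with merge, and %in% can be replaced with ==, since you are only comparing one value with another.
This is great. Thank you very much!
I think benchmarks are only meaningful on realistically sized datasets.
Sure, but I wanted him to try it himself on his own dataset.
That is fine. I just wanted the answer to mention that a realistic benchmark needs realistic data sizes.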
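The merge suggestion from the comments could look roughly like this in base R. It is a sketch, assuming missing children are stored as NA as in the question's dput. The helper column row and the data frames lookup and hits are hypothetical names used only to map matches back to the original rows.

```r
df <- data.frame(
  name    = c("Marcus", "Tina", "Jack", "George"),
  address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str."),
  number  = c(1, 1, 20, 15),
  suffix  = c("a", "a", NA, "b"),
  child   = c("Tina", NA, "George", NA),
  stringsAsFactors = FALSE
)

# Remember the original row, because merge() may reorder rows
df$row <- seq_len(nrow(df))

# Each person with their own address fields
lookup <- df[, c("name", "address", "number", "suffix")]

# Keep rows whose child exists as a name with the same address fields
hits <- merge(df, lookup,
              by.x = c("child", "address", "number", "suffix"),
              by.y = c("name", "address", "number", "suffix"))

df$output <- as.integer(df$row %in% hits$row)
df$row <- NULL
df
```

On swapping %in% for ==: NA == NA evaluates to NA while NA %in% NA is TRUE, so the two differ when a suffix is missing on both sides.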
Output of the hashmap solution:
> df
    name      address number suffix  child output
1 Marcus  Oxford Str.      1      a   Tina      1
2   Tina  Oxford Str.      1      a
3   Jack Waterloo Sq.     20        George      0
4 George  London Str.     15      b