R: Speeding up a double for loop


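I am looking for a way to speed up my code. I am working with a data set of roughly 100,000 rows and currently use a double for loop, which I suspect is what slows the code down.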

Example data:

df <- structure(list(name = c("Marcus", "Tina", "Jack", "George"), 
  address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", 
  "London Str."), number = c(1, 1, 20, 15), suffix = c("a", 
  "a", NA, "b"), child = c("Tina", NA, "George", NA)), .Names = c("name", 
  "address", "number", "suffix", "child"), row.names = c(NA, -4L
  ), class = "data.frame")

Example DataFrame:
     name       address      number   suffix   child
1    Marcus     Oxford Str.  1        a        Tina
2    Tina       Oxford Str.  1        a     
3    Jack       Waterloo Sq. 20                George
4    George     London Str.  15       b        
My current code:

df$output = 0
n = NROW(df)

for(i in 1:n) {
 childID = df[i,5]
 address = df[i,2]
 number = df[i,3]
 suffix = df[i,4]
   for(j in 1:n) {
       if((childID %in% df[j,1])&(address %in% df[j,2])&(number %in% df[j,3])
         &(suffix %in% df[j,4]))
           (df[i,6] = 1)
    }
}

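I also tried writing this in C++ with Rcpp. That works too, but it is still slow. Are there any ideas for speeding this up, or should I just accept that it takes a while to run?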

I would try concatenating the address fields and then using match, as follows:

# recreate your input (I put NAs where you have blanks)
DF <- 
data.frame(name=c('Marcus','Tina','Jack','George'),
           address=c('Oxford Str.','Oxford Str.','Waterloo Sq.','London Str.'),
           number=c(1,1,20,15),
           suffix=c('a','a',NA,'b'),
           child=c('Tina',NA,'George',NA))

# create a single character address by concatenating address,number and suffix
fulladdr <- paste(DF$address,DF$number,DF$suffix,sep='||')
# initialize output to 0
DF$output <- 0
# set 1 where concatenated addresses match
DF$output[fulladdr[match(DF$child,DF$name)] == fulladdr] <- 1

> DF
    name      address number suffix  child output
1 Marcus  Oxford Str.      1      a   Tina      1
2   Tina  Oxford Str.      1      a   <NA>      0
3   Jack Waterloo Sq.     20   <NA> George      0
4 George  London Str.     15      b   <NA>      0
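
To check that the vectorised version agrees with the original double loop, something like the following could be used. It is only a sketch: flag_loop and flag_match are hypothetical helper names that appear in neither the question nor the answer, and the data is the four-row sample with blanks entered as NA. One caveat to keep in mind: match() returns only the first row with a given name, so the two versions can disagree when names are not unique.

```r
# Sample data from the question (blanks entered as NA)
df <- data.frame(
  name    = c("Marcus", "Tina", "Jack", "George"),
  address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str."),
  number  = c(1, 1, 20, 15),
  suffix  = c("a", "a", NA, "b"),
  child   = c("Tina", NA, "George", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the question's double loop, returning the flag vector
flag_loop <- function(d) {
  n <- nrow(d)
  out <- numeric(n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (d$child[i] %in% d$name[j] && d$address[i] %in% d$address[j] &&
          d$number[i] %in% d$number[j] && d$suffix[i] %in% d$suffix[j]) {
        out[i] <- 1
      }
    }
  }
  out
}

# Hypothetical helper: the match() approach from the answer above
flag_match <- function(d) {
  key <- paste(d$address, d$number, d$suffix, sep = "||")
  hit <- key[match(d$child, d$name)] == key  # NA where a row has no child
  as.numeric(!is.na(hit) & hit)
}

same <- identical(flag_loop(df), flag_match(df))
print(same)
```

Running the same comparison on a few thousand rows of the real data before dropping the loop would show whether duplicate names are an issue.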
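I have implemented a data.table solution. For this particular data set it is slower than @digEmAll's solution, but it may still be helpful. I have also included a small benchmark, which is not really meaningful on such a small data set, so please test it on a larger one.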

library(data.table)
name = c("Marcus", "Tina", "Jack", "George")
address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str.")
number = c(1, 1, 20, 15)
suffix = c("a", "a", "", "b")
child = c("Tina", "", "George", "")

dt <- data.table(name
                 , address
                 ,number
                 ,suffix
                 ,child
                 )
dt[, FullAddr := paste0(address, " " , number, suffix)]
dt[ FullAddr[match(child,name)] == FullAddr, output := 1  ]

dt[is.na(output), output := 0]
dt
   name      address number suffix  child        FullAddr output
1: Marcus  Oxford Str.      1      a   Tina  Oxford Str. 1a      1
2:   Tina  Oxford Str.      1      a         Oxford Str. 1a      0
3:   Jack Waterloo Sq.     20        George Waterloo Sq. 20      0
4: George  London Str.     15      b        London Str. 15b      0

library(microbenchmark)

microbenchmark(
        a = {dt[ FullAddr[match(child,name)] == FullAddr, output := 1  ]}
        , b= {df$output = 0
        n = NROW(df)

        for(i in 1:n) {
                childID = df[i,5]
                address = df[i,2]
                number = df[i,3]
                suffix = df[i,4]
                for(j in 1:n) {
                        if((childID %in% df[j,1])&(address %in% df[j,2])&(number %in% df[j,3])
                           &(suffix %in% df[j,4]))
                                (df[i,6] = 1)
                }
        }}
        , c = df$output[fulladdr[match(df$child,df$name)] == fulladdr] <- 1

       , times = 100L

)

    Unit: microseconds
 expr       min        lq        mean     median         uq        max neval cld
    a   298.842   348.347   427.59415   413.6995   489.4665    903.467   100  a 
    b 15042.275 15494.461 17983.16735 15864.5405 16257.7130 162306.656   100   b
    c    39.847    46.487    58.82731    59.1655    64.7495    165.420   100  a 
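
Since the timings above come from four rows, here is a rough sketch of how the match() approach could be timed at roughly the size mentioned in the question. Everything about the synthetic data is an assumption (the name sim, the street names, the share of rows with a child), so the result only indicates the order of magnitude; no timing is quoted here because it depends on the machine.

```r
set.seed(1)
n <- 1e5  # about the size mentioned in the question

# Synthetic data, made up purely for timing purposes
sim <- data.frame(
  name    = paste0("person", seq_len(n)),
  address = paste("Street", sample(1000, n, replace = TRUE)),
  number  = sample(50, n, replace = TRUE),
  suffix  = sample(c("a", "b", NA), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Roughly half of the rows get a child drawn from the names, the rest get NA
sim$child <- sample(c(sim$name, rep(NA, n)), n)

timing <- system.time({
  key <- paste(sim$address, sim$number, sim$suffix, sep = "||")
  hit <- key[match(sim$child, sim$name)] == key
  sim$output <- as.numeric(!is.na(hit) & hit)
})
print(timing)
```

To time real data, replace sim with your own data frame.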
Here is a solution based on the hashmap mentioned in the comments:

df <- read.csv(text = 'name,address,number,suffix,child
Marcus,Oxford Str.,1,a,Tina
Tina,Oxford Str.,1,a,     
Jack,Waterloo Sq.,20,,George
George,London Str.,15,b,', stringsAsFactors = FALSE)
df

library(hashmap)
address <- paste(df$address, df$number, df$suffix)
name_address <- hashmap(df$name, address)
child_address <- name_address[[df$child]]
output <- as.integer(child_address == address)
output <- ifelse(is.na(output), '', as.character(output))              

df$output <- output
df
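
If installing the hashmap package is not an option, the same two-pass idea can be sketched in base R with an environment used as a hash table. This is an illustration under assumptions, not the method from the answer: the sample data is entered with NA for blanks, and the output is coded as 0/1 instead of blank strings.

```r
df <- data.frame(
  name    = c("Marcus", "Tina", "Jack", "George"),
  address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str."),
  number  = c(1, 1, 20, 15),
  suffix  = c("a", "a", NA, "b"),
  child   = c("Tina", NA, "George", NA),
  stringsAsFactors = FALSE
)

address <- paste(df$address, df$number, df$suffix)

# First pass: store name -> full address in a hashed environment
name_address <- new.env(hash = TRUE)
for (i in seq_len(nrow(df))) {
  assign(df$name[i], address[i], envir = name_address)
}

# Second pass: look up each child's address; rows without a child stay NA
has_child <- !is.na(df$child) & nzchar(df$child)
child_address <- rep(NA_character_, nrow(df))
child_address[has_child] <- vapply(df$child[has_child], function(k) {
  if (exists(k, envir = name_address, inherits = FALSE)) {
    get(k, envir = name_address, inherits = FALSE)
  } else {
    NA_character_
  }
}, character(1))

output <- as.integer(!is.na(child_address) & child_address == address)
print(output)
```

As with match(), a name that occurs more than once keeps only one address here (the last one assigned).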

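Could you add your data to the question so it is easier to tinker with?

You could use a different data structure: make one pass over all rows and insert each row into a hash map. In a second loop over all rows you can then look up the child row in constant time. That should give you O(N) instead of O(N^2). See here: getting a hash table for R.

What is the point of the double loop? Do you want to check whether the child value exists in the name column of the data, or also whether address and the other variables match?

You could also simplify this using merge, and %in% can be replaced with ==, since you are only comparing one value with another.

This is great. Thank you very much!

I think a benchmark only makes sense on a data set of realistic size.

Sure, but I wanted him to try it himself with his own data set.

That's fine. I just wanted the answer to mention that a realistic benchmark needs a realistic data size.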
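
The merge suggestion could look roughly like the sketch below. It is one possible reading of that comment, not code posted in the thread; row_id, lookup and res are names introduced here for illustration. It relies on merge() treating NA keys as matching each other by default, which mirrors the NA %in% NA behaviour of the original loop.

```r
df <- data.frame(
  name    = c("Marcus", "Tina", "Jack", "George"),
  address = c("Oxford Str.", "Oxford Str.", "Waterloo Sq.", "London Str."),
  number  = c(1, 1, 20, 15),
  suffix  = c("a", "a", NA, "b"),
  child   = c("Tina", NA, "George", NA),
  stringsAsFactors = FALSE
)
df$row_id <- seq_len(nrow(df))  # merge() may reorder rows, so remember the order

# Lookup table: every distinct person with their address, keyed as "child"
lookup <- unique(df[, c("name", "address", "number", "suffix")])
names(lookup)[1] <- "child"
lookup$output <- 1

# Left join: a row gets output = 1 only if its child lives at the same address
res <- merge(df, lookup,
             by = c("child", "address", "number", "suffix"),
             all.x = TRUE)
res$output[is.na(res$output)] <- 0

# Restore the original row order and column layout
res <- res[order(res$row_id),
           c("name", "address", "number", "suffix", "child", "output")]
rownames(res) <- NULL
print(res)
```

Because the lookup table is de-duplicated on all four key columns, the join should not multiply rows even when a name appears more than once.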
Output of the hashmap solution above:
> df
    name      address number suffix  child output
1 Marcus  Oxford Str.      1      a   Tina      1
2   Tina  Oxford Str.      1      a              
3   Jack Waterloo Sq.     20        George      0
4 George  London Str.     15      b