
Merging large datasets in R and flagging non-matches


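I am trying to join several datasets in succession and flag the observations in the first dataset that have no match in the subsequent datasets. Below is an example in which I simulate the original dataset plus three additional datasets to join. The current code does what I want, but it is very inefficient: for large datasets it can take days. Is it possible to do this with apply or some other function?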

#Toy datasets: x, y, z and w

#dataset X
id <- c(1:10, 1:100)
X1 <- rnorm(110, mean = 0, sd = 1)
year <- c("2004","2005","2006","2001","2002") 
year <- rep(year, 22)

month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 11)

x <- data.frame(id, X1, month, year)

#dataset Y
id2 <- c(1:10, 41:110)
Y1 <- rnorm(80, mean = 0 , sd = 1)
year <- c("2004","2005","2006","2001") 
year <- rep(year, 20)

month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 8)

y <- data.frame(id2,Y1, year,month)


#dataset z 
id3 = c(1:60, 401:10000)
Z1 = rpois(9660, 10) 
year = c('2004','2005','2006','2002')
year = rep(year, 2415)

month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 966)

z = data.frame(id3,Z1,year,month)

#dataset w
id4 = c(1:300, 20:29)
W1 = rnorm(310, 20, 36)
year = c('2004','2005','2006','2000','2002')
year = rep(year, 62)

month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 31)

w = data.frame(id4, W1, year, month)


x$id2 = x$yflag = x$zflag = x$wflag = rep(NA, nrow(x))


y.index = rep(NA, nrow(x))
z.index = rep(NA, nrow(x))
w.index = rep(NA, nrow(x))

for(i in 1:nrow(x)) {

  #compare to dataset y, insert yflag == 1 if the same ID, month, year is in x, otherwise 0 
  y.index = which(as.character(y$id2) == as.character(x$id[i]) 
                     & as.character(y$year) == as.character(x$year[i])
                     & as.character(y$month) == as.character(x$month[i])) 
  x$yflag[i] = ifelse(length(y.index) > 0, 1, 0)
  x$id2[i] = ifelse(length(y.index) == 1, y$id2[y.index], x$id[i])

  ## compare to dataset z, insert zflag == 1 if the same ID, month, year is in x, otherwise 0
  z.index <- which(as.character(z$id3) == as.character(x$id[i])
                   & as.character(z$month) == as.character(x$month[i])
                   & as.character(z$year) == as.character(x$year[i]))
  x$zflag[i] <- ifelse(length(z.index) > 0, 1, 0)


  ## compare to dataset w, insert wflag == 1 if the same ID, month, year is in x, otherwise 0
  w.index <- which(as.character(w$id4) == as.character(x$id[i]) 
                   & as.character(w$month) == as.character(x$month[i])
                   & as.character(w$year) == as.character(x$year[i]))
  x$wflag[i] <- ifelse(length(w.index) > 0, 1, 0)
}

print(x)
One of many possible solutions, to be run after creating all four data frames:

x$match.idx <- do.call(paste, c(x[,c("id", "month", "year")], sep=":"))
y$match.idx <- do.call(paste, c(y[,c("id2", "month", "year")], sep=":"))
z$match.idx <- do.call(paste, c(z[,c("id3", "month", "year")], sep=":"))
w$match.idx <- do.call(paste, c(w[,c("id4", "month", "year")], sep=":"))

xy.m <- match(x$match.idx, y$match.idx)
xz.m <- match(x$match.idx, z$match.idx)
xw.m <- match(x$match.idx, w$match.idx)
x$yflag <- x$zflag <- x$wflag <- 0
x$yflag[which(!is.na(xy.m))] <- 1
x$zflag[which(!is.na(xz.m))] <- 1
x$wflag[which(!is.na(xw.m))] <- 1

x <- subset(x, select=-c(match.idx))
> head(x)

  id         X1 month year wflag zflag yflag
1  1 -0.2470932   Jul 2004     1     1     1
2  2  0.2262816   Aug 2005     1     1     1
3  3  0.8473442   Sep 2006     1     1     1
4  4  0.9338628   Oct 2001     0     0     1
5  5 -0.1385540   Nov 2002     1     0     0
6  6  0.7825385   Dec 2004     1     0     0
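The same key-building idea can be wrapped in a small helper so that each additional dataset needs only one more list entry. This is a minimal sketch, not part of the original answer: the names make_key and flag_matches are hypothetical, the data below is made up for illustration, and it assumes every data frame has columns called month and year.

```r
# Hypothetical helper: build an "id:month:year" key for one data frame.
make_key <- function(df, id_col) {
  paste(df[[id_col]], df[["month"]], df[["year"]], sep = ":")
}

# Hypothetical helper: add one 0/1 flag column to `base` per lookup table.
# `others` is a named list; each entry holds a data frame and its id column.
flag_matches <- function(base, base_id, others) {
  base_key <- make_key(base, base_id)
  for (nm in names(others)) {
    other <- others[[nm]]
    base[[paste0(nm, "flag")]] <-
      as.integer(base_key %in% make_key(other$data, other$id))
  }
  base
}

# Tiny made-up data, only to illustrate the call
x <- data.frame(id = c(1, 2, 3),
                month = c("Jul", "Aug", "Sep"),
                year = c("2004", "2005", "2006"))
y <- data.frame(id2 = c(1, 3),
                month = c("Jul", "Oct"),
                year = c("2004", "2006"))
z <- data.frame(id3 = c(2, 3),
                month = c("Aug", "Sep"),
                year = c("2005", "2006"))

flagged <- flag_matches(
  x, "id",
  list(y = list(data = y, id = "id2"),
       z = list(data = z, id = "id3"))
)
print(flagged)
```

Because the key is compared with %in%, a row is flagged when at least one matching row exists in the lookup table, however many there are.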

I would suggest combining %in% and interaction() as follows:

output <- within(x, {
    temp <- interaction(id, month, year) # Something to match to
    # The actual matching takes place here
    # The `+0` at the end is a lazy way to convert
    #   TRUE and FALSE logical values to numeric 1 and 0
    wflag <- temp %in% with(w, interaction(id4, month, year)) + 0
    zflag <- temp %in% with(z, interaction(id3, month, year)) + 0
    yflag <- temp %in% with(y, interaction(id2, month, year)) + 0
    # Remove the temp variable that we created 
    #   since it's no longer required.
    rm(temp)
})

head(output)
#   id          X1 month year yflag zflag wflag
# 1  1 -0.03595218   Jul 2004     1     1     1
# 2  2  0.56329165   Aug 2005     1     1     1
# 3  3  0.74372988   Sep 2006     1     1     1
# 4  4  1.49634088   Oct 2001     1     0     0
# 5  5  0.23107131   Nov 2002     0     0     1
# 6  6  0.15121196   Dec 2004     0     0     1
tail(output)
#      id         X1 month year yflag zflag wflag
# 105  95 -0.0911546   Nov 2002     0     0     1
# 106  96 -0.4140724   Dec 2004     0     0     1
# 107  97 -0.1477702   Jan 2005     0     0     1
# 108  98 -0.3164388   Feb 2006     0     0     1
# 109  99 -0.5082118   Mar 2001     0     0     0
# 110 100 -0.6072856   Apr 2002     0     0     1
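Both answers rely on a composite key, built once with paste() and once with interaction(). The sketch below uses made-up data (not the question's datasets) to check that the two keys yield the same flags. One relevant difference: by default interaction() creates a factor level for every combination of its inputs, which may be part of why a commenter found paste() faster.

```r
# Made-up data, only to compare the two ways of building a key
x <- data.frame(id = c(1, 2, 3, 4),
                month = c("Jul", "Aug", "Sep", "Oct"),
                year = c("2004", "2005", "2006", "2001"))
w <- data.frame(id4 = c(1, 2, 4),
                month = c("Jul", "Sep", "Oct"),
                year = c("2004", "2005", "2001"))

# Key built with paste(): a plain character vector
flag_paste <- paste(x$id, x$month, x$year, sep = ":") %in%
  paste(w$id4, w$month, w$year, sep = ":")

# Key built with interaction(): a factor; %in% compares factors by label
flag_inter <- interaction(x$id, x$month, x$year) %in%
  interaction(w$id4, w$month, w$year)

# Both approaches should agree row by row
stopifnot(identical(flag_paste, flag_inter))

# Convert TRUE/FALSE to 1/0, as in the answer above
wflag <- flag_paste + 0
print(wflag)
```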

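Have you tried merge()?

merge does not flag the observations correctly; it discards information, in the sense that I cannot see an equivalent of these flags. For example, test.merge=merge(x,y,by.x='id',by.y='id2'). Of course, I may not have implemented it correctly.

If you try the match() function, match returns the positions of the matches between x and y, for example test=match(x$id,y$id2), but that does not flag the observations any better. Also, match does not accept multiple "IDs", so you cannot use the month and year information.

One thing you can do is add stringsAsFactors=FALSE to each data.frame() call (x, y, z and w). After that, you can call, for example,
which(y$id2==x$id[i]&y$year==x$year[i]&y$month==x$month[i])
instead of
which(as.character(y$id2)==as.character(x$id[i])&as.character(y$year)==as.character(x$year[i])&as.character(y$month)==as.character(x$month[i]))

Good idea to combine the index columns with paste().

+1 Nice. paste() is actually much faster than the interaction() I used.

Here is a data.table approach, assuming you have created data.tables from all of your data.frames (e.g. DTx, DTy...). I find the syntax clearer: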
temp
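On the merge() suggestion raised in the comments: one way to obtain flags from a merge is to give the lookup table its own indicator column before a left join. The following is only a sketch on made-up data, not code from the thread. It assumes the join should use id, month and year together, and it reduces y to its distinct keys first so that the join cannot duplicate rows of x.

```r
# Made-up data; y deliberately contains a repeated key
x <- data.frame(id = c(1, 2, 3),
                month = c("Jul", "Aug", "Sep"),
                year = c("2004", "2005", "2006"),
                X1 = c(0.1, 0.2, 0.3))
y <- data.frame(id2 = c(1, 3, 3),
                month = c("Jul", "Sep", "Sep"),
                year = c("2004", "2006", "2006"),
                Y1 = c(10, 20, 30))

# Keep only the distinct keys of y and attach an indicator column
y_keys <- unique(y[, c("id2", "month", "year")])
y_keys$yflag <- 1

# Left join: all.x = TRUE keeps every row of x
merged <- merge(x, y_keys,
                by.x = c("id", "month", "year"),
                by.y = c("id2", "month", "year"),
                all.x = TRUE)

# Rows of x without a partner in y get NA from the join; turn those into 0
merged$yflag[is.na(merged$yflag)] <- 0
print(merged)
```

Note that merge() may reorder the rows of x, so the paste()/match() approach is the simpler choice when the original row order has to be preserved.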