R: binding two data frames based on the intersection of two columns


Hi guys, I am trying to bind two data frames in R wherever there is a match between two of their columns.

X.1 <- runif(5)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")


fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
Y.1 <- runif(6)
Y.2 <- runif(6)

df <- data.frame(X.1,X.2,as.character(fruit),as.character(month))
df


        X.1        X.2 as.character.fruit. as.character.month.
1 0.08694442 0.67541559               apple             January
2 0.50374582 0.04485657               apple            February
3 0.50482380 0.76090011              banana               March
4 0.75920285 0.61077744              orange               April
5 0.95243661 0.18064744              orange                 May  



df2 <- data.frame(as.character(fruit.second),as.character(month.second),Y.1,Y.2)
df2

  as.character.fruit.second. as.character.month.second.       Y.1       Y.2
1                      apple                    January 0.3407055 0.5740400
2                      apple                   February 0.1529912 0.8163872
3                      apple                      March 0.1042926 0.9807348
4                     banana                    January 0.1031409 0.7961291
5                     orange                      April 0.9537869 0.1840729
6                     orange                        May 0.3158263 0.8856582

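This sounds like a join operation, which is implemented efficiently in, for example, the dplyr package.

There are four or five types of join; check the documentation or the vignette to see which one is right for you. You may need to rename some columns, because the join functions identify matches by column name.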
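As a rough sketch of what that could look like with dplyr: the data below are small stand-ins for the question's data frames (the integer values in X.1 and Y.1 are made up for illustration), and the named by vector avoids renaming the columns of df2. The names inner, left, full and only_df are just illustrative.

```r
library(dplyr)

# Stand-in data with the same key columns as in the question;
# the numeric values are arbitrary placeholders.
df <- data.frame(
  X.1   = 1:5,
  fruit = c("apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "April", "May"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit.second = c("apple", "apple", "apple", "banana", "orange", "orange"),
  month.second = c("January", "February", "March", "January", "April", "May"),
  Y.1          = 11:16,
  stringsAsFactors = FALSE
)

# Named `by`: names are columns of df, values are the matching columns of df2.
keys <- c("fruit" = "fruit.second", "month" = "month.second")

inner   <- inner_join(df, df2, by = keys)  # only (fruit, month) pairs present in both
left    <- left_join(df, df2, by = keys)   # every row of df, NA in Y.1 where unmatched
full    <- full_join(df, df2, by = keys)   # every row of both data frames
only_df <- anti_join(df, df2, by = keys)   # rows of df with no match in df2
```

For the question as asked, inner_join() is probably the one you want, since it keeps only the rows where both fruit and month match.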

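I changed the names of fruit.second and month.second in df2 to fruit and month to make the arguments to merge() simpler, but if you keep the original names you can just as easily call

merge(
  x=df,
  y=df2,
  by.x=c("fruit","month"),
  by.y=c("fruit.second","month.second")
)
instead of what is done below.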

set.seed(1234)
X.1 <- runif(5)
set.seed(2345)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")
##
fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
set.seed(3456)
Y.1 <- runif(6)
set.seed(4567)
Y.2 <- runif(6)
##
df <- data.frame(
  X.1,X.2,
  fruit=as.character(fruit),
  month=as.character(month),
  stringsAsFactors=FALSE)
##
df2 <- data.frame(
  fruit=as.character(fruit.second),
  month=as.character(month.second),
  Y.1,Y.2,
  stringsAsFactors=FALSE)
##
merge(
  df,
  df2,
  by=c("fruit","month")
)
##
   fruit    month       X.1       X.2       Y.1       Y.2
1  apple February 0.6222994 0.1950251 0.7618600 0.7412554
2  apple  January 0.1137034 0.1167435 0.7785807 0.2309186
3 orange    April 0.6233794 0.0344546 0.5071998 0.5996399
4 orange      May 0.8609154 0.4751201 0.7980290 0.2773313
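
By default merge() keeps only the rows that match in both data frames. In case you need one of the other join types, here is a small sketch of how the all, all.x and all.y arguments change the result. The data are stand-ins with made-up integer values, not the random numbers from above.

```r
# Stand-in data with the same key columns; values are arbitrary placeholders.
df <- data.frame(
  fruit = c("apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "April", "May"),
  X.1   = 1:5,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit = c("apple", "apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "January", "April", "May"),
  Y.1   = 11:16,
  stringsAsFactors = FALSE
)
keys <- c("fruit", "month")

inner <- merge(df, df2, by = keys)                # default: matching rows only
left  <- merge(df, df2, by = keys, all.x = TRUE)  # keep every row of df
right <- merge(df, df2, by = keys, all.y = TRUE)  # keep every row of df2
full  <- merge(df, df2, by = keys, all = TRUE)    # keep every row of both

# Unmatched rows get NA in the columns coming from the other data frame.
left[is.na(left$Y.1), ]
```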

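Here is a data.table approach. If your real dataset is large, this will be much faster than using merge(…) with data frames.

Note: the most important part is the four lines at the end of the code. Also note that data.table does not care whether fruit and month are factors.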

set.seed(1)   # for reproducible example
X.1 <- runif(5)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")

fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
Y.1 <- runif(6)
Y.2 <- runif(6)

df <- data.frame(X.1,X.2,fruit,month)
df2 <- data.frame(fruit.second,month.second,Y.1,Y.2)

## This does the work.
library(data.table)
DT1 <- data.table(df,  key="fruit,month")
DT2 <- data.table(df2, key="fruit.second,month.second")
DT1[DT2,nomatch=0]
#     fruit    month       X.1        X.2       Y.1       Y.2
# 1:  apple February 0.3721239 0.94467527 0.1765568 0.9919061
# 2:  apple  January 0.2655087 0.89838968 0.2059746 0.7176185
# 3: orange    April 0.9082078 0.62911404 0.7698414 0.9347052
# 4: orange      May 0.2016819 0.06178627 0.4976992 0.2121425

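A variant of this approach converts df and df2 to data.tables "by reference" with setDT(), which means no copy is made. setkey(…) then sorts them and sets the keys appropriately. After that, df[df2,…] performs the join. Using nomatch=0 excludes rows that have no matching value in the key columns (an inner join, in database terms).

See also merge() for the data frame equivalent. The by-reference data.table version: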
setkey(setDT(df),fruit,month)
setkey(setDT(df2),fruit.second,month.second)
df[df2,nomatch=0]
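
If you would rather not set keys at all, data.table also lets you name the join columns directly with the on argument. This is a different technique from the keyed join above, shown here only as a sketch on stand-in data (the integer values are made up); the names of the on vector are columns of DT1 and its values are the matching columns of DT2.

```r
library(data.table)

# Stand-in data with the same key columns; values are arbitrary placeholders.
DT1 <- data.table(
  fruit = c("apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "April", "May"),
  X.1   = 1:5
)
DT2 <- data.table(
  fruit.second = c("apple", "apple", "apple", "banana", "orange", "orange"),
  month.second = c("January", "February", "March", "January", "April", "May"),
  Y.1          = 11:16
)

# Ad hoc join: no setkey() call, so neither table is reordered.
# nomatch = 0 drops rows of DT2 that have no match in DT1 (inner join).
res <- DT1[DT2,
           on = c(fruit = "fruit.second", month = "month.second"),
           nomatch = 0]
res
```

Because no key is set, the rows of the result follow the order of DT2 rather than being sorted by fruit and month.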