R - Merge/join one data frame to several data frames separately

Tags: r, dplyr, left-join

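I have seen several threads about merging multiple data frames into one "master" data frame, but I want to take one reference data frame and apply it to several other data frames while keeping those data frames separate. I have tried using lapply and for loops, but have not found a way yet. (Disclaimer: I am new to R.)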

df_geo is the reference data frame and looks like this:

district sector cell    village  village_code
west    sectorA cellA   villageA    XXXXXXXX
west    sectorA cellA   villageB    XXXXXXXX
west    sectorB cellB   villageC    XXXXXXXX
south   sectorC cellC   villageD    XXXXXXXX
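There are three other datasets with many more columns containing information specific to the type of data, i.e. distribution, survey, and tracking. Each dataset has district, sector, cell, and village columns (with the same names). For example: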

> df_distr
v1  district sector cell    village     v2  v3  …
..  west    sectorA cellA   villageA    ..  ..  …
..  west    sectorA cellA   villageB    ..  ..  …
..  west    sectorB cellB   villageC    ..  ..  …
..  south   sectorC cellC   villageD    ..  ..  …

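Each data frame has a different number of columns, and the location variables are not in the same column positions in each one. Every district-sector-cell-village combination is unique, and so is each village's code. I am trying to add a village_code column to each of the three data frames, holding the 8-digit location ID for the record matched on district, sector, cell, and village. Ideally, I would like the column appended to each original data frame (rather than stored in a list). So I would like them to look like this: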

> df_distr
v1  district sector cell    village     v2  v3  …  village_code
..  west    sectorA cellA   villageA    ..  ..  …    XXXXXXXX
..  west    sectorA cellA   villageB    ..  ..  …    XXXXXXXX
..  west    sectorB cellB   villageC    ..  ..  …    XXXXXXXX
..  south   sectorC cellC   villageD    ..  ..  …    XXXXXXXX

> df_survey
v1  v5  v6  district sector  cell   village     v7  …  village_code
..  ..  ..  west    sectorA cellA   villageA    ..  ..   XXXXXXXX
..  ..  ..  west    sectorA cellA   villageB    ..  ..   XXXXXXXX
..  ..  ..  west    sectorB cellB   villageC    ..  ..   XXXXXXXX
..  ..  ..  south   sectorC cellC   villageD    ..  ..   XXXXXXXX

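I can get this to work one data frame at a time with code such as df_distr …

Suppose I have a set of four data frames. reference is a stand-in for df_geo, and the tab* data frames represent the unknown tables you are working with.
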
library(dplyr)  # provides left_join()

reference = data.frame(key = letters[1:10],value = 1:10)
tab1 = data.frame(journey = LETTERS[1:3],key=letters[1:3])
tab2 = data.frame(trip = LETTERS[7:10],key=letters[7:10])
tab3 = data.frame(destination = LETTERS[4:8],key=letters[4:8])
The goal is to join reference to each of the other data frames.

output = lapply(list(tab1=tab1,tab2=tab2,tab3=tab3),left_join,reference,by="key")
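Here I make a named list of the tab* data frames (the names matter) and use lapply to apply the same function to each one, in this case left_join. After naming the function, I can pass any other arguments it needs: here, the table to join (reference) and by = "key" to say how they should be joined.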
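
For the tables in the question, the same pattern would presumably need all four location columns as the join key. The sketch below assumes small made-up versions of df_geo, df_distr and df_survey (the village codes are placeholders, not real IDs) and a hypothetical helper vector geo_keys:

```r
library(dplyr)

# Made-up miniature tables; the village codes are placeholders.
df_geo <- data.frame(
  district     = c("west", "west", "south"),
  sector       = c("sectorA", "sectorA", "sectorC"),
  cell         = c("cellA", "cellA", "cellC"),
  village      = c("villageA", "villageB", "villageD"),
  village_code = c("10000001", "10000002", "10000004"),
  stringsAsFactors = FALSE
)
df_distr <- data.frame(
  v1       = c(1, 2),
  district = c("west", "south"),
  sector   = c("sectorA", "sectorC"),
  cell     = c("cellA", "cellC"),
  village  = c("villageB", "villageD"),
  stringsAsFactors = FALSE
)
# The location columns sit in different positions here on purpose.
df_survey <- data.frame(
  v5       = c("x", "y"),
  village  = c("villageA", "villageB"),
  cell     = c("cellA", "cellA"),
  sector   = c("sectorA", "sectorA"),
  district = c("west", "west"),
  stringsAsFactors = FALSE
)

# Hypothetical name for the shared key columns
geo_keys <- c("district", "sector", "cell", "village")

# Named list in, named list of joined data frames out
joined <- lapply(
  list(df_distr = df_distr, df_survey = df_survey),
  left_join, df_geo, by = geo_keys
)
```

Because the join matches on column names, it should not matter that the location columns are in different positions in each table.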

This returns a list of data frames, but it sounds like you want to assign them back into the global environment.

lapply(names(output),function(x){
  assign(x,value=output[[x]],envir=globalenv())
})
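This time, we call lapply on the names of output (the names we gave in the previous step). For each name, we take the element of output with that name and assign it to that name in the global environment.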
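
As a possible shortcut, base R's list2env() copies every element of a named list into an environment in one call. This is a minimal sketch, assuming a toy output list rather than the joined tables from the answer:

```r
# Toy stand-in for the named list of joined data frames
output <- list(
  tab1 = data.frame(key = c("a", "b"), value = 1:2),
  tab2 = data.frame(key = c("g", "h"), value = 7:8)
)

# Create (or overwrite) one global variable per list name.
# invisible() keeps the returned environment from being printed.
invisible(list2env(output, envir = globalenv()))
```

After this runs, tab1 and tab2 exist as ordinary variables, so the effect should match the lapply/assign loop.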

Now all of the tab* data frames have been updated, and all we had to do was write a named list of the data frames that needed updating.
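
Since the question also mentions for loops, here is a rough loop-based equivalent. It is only a sketch: it swaps in base R's merge() with all.x = TRUE (a left join) instead of dplyr's left_join, and it relies on get() and assign() to look the tables up by name. Note that merge() sorts the result by the key column by default.

```r
reference <- data.frame(key = letters[1:10], value = 1:10)
tab1 <- data.frame(journey = LETTERS[1:3], key = letters[1:3])
tab2 <- data.frame(trip = LETTERS[7:10], key = letters[7:10])

for (nm in c("tab1", "tab2")) {
  # Fetch the table by name, left-join the reference, store it back
  updated <- merge(get(nm), reference, by = "key", all.x = TRUE)
  assign(nm, updated)
}
```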

You can take advantage of data.table's modify-by-reference behavior to append the village_code column without reassigning the data frames:

library(data.table)

setDT(df_geo)
setDT(df_distr)
setDT(df_survey)

lapply(list(df_distr, df_survey), 
       function(x) x[df_geo, village_code := i.village_code, 
                     on=.(district, sector, cell, village)])
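Note that df_distr and df_survey get printed as a result of lapply, but the data frames themselves have been modified in place. If you only want the side effect of :=, you can use purrr::walk, which works like lapply/map but suppresses the output: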

library(purrr)
walk(list(df_distr, df_survey), ~ .[df_geo, village_code := i.village_code,
                                    on=.(district, sector, cell, village)])
Note that this approach is also much faster than reassigning, because no copy is made when the data.frames are modified in place.
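
If you would rather not load purrr, wrapping the lapply call in invisible() should suppress the printing just as well. The sketch below assumes toy tables geo, t1 and t2 with a single key column key_col (all hypothetical names) and shows that rows without a match in the reference end up with NA:

```r
library(data.table)

# Toy reference table and two tables to update
geo <- data.table(key_col = c("a", "b", "c"), code = c(101L, 102L, 103L))
t1  <- data.table(key_col = c("a", "b"), x = 1:2)
t2  <- data.table(key_col = c("c", "d"), y = 3:4)

# Update join: add `code` to each table by reference, print nothing
invisible(lapply(
  list(t1, t2),
  function(dt) dt[geo, code := i.code, on = "key_col"]
))

# t1 and t2 now carry a `code` column; "d" has no match, so it gets NA
```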

Result:

> df_distr
   v1 district  sector  cell  village v2 v3 village_code
1: ..     west sectorA cellA villageA .. ..     XXXXXXXX
2: ..     west sectorA cellA villageB .. ..     XXXXXXXX
3: ..     west sectorB cellB villageC .. ..     XXXXXXXX
4: ..    south sectorC cellC villageD .. ..     XXXXXXXX

> df_survey
   v1 v5 v6 district  sector  cell  village v7 village_code
1: .. .. ..     west sectorA cellA villageA ..     XXXXXXXX
2: .. .. ..     west sectorA cellA villageB ..     XXXXXXXX
3: .. .. ..     west sectorB cellB villageC ..     XXXXXXXX
4: .. .. ..    south sectorC cellC villageD ..     XXXXXXXX

Data:

df_geo = read.table(text = "district sector cell    village  village_code
west    sectorA cellA   villageA    XXXXXXXX
west    sectorA cellA   villageB    XXXXXXXX
west    sectorB cellB   villageC    XXXXXXXX
south   sectorC cellC   villageD    XXXXXXXX", header = TRUE)

df_distr = read.table(text = "v1  district sector cell    village     v2  v3
..  west    sectorA cellA   villageA    ..  ..
..  west    sectorA cellA   villageB    ..  ..
..  west    sectorB cellB   villageC    ..  ..
..  south   sectorC cellC   villageD    ..  ..", header = TRUE)

df_survey = read.table(text = "v1  v5  v6  district sector  cell   village     v7
..  ..  ..  west    sectorA cellA   villageA    ..
..  ..  ..  west    sectorA cellA   villageB    ..
..  ..  ..  west    sectorB cellB   villageC    ..
..  ..  ..  south   sectorC cellC   villageD    ..", header = TRUE)

If I understand correctly, you want to left join each dataset with the reference dataset so that all of them end up with village_code attached.
@user Yes, that is correct.