R: Conditional keyed join/update and update a flag column for matches
This is very similar to the question @DavidArenburg asked about conditional keyed joins, with one additional problem that I can't seem to solve. Basically, in addition to the conditional join, I want to define a flag indicating at which step of the matching process the match occurred; my problem is that I can only get the flag defined for all values, not just for the matched values. Here is what I hope is a minimal working example:
library(data.table)

DT = data.table(
  name = c("Joe", "Joe", "Jim", "Carol", "Joe",
           "Carol", "Ann", "Ann", "Beth", "Joe", "Joe"),
  surname = c("Smith", "Smith", "Jones",
              "Clymer", "Smith", "Klein", "Cotter",
              "Cotter", "Brown", "Smith", "Smith"),
  maiden_name = c("", "", "", "", "", "Clymer",
                  "", "", "", "", ""),
  id = c(1, 1:3, rep(NA, 7)),
  year = rep(1:4, c(4, 3, 2, 2)),
  # flags start as FALSE, matching the printed table below
  flag1 = FALSE, flag2 = FALSE, key = "year"
)
DT
#      name surname maiden_name id year flag1 flag2
#  1:   Joe   Smith              1    1 FALSE FALSE
#  2:   Joe   Smith              1    1 FALSE FALSE
#  3:   Jim   Jones              2    1 FALSE FALSE
#  4: Carol  Clymer              3    1 FALSE FALSE
#  5:   Joe   Smith             NA    2 FALSE FALSE
#  6: Carol   Klein      Clymer NA    2 FALSE FALSE
#  7:   Ann  Cotter             NA    2 FALSE FALSE
#  8:   Ann  Cotter             NA    3 FALSE FALSE
#  9:  Beth   Brown             NA    3 FALSE FALSE
# 10:   Joe   Smith             NA    4 FALSE FALSE
# 11:   Joe   Smith             NA    4 FALSE FALSE
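My approach is, for each year, to first try to match on name/surname from the previous years; if that fails, then try to match on name/maiden name. I want to define flag1 to indicate an exact match and flag2 to indicate a marriage: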
for (yr in 2:4) {
  # which ids have we hit so far?
  existing_ids = DT[.(yr), unique(id)]
  # find people in prior years appearing to
  #   correspond to those people
  unmatched =
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)
  # merge a la Arun, define flag1
  setkey(DT, name, surname)
  DT[year == yr, c("id", "flag1") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)
  # repeat, this time keying on name/maiden_name
  existing_ids = DT[.(yr), unique(id)]
  unmatched =
    DT[.(1:(yr - 1))][!id %in% existing_ids, .SD[.N], by = id]
  setkey(unmatched, name, surname)
  # now define flag2 = TRUE
  setkey(DT, name, maiden_name)
  DT[year == yr & is.na(id), c("id", "flag2") := unmatched[.SD, .(id, TRUE)]]
  setkey(DT, year)
  # this is messy, but I'm trying to increment id
  #   for "new" individuals
  setkey(DT, name, surname, maiden_name)
  DT[year == yr & is.na(id),
     id := unique(
       DT[year == yr & is.na(id)],
       by = c("name", "surname", "maiden_name")
     )[ , count := .I][.SD, count] + DT[ , max(id, na.rm = TRUE)]
  ]
  # re-sort by year at the end
  setkey(DT, year)
}
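I had hoped that by including the TRUE value in the j argument while defining id, only the matched names (e.g., Joe at the first step) would have their flag updated to TRUE, but that is not the case. They are all updated: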
DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2  TRUE  TRUE
#  6: Carol   Klein      Clymer  3    2  TRUE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3  TRUE  TRUE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE
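A likely reason every flag ends up TRUE is that a join of the form x[i, j] returns one row per row of i, and a constant in j is recycled to all of those rows, matched or not. The following is only a minimal sketch of that behaviour; the toy tables `prior` and `current` and the column `matched` are hypothetical and not part of the question's data.

```r
library(data.table)

# Hypothetical toy tables (not the question's data)
prior   <- data.table(name = c("Joe", "Carol"), id = c(1, 3))
current <- data.table(name = c("Joe", "Ann"))

# The join returns one row per row of `current`. The constant TRUE in j
# is recycled to every returned row, whether or not that row matched.
res <- prior[current, .(name = i.name, id, flag = TRUE), on = "name"]
print(res)

# Rows without a match can only be told apart by their NA id
res[, matched := !is.na(id)]
print(res)
```

In this sketch "Ann" has no match in `prior`, yet her `flag` is TRUE; only `id` (NA) reveals the miss, which mirrors what happens to flag1 and flag2 in the loop above.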
Is there any way to update the flag values for the matched rows only? The ideal output would be:
DT[]
#      name surname maiden_name id year flag1 flag2
#  1: Carol  Clymer              3    1 FALSE FALSE
#  2:   Jim   Jones              2    1 FALSE FALSE
#  3:   Joe   Smith              1    1 FALSE FALSE
#  4:   Joe   Smith              1    1 FALSE FALSE
#  5:   Ann  Cotter              4    2 FALSE FALSE
#  6: Carol   Klein      Clymer  3    2 FALSE  TRUE
#  7:   Joe   Smith              1    2  TRUE FALSE
#  8:   Ann  Cotter              4    3  TRUE FALSE
#  9:  Beth   Brown              5    3 FALSE FALSE
# 10:   Joe   Smith              1    4  TRUE FALSE
# 11:   Joe   Smith              1    4  TRUE FALSE
I think the flags are messy here; it would be better to simply identify the source of the id:
# dt is the question's table (called DT above); work on a copy
dt <- copy(DT)
dt[, c("flag1", "flag2") := NULL]
# create name -> id table
namemap <- unique(dt[, .(maiden_name, id, year), keyby = .(name, surname)], by = NULL)
# tag original ids
namemap[!is.na(id), src := "original"]
# carried over from earlier years
namemap[, has_oid := any(!is.na(id)), by = key(namemap)]
namemap[(has_oid), `:=`(
  id  = id[!is.na(id)],
  src = ifelse(is.na(id), "history", src)
), by = .(name, surname)]
# carry over for surname changes on marriage
namemap[maiden_name != "", `:=`(
  id  = namemap[.BY]$id,
  src = "maiden"
), by = .(name, maiden_name)]
# create new ids where none exists
namemap[is.na(id), `:=`(
  id  = .GRP + max(dt$id, na.rm = TRUE),
  src = "new"
), by = .(name, surname)]
# copy back to the original table
setkey(dt, name, surname, year)
setkey(namemap, name, surname, year)
dt[namemap, `:=`(
  id  = i.id,
  src = i.src
)]
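The original order of the data will be lost, but it can easily be restored if needed.

I think the key (no pun intended) is to realize that the merge returns NA for the missing ids, so I should add the flag to unmatched at each step, e.g., at step 1: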
unmatched <- dt[.(1:(yr - 1L))
                ][!id %in% existing_ids,
                  .SD[.N], by = id][ , flag1 := TRUE]
# `on` is given as a character vector of column names
dt[year == yr, c("id", "flag1") :=
     unmatched[.SD, .(id, flag1), on = c("name", "surname")]]
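One remaining issue is that some flags that should be F have been reset to NA; it would be nice to be able to set nomatch = F, but I'm not too worried about this side effect. The key thing for me is knowing when each flag is T.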
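One way to avoid the NA side effect might be an update join, where the lookup table is passed as i and `:=` is used in j: only rows of the main table that find a match are assigned, so all other rows keep their existing values. This is a minimal sketch under assumptions, not the code from the answers: the toy `dt`, the helper table `known`, and the choice to restrict the join by adding a `year` column to `known` are all hypothetical.

```r
library(data.table)

# Hypothetical toy data (not the question's table)
dt <- data.table(
  name    = c("Joe", "Ann", "Joe"),
  surname = c("Smith", "Cotter", "Smith"),
  id      = c(1, NA, NA),
  year    = c(1L, 2L, 2L),
  flag1   = FALSE
)

yr <- 2L
# One row per person already identified in earlier years
known <- unique(dt[year < yr & !is.na(id), .(name, surname, id)])
# Tag the target year so the join only touches rows of that year
known[, year := yr]

# Update join: only rows of dt matching a row of `known` are assigned,
# so flag1 stays FALSE (not NA) for every other row
dt[known, on = .(name, surname, year), `:=`(id = i.id, flag1 = TRUE)]
print(dt)
```

Here only the year-2 "Joe Smith" row receives id 1 and flag1 = TRUE, while "Ann Cotter" keeps id NA and flag1 = FALSE. Note that this sketch would also overwrite an id already present in the target year; a real version would probably need an extra condition such as the is.na(id) check used in the question's second step.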
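So basically, we would merge the result of the merge I'm doing back into the original table?
@MichaelChirico I've updated my answer. That is probably what I would do. I don't think there is any need to mention the year.
I fear I'm guilty either of oversimplifying my working example or of doing something more exacting in practice.
Updated, and more complicated now; still simpler than what I'm actually doing, but I think it now has all the necessary nuances.
Yes, the code has been cleaned up a lot since then, but you get the idea: it is rather byzantine. String data gives me nightmares...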
Output after adding the flag to unmatched at each step:
> dt[ ]
     name surname maiden_name id year flag1 flag2
 1: Carol  Clymer              3    1 FALSE FALSE
 2:   Jim   Jones              2    1 FALSE FALSE
 3:   Joe   Smith              1    1 FALSE FALSE
 4:   Joe   Smith              1    1 FALSE FALSE
 5:   Ann  Cotter              4    2    NA    NA
 6: Carol   Klein      Clymer  3    2    NA  TRUE
 7:   Joe   Smith              1    2  TRUE FALSE
 8:   Ann  Cotter              4    3  TRUE FALSE
 9:  Beth   Brown              5    3    NA    NA
10:   Joe   Smith              1    4  TRUE FALSE
11:   Joe   Smith              1    4  TRUE FALSE