R: merging data frames based on multiple columns and thresholds
I have two data.frames with multiple common columns (here: date, city, ctry, and (other_)number).
Now I would like to merge them on the columns above, but tolerate some level of difference:
threshold.numbers <- 3
threshold.date <- 5 # in days
Now I want to merge the data.frames and receive a df in which rows are merged if the conditions above are met.
(The first column is for convenience only: after the first digit, which indicates the original case, it shows whether the row was merged (.) or comes from df1 (1) or df2 (2).)
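     date       city        ctry        number other_col other_number other_col2 # comment
1.   2003-08-29 Berlin      Germany     10     apple     13           yellow     # matched on date, city, number
2.   1999-06-12 Paris       France      20     banana    17           green      # matched on date, city similar, number - other_number == threshold.numbers
31   2000-08-29 London      UK          30     pear      <NA>         <NA>       # not matched: number - other_number > threshold.numbers
32   2000-08-29 near London UK          <NA>   <NA>      3100         blue       #
41   1999-02-24 Rome        Italy       40     banana    <NA>         <NA>       # not matched: number - other_number > threshold.numbers
42   1999-02-24 Rome        Italy       <NA>   <NA>      45           red        #
51   2001-04-17 Bern        Switzerland 50     lemon     <NA>         <NA>       # not matched: cities different (dates okay, numbers okay)
52   2001-04-17 Zurich      Switzerland <NA>   <NA>      51           purple     #
6.   1999-06-30 Copenhagen  Denmark     60     cucumber  61           orange     # matched: date difference < threshold.date (cities okay, numbers okay)
71   1999-03-16 Warsaw      Poland      70     apple     <NA>         <NA>       # not matched: number - other_number > threshold.numbers (dates okay)
72   1999-03-14 Warsaw      Poland      <NA>   <NA>      780          blue       #
81   1999-07-16 Moscow      Russia      80     peach     <NA>         <NA>       # not matched: number - other_number > threshold.numbers (dates okay)
82   1999-07-17 Moscow      Russia      <NA>   <NA>      85           red        #
91   2001-08-29 Tunis       Tunisia     90     cherry    <NA>         <NA>       # not matched: date difference > threshold.date (cities okay, numbers okay)
92   2000-01-29 Tunis       Tunisia     <NA>   <NA>      90           black      #
101  2002-07-30 Vienna      Austria     100    cherry    <NA>         <NA>       # not matched: date difference > threshold.date (cities okay, numbers okay)
102  2002-07-01 Vienna      Austria     <NA>   <NA>      101          beige      #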
I tried different ways of merging them, but could not get the thresholds implemented.
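For reference, the example data can be reconstructed from the outputs printed in the answers. This is only a sketch: the column names follow those outputs (`col` and `other_col` in particular), while the row order of `df2`, storing `date` as `Date`, and keeping the text columns as character are assumptions, not necessarily how the original data were built.

```r
# Example data reconstructed from the printed outputs (types and df2 row order are assumptions)
ctry <- c("Germany", "France", "UK", "Italy", "Switzerland",
          "Denmark", "Poland", "Russia", "Tunisia", "Austria")

df1 <- data.frame(
  date = as.Date(c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17",
                   "1999-06-30", "1999-03-16", "1999-07-16", "2001-08-29", "2002-07-30")),
  city = c("Berlin", "Paris", "London", "Rome", "Bern",
           "Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"),
  ctry = ctry,
  number = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  col = c("apple", "banana", "pear", "banana", "lemon",
          "cucumber", "apple", "peach", "cherry", "cherry"),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  date = as.Date(c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17",
                   "1999-06-29", "1999-03-14", "1999-07-17", "2000-01-29", "2002-07-01")),
  city = c("Berlin", "East-Paris", "near London", "Rome", "Zurich",
           "Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"),
  ctry = ctry,
  other_number = c(13, 17, 3100, 45, 51, 61, 780, 85, 90, 101),
  other_col = c("yellow", "green", "blue", "red", "purple",
                "orange", "blue", "red", "black", "beige"),
  stringsAsFactors = FALSE
)

threshold.numbers <- 3
threshold.date <- 5 # in days
```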
EDIT
Apologies for the unclear wording: I would like to retain all rows and receive an indicator of whether each row is matched, unmatched and from df1, or unmatched and from df2.
The pseudocode is:
```
if there is a case where abs("date_df2" - "date_df1") <= threshold.date:
  if "ctry_df2" == "ctry_df1":
    if "city_df2" ~ "city_df1":
      if abs("number_df2" - "number_df1") <= threshold.numbers:
        merge and go to next row in df2
else:
  add row to df1
```
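The nested conditions in the pseudocode can be collapsed into a single predicate. The sketch below is one possible reading: the helper name `rows_match` is hypothetical, and `~` is interpreted as "one city name contains the other", which is an assumption based on the `Paris` / `East-Paris` example.

```r
threshold.numbers <- 3
threshold.date <- 5 # in days

# Hypothetical helper: TRUE if a df1 row and a df2 row satisfy all four conditions
rows_match <- function(date1, city1, ctry1, number1,
                       date2, city2, ctry2, number2) {
  # Date - Date yields a difference in days
  date_ok   <- abs(as.numeric(as.Date(date2) - as.Date(date1))) <= threshold.date
  ctry_ok   <- ctry1 == ctry2
  # "similar" cities: one name is a literal substring of the other (assumption)
  city_ok   <- grepl(city1, city2, fixed = TRUE) | grepl(city2, city1, fixed = TRUE)
  number_ok <- abs(number2 - number1) <= threshold.numbers
  date_ok & ctry_ok & city_ok & number_ok
}

# Number difference equals the threshold, so the rows still match
paris <- rows_match("1999-06-12", "Paris", "France", 20,
                    "1999-06-12", "East-Paris", "France", 17)

# Number difference (5) exceeds the threshold, so the rows do not match
moscow <- rows_match("1999-07-16", "Moscow", "Russia", 80,
                     "1999-07-17", "Moscow", "Russia", 85)
```

Here `paris` is `TRUE` and `moscow` is `FALSE`, in line with the comments in the expected result.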
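Step 1: Merge the data based on "city" and "ctry".
Step 2: Remove rows if the difference between the date entries is > threshold.date (in days).
Step 3: Remove rows if the difference between the numbers is > threshold.numbers:
number_diff = abs(df$number - df$other_number)
index_remove = number_diff > threshold.numbers
df = df[!index_remove,] # index_remove is logical, so negate it with ! (a minus sign would not drop the flagged rows)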
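The three steps can be put together as follows. This is a minimal sketch on a three-row subset of the example data; it assumes an exact merge on `city` and `ctry` as described in step 1, so rows with merely similar city names, and rows that fail a threshold, are dropped rather than kept.

```r
threshold.numbers <- 3
threshold.date <- 5 # in days

# Three-row subset of the example data (for illustration)
df1 <- data.frame(date = c("2003-08-29", "1999-03-16", "2002-07-30"),
                  city = c("Berlin", "Warsaw", "Vienna"),
                  ctry = c("Germany", "Poland", "Austria"),
                  number = c(10, 70, 100),
                  stringsAsFactors = FALSE)
df2 <- data.frame(date = c("2003-08-29", "1999-03-14", "2002-07-01"),
                  city = c("Berlin", "Warsaw", "Vienna"),
                  ctry = c("Germany", "Poland", "Austria"),
                  other_number = c(13, 780, 101),
                  stringsAsFactors = FALSE)

# Step 1: merge on city and ctry; the shared date column becomes date.x / date.y
df <- merge(df1, df2, by = c("city", "ctry"))

# Step 2: drop rows whose dates differ by more than threshold.date days
date_diff <- abs(as.numeric(difftime(as.Date(df$date.x), as.Date(df$date.y), units = "days")))
df <- df[date_diff <= threshold.date, ]

# Step 3: drop rows whose numbers differ by more than threshold.numbers
number_diff <- abs(df$number - df$other_number)
df <- df[number_diff <= threshold.numbers, ]

df$city # only the Berlin row passes both thresholds
```

Vienna is removed in step 2 (29 days apart) and Warsaw in step 3 (70 versus 780).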
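The data should be merged before the conditions are applied, in case rows do not match. I first convert the city names to character vectors, because (if I understand correctly) you want to include city names that are contained within the names in df2:
df1$city<-as.character(df1$city)
df2$city<-as.character(df2$city)
After merging on ctry, the stringr library lets you check whether city.x is contained in city.y; the result is stored in the city_keep column.
But we want to drop the rows where city.x is not found in city.y, where the day difference is greater than 5, or where the number difference is greater than 3: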
df<-df[df$dayDiff<=5 & df$numDiff<=3 & df$city_keep==TRUE,]
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
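`str_detect(city.y, city.x)` only checks whether `city.x` occurs inside `city.y`. If containment should count in either direction, a base R sketch could look like the one below. The helper name `city_similar` is hypothetical, and `fixed = TRUE` is an added assumption so that city names are matched literally rather than as regular expressions.

```r
# Hypothetical helper: TRUE where one city name is a literal substring of the other
city_similar <- function(a, b) {
  a_in_b <- mapply(grepl, a, b, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
  b_in_a <- mapply(grepl, b, a, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
  a_in_b | b_in_a
}

similar <- city_similar(c("Paris", "near London", "Bern"),
                        c("East-Paris", "London", "Zurich"))
similar
# TRUE TRUE FALSE
```

The second pair matches only because of the reverse check: "London" is contained in "near London", not the other way round.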
An option using data.table: both tables get range columns for a non-equi join on ctry, the number range, and the date range, and the join is performed in both directions (explanations are inline in the code).
You can test the city match with grepl and the ctry match with a simple ==. For the rows that match up to that point, you can compute the date difference by converting to dates with as.Date and comparing the result with a difftime. The same goes for the number difference.
i1 <- seq_len(nrow(df1)) #Store all rows
i2 <- seq_len(nrow(df2))
res <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) { #Loop over all rows in df1; lapply always returns a list
  t1 <- which(df1$ctry[i] == df2$ctry) #Match ctry
  t2 <- grepl(df1$city[i], df2$city[t1]) |
    vapply(df2$city[t1], grepl, logical(1), x = df1$city[i]) #Match city in either direction
  t1 <- t1[t2] #Keep only candidates whose city matches
  t1 <- t1[abs(as.numeric(as.Date(df1$date[i]) - as.Date(df2$date[t1]))) <= threshold.date & #Test for date difference (in days)
    abs(df1$number[i] - df2$other_number[t1]) <= threshold.numbers] #Test for number difference
  if(length(t1) > 0) { #Match found
    i1 <<- i1[i1 != i] #Remove row as it was found
    i2 <<- i2[!i2 %in% t1] #Also works if several rows of df2 matched
    cbind(df1[i,], df2[t1,c("other_number","other_col")], match=".")
  }
}))
rbind(res
, cbind(df1[i1,], other_number=NA, other_col=NA, match="1")
, cbind(df2[i2,1:3], number=NA, col=NA, other_number=df2[i2,4]
, other_col=df2[i2,5], match="2"))
# date city ctry number col other_number other_col match
#1 2003-08-29 Berlin Germany 10 apple 13 yellow .
#2 1999-06-12 Paris France 20 banana 17 green .
#6 1999-06-30 Copenhagen Denmark 60 cucumber 61 orange .
#3 2000-08-29 London UK 30 pear NA <NA> 1
#4 1999-02-24 Rome Italy 40 banana NA <NA> 1
#5 2001-04-17 Bern Switzerland 50 lemon NA <NA> 1
#7 1999-03-16 Warsaw Poland 70 apple NA <NA> 1
#8 1999-07-16 Moscow Russia 80 peach NA <NA> 1
#9 2001-08-29 Tunis Tunisia 90 cherry NA <NA> 1
#10 2002-07-30 Vienna Austria 100 cherry NA <NA> 1
#31 2000-08-29 near London UK NA <NA> 3100 blue 2
#41 1999-02-24 Rome Italy NA <NA> 45 red 2
#51 2001-04-17 Zurich Switzerland NA <NA> 51 purple 2
#71 1999-03-14 Warsaw Poland NA <NA> 780 blue 2
#81 1999-07-17 Moscow Russia NA <NA> 85 red 2
#91 2000-01-29 Tunis Tunisia NA <NA> 90 black 2
#101 2002-07-01 Vienna Austria NA <NA> 101 beige 2
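Here is a solution using my package safejoin, which in this case wraps the package fuzzyjoin.
We can use the by argument to specify a complex condition, using the function X() to get the value from df1 and the function Y() to get the value from df2.
If your real tables are big, this might be slow or impossible since it computes a Cartesian product, but here it works fine.
What we want is a full join (keep all rows, and join those that can be joined). We want to keep the first value when rows are joined and use the next one otherwise, which means handling the conflicts between identically named columns by coalescing, so we use the argument conflict = dplyr::coalesce.
# remotes::install_github("moodymudskipper/safejoin")
# date is a factor, t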
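The same idea (evaluate the condition on the Cartesian product, then keep the unmatched rows from both sides) can be sketched in base R without safejoin or fuzzyjoin. This is not the safejoin API; the three-row data and the `match` indicator values are illustrative assumptions, and city similarity is again taken to mean literal substring containment in either direction.

```r
threshold.numbers <- 3
threshold.date <- 5 # in days

df1 <- data.frame(date = as.Date(c("1999-06-12", "2001-04-17", "1999-07-16")),
                  city = c("Paris", "Bern", "Moscow"),
                  ctry = c("France", "Switzerland", "Russia"),
                  number = c(20, 50, 80), stringsAsFactors = FALSE)
df2 <- data.frame(date = as.Date(c("1999-06-12", "2001-04-17", "1999-07-17")),
                  city = c("East-Paris", "Zurich", "Moscow"),
                  ctry = c("France", "Switzerland", "Russia"),
                  other_number = c(17, 51, 85), stringsAsFactors = FALSE)

# All pairs of row indices (Cartesian product)
idx <- expand.grid(i = seq_len(nrow(df1)), j = seq_len(nrow(df2)))
a <- df1[idx$i, ]
b <- df2[idx$j, ]

# Evaluate every condition on every pair
city_ok <- mapply(grepl, a$city, b$city, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE) |
  mapply(grepl, b$city, a$city, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
ok <- a$ctry == b$ctry & city_ok &
  abs(as.numeric(a$date - b$date)) <= threshold.date &
  abs(a$number - b$other_number) <= threshold.numbers
hit <- idx[ok, ]

# Matched pairs keep the df1 values and gain other_number from df2
matched <- df1[hit$i, ]
matched$other_number <- df2$other_number[hit$j]

# Rows without any partner, from either side
only1 <- df1[!seq_len(nrow(df1)) %in% hit$i, ]
only1$other_number <- rep(NA_real_, nrow(only1))
only2 <- df2[!seq_len(nrow(df2)) %in% hit$j, ]
only2$number <- rep(NA_real_, nrow(only2))

cols <- c("date", "city", "ctry", "number", "other_number")
res <- rbind(matched[cols], only1[cols], only2[cols])
res$match <- rep(c(".", "1", "2"), c(nrow(matched), nrow(only1), nrow(only2)))
rownames(res) <- NULL
res
```

Only the Paris pair is merged here; Bern/Zurich fails the city test and the two Moscow rows differ by 5 in the number column. If one row of df1 matches several rows of df2, this sketch returns one merged row per pair.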
df = merge(df1, df2, by = ("ctry"))
> df
ctry date.x city.x number col date.y city.y other_number other_col
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue
library(stringr)
df$city_keep<-str_detect(df$city.y,df$city.x) # this returns logical vector if city.x is contained in city.y (works one way)
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE
df$dayDiff<-abs(as.numeric(difftime(as.Date(df$date.x), as.Date(df$date.y), units = "days"))) # difference in days, taking the year into account
df$numDiff<-abs(df$number - df$other_number)
> df
   ctry        date.x     city.x     number col      date.y     city.y      other_number other_col city_keep dayDiff numDiff
1  Austria     2002-07-30 Vienna     100    cherry   2002-07-01 Vienna      101          beige     TRUE      29      1
2  Denmark     1999-06-30 Copenhagen 60     cucumber 1999-06-29 Copenhagen  61           orange    TRUE      1       1
3  France      1999-06-12 Paris      20     banana   1999-06-12 East-Paris  17           green     TRUE      0       3
4  Germany     2003-08-29 Berlin     10     apple    2003-08-29 Berlin      13           yellow    TRUE      0       3
5  Italy       1999-02-24 Rome       40     banana   1999-02-24 Rome        45           red       TRUE      0       5
6  Poland      1999-03-16 Warsaw     70     apple    1999-03-14 Warsaw      780          blue      TRUE      2       710
7  Russia      1999-07-16 Moscow     80     peach    1999-07-17 Moscow      85           red       TRUE      1       5
8  Switzerland 2001-04-17 Bern       50     lemon    2001-04-17 Zurich      51           purple    FALSE     0       1
9  Tunisia     2001-08-29 Tunis      90     cherry   2000-01-29 Tunis       90           black     TRUE      578     0
10 UK          2000-08-29 London     30     pear     2000-08-29 near London 3100         blue      TRUE      0       3070
df<-df[df$dayDiff<=5 & df$numDiff<=3 & df$city_keep==TRUE,]
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
df<-subset(df, select=-c(city.y, date.y, city_keep, dayDiff, numDiff))
> df
ctry date.x city.x number col other_number other_col
2 Denmark 1999-06-30 Copenhagen 60 cucumber 61 orange
3 France 1999-06-12 Paris 20 banana 17 green
4 Germany 2003-08-29 Berlin 10 apple 13 yellow
library(data.table)
setDT(df1)
setDT(df2)
#dupe columns and create ranges for non-equi joins
df1[, c("n", "ln", "un", "d", "ld", "ud") := .(
number, number - threshold.numbers, number + threshold.numbers,
date, date - threshold.date, date + threshold.date)]
df2[, c("n", "ln", "un", "d", "ld", "ud") := .(
other_number, other_number - threshold.numbers, other_number + threshold.numbers,
date, date - threshold.date, date + threshold.date)]
#perform non-equi join using ctry, num, dates in both ways
res <- rbindlist(list(
df1[df2, on=.(ctry, n>=ln, n<=un, d>=ld, d<=ud),
.(date1=x.date, date2=i.date, city1=x.city, city2=i.city, ctry1=x.ctry, ctry2=i.ctry, number, col, other_number, other_col)],
df2[df1, on=.(ctry, n>=ln, n<=un, d>=ld, d<=ud),
.(date1=i.date, date2=x.date, city1=i.city, city2=x.city, ctry1=i.ctry, ctry2=x.ctry, number, col, other_number, other_col)]),
use.names=TRUE, fill=TRUE)
#determine if cities are substrings of one and another
res[, city_match := {
i <- mapply(grepl, city1, city2) | mapply(grepl, city2, city1)
replace(i, is.na(i), TRUE)
}]
#just like SQL coalesce (there is a version in dev in rdatatable github)
coalesce <- function(...) Reduce(function(x, y) fifelse(!is.na(y), y, x), list(...))
#for rows that are matching or no matches to be found
ans1 <- unique(res[(city_match), .(date=coalesce(date1, date2),
city=coalesce(city1, city2),
ctry=coalesce(ctry1, ctry2),
number, col, other_number, other_col)])
#for rows that are close in terms of dates and numbers but are diff cities
ans2 <- res[(!city_match), .(date=c(.BY$date1, .BY$date2),
city=c(.BY$city1, .BY$city2),
ctry=c(.BY$ctry1, .BY$ctry2),
number=c(.BY$number, NA),
col=c(.BY$col, NA),
other_number=c(NA, .BY$other_number),
other_col=c(NA, .BY$other_col)),
names(res)][, seq_along(names(res)) := NULL]
#final desired output
setorder(rbindlist(list(ans1, ans2)), date, city, number, na.last=TRUE)[]
date city ctry number col other_number other_col
1: 1999-02-24 Rome Italy 40 banana NA <NA>
2: 1999-02-24 Rome Italy NA <NA> 45 red
3: 1999-03-14 Warsaw Poland NA <NA> 780 blue
4: 1999-03-16 Warsaw Poland 70 apple NA <NA>
5: 1999-06-12 East-Paris France 20 banana 17 green
6: 1999-06-29 Copenhagen Denmark 60 cucumber 61 orange
7: 1999-07-16 Moscow Russia 80 peach NA <NA>
8: 1999-07-17 Moscow Russia NA <NA> 85 red
9: 2000-01-29 Tunis Tunisia NA <NA> 90 black
10: 2000-08-29 London UK 30 pear NA <NA>
11: 2000-08-29 near London UK NA <NA> 3100 blue
12: 2001-04-17 Bern Switzerland 50 lemon NA <NA>
13: 2001-04-17 Zurich Switzerland NA <NA> 51 purple
14: 2001-08-29 Tunis Tunisia 90 cherry NA <NA>
15: 2002-07-01 Vienna Austria NA <NA> 101 beige
16: 2002-07-30 Vienna Austria 100 cherry NA <NA>
17: 2003-08-29 Berlin Germany 10 apple 13 yellow
#> date city ctry number col other_col
#> 1 2003-08-29 Berlin Germany 10 apple yellow
#> 2 1999-06-12 Paris France 20 banana green
#> 3 1999-06-30 Copenhagen Denmark 60 cucumber orange
#> 4 2000-08-29 London UK 30 pear <NA>
#> 5 1999-02-24 Rome Italy 40 banana <NA>
#> 6 2001-04-17 Bern Switzerland 50 lemon <NA>
#> 7 1999-03-16 Warsaw Poland 70 apple <NA>
#> 8 1999-07-16 Moscow Russia 80 peach <NA>
#> 9 2001-08-29 Tunis Tunisia 90 cherry <NA>
#> 10 2002-07-30 Vienna Austria 100 cherry <NA>
#> 11 2000-08-29 near London UK 3100 <NA> blue
#> 12 1999-02-24 Rome Italy 45 <NA> red
#> 13 2001-04-17 Zurich Switzerland 51 <NA> purple
#> 14 1999-03-14 Warsaw Poland 780 <NA> blue
#> 15 1999-07-17 Moscow Russia 85 <NA> red
#> 16 2000-01-29 Tunis Tunisia 90 <NA> black
#> 17 2002-07-01 Vienna Austria 101 <NA> beige
merge.criteria = list(
  list(final.col.name = "date",
       col.name.1 = "date",
       col.name.2 = "date",
       exact = FALSE,
       threshold = 5),
  list(final.col.name = "city",
       col.name.1 = "city",
       col.name.2 = "city",
       exact = FALSE,
       match.function = function(x, y) {
         return(mapply(grepl, x, y) |
                  mapply(grepl, y, x))
       }),
  list(final.col.name = "ctry",
       col.name.1 = "ctry",
       col.name.2 = "ctry",
       exact = TRUE),
  list(final.col.name = "number",
       col.name.1 = "number",
       col.name.2 = "other_number",
       exact = FALSE,
       threshold = 3)
)
library(dplyr)
merge.data.frames = function(df1, df2, merge.criteria) {
# Create a data frame with all possible pairs of rows from df1 and rows from
# df2.
row.decisions = expand.grid(df1.row = 1:nrow(df1), df2.row = 1:nrow(df2))
# Iterate over the criteria in merge.criteria. For each criterion, flag row
# pairs that don't meet the criterion.
row.decisions$merge = T
for(criterion in merge.criteria) {
# If we're looking for an exact match, test for equality.
if(criterion$exact) {
row.decisions$merge = row.decisions$merge &
df1[row.decisions$df1.row,criterion$col.name.1] == df2[row.decisions$df2.row,criterion$col.name.2]
}
# If we're doing a threshold test, test for difference.
else if(!is.null(criterion$threshold)) {
row.decisions$merge = row.decisions$merge &
abs(df1[row.decisions$df1.row,criterion$col.name.1] - df2[row.decisions$df2.row,criterion$col.name.2]) <= criterion$threshold
}
# If the user provided a function, use that.
else if(!is.null(criterion$match.function)) {
row.decisions$merge = row.decisions$merge &
criterion$match.function(df1[row.decisions$df1.row,criterion$col.name.1],
df2[row.decisions$df2.row,criterion$col.name.2])
}
}
# Create the new dataframe. Just row numbers of the source dfs to start.
new.df = bind_rows(
# Merged rows.
row.decisions %>% filter(merge) %>% select(-merge),
# Rows from df1 only.
row.decisions %>% group_by(df1.row) %>% summarize(matches = sum(merge)) %>% filter(matches == 0) %>% select(df1.row),
# Rows from df2 only.
row.decisions %>% group_by(df2.row) %>% summarize(matches = sum(merge)) %>% filter(matches == 0) %>% select(df2.row)
)
# Iterate over the merge criteria and add columns that were used for matching
# (from df1 if available; otherwise from df2).
for(criterion in merge.criteria) {
new.df[criterion$final.col.name] = coalesce(df1[new.df$df1.row,criterion$col.name.1],
df2[new.df$df2.row,criterion$col.name.2])
}
# Now add all the columns from either data frame that weren't used for
# matching.
for(other.col in setdiff(colnames(df1),
sapply(merge.criteria, function(x) x$col.name.1))) {
new.df[other.col] = df1[new.df$df1.row,other.col]
}
for(other.col in setdiff(colnames(df2),
sapply(merge.criteria, function(x) x$col.name.2))) {
new.df[other.col] = df2[new.df$df2.row,other.col]
}
# Return the result.
return(new.df)
}
df = merge.data.frames(df1, df2, merge.criteria)