R: Fuzzy string matching between matching subsets of two data sources using plyr
Suppose I have a list of counties with varying amounts of misspellings or other problems that distinguish them from the names in the fips dataframe (code to create it is below), but the states in which the misspelled counties reside are entered correctly. Here is a sample of 21 random observations from my full data set:
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
    county             state
1   Beauregard         LA
2   De Soto            LA
3   Dekalb             GA
4   Webster            LA
5   Saint Joseph       IN
6   West Feliciana     LA
7   Ketchikan Gateway  AK
8   Evangeline         LA
9   Richmond City      VA
10  Saint Mary         LA
11  Saint Louis City   MO
12  Mclean             KY
13  Union              LA
14  Bienville          LA
15  Covington City     VA
16  Martinsville City  VA
17  Claiborne          LA
18  King And Queen     VA
19  Mclean             ND
20  Mcminn             TN
21  Prince Georges     MD
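So I want to restrict the fuzzy string matching of county to the correctly spelled names whose state matches.
My current algorithm builds one large matrix of standard Levenshtein distances between the two sources and then, for each name, selects the candidate with the smallest distance.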
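For readers unfamiliar with that approach, here is a minimal sketch of it in base R. The two toy vectors are hypothetical and stand in for the real county columns; note that nothing here looks at the state, which is exactly the limitation described above.

```r
# Hypothetical toy inputs, for illustration only
messy <- c("Dekalb", "Mclean")
clean <- c("DeKalb", "McLean", "Union")

# adist() (utils) returns a length(messy) x length(clean) matrix
# of generalized Levenshtein distances
d <- adist(messy, clean)

# For each messy name (row), pick the clean name with the smallest distance
best <- clean[apply(d, 1, which.min)]
best
```

Because every messy name is compared against every clean name, a county can be matched to a similarly spelled county in a different state.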
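To solve my problem, I'm guessing I need to write a function that ddply can apply to each "state" group, but I don't know how to tell ddply that the group value inside the function should be matched against another dataframe. A dplyr solution, or a solution using any other package, would also be appreciated.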
Code to create the FIPS dataset:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
'./nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
I don't have sample data to test with, but try using agrep instead of adist, and search only the names that are in that state:
sapply(df_tomatch$county, function(x) agrep(x, df_matchby[df_matchby$state == df_tomatch[x, 'state'], 'county'], value = TRUE))
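You can use the max.distance argument of agrep to change how close the strings need to be in order to match. Also, setting value=TRUE returns the matched strings themselves rather than the positions of the matches.

Here is a way using dplyr. I first join the FIPS names to the tomatch data.frame by state (so that only within-state matches are possible):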
The result in both cases is:
    county             county_clean       state  countyname                 dist  string_agrep
1   Beauregard         Beauregard         LA     Beauregard Parish          0     TRUE
2   De Soto            De Soto            LA     De Soto Parish             0     TRUE
3   Dekalb             Dekalb             GA     DeKalb                     1     TRUE
4   Webster            Webster            LA     Webster Parish             0     TRUE
5   Saint Joseph       St. Joseph         IN     St. Joseph                 0     TRUE
6   West Feliciana     West Feliciana     LA     West Feliciana Parish      0     TRUE
7   Ketchikan Gateway  Ketchikan Gateway  AK     Ketchikan Gateway Borough  0     TRUE
8   Evangeline         Evangeline         LA     Evangeline Parish          0     TRUE
9   Richmond City      Richmond City      VA     Richmond city              1     TRUE
10  Saint Mary         St. Mary           LA     St. Mary Parish            0     TRUE
11  Saint Louis City   St. Louis City     MO     St. Louis city             1     TRUE
12  Mclean             Mclean             KY     McLean                     1     TRUE
13  Union              Union              LA     Union Parish               0     TRUE
14  Bienville          Bienville          LA     Bienville Parish           0     TRUE
15  Covington City     Covington City     VA     Covington city             1     TRUE
16  Martinsville City  Martinsville City  VA     Martinsville city          1     TRUE
17  Claiborne          Claiborne          LA     Claiborne Parish           0     TRUE
18  King And Queen     King And Queen     VA     King and Queen             1     TRUE
19  Mclean             Mclean             ND     McLean                     1     TRUE
20  Mcminn             Mcminn             TN     McMinn                     1     TRUE
21  Prince Georges     Prince Georges     MD     Prince George's            1     TRUE
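Comments:

Your question would benefit greatly from …

Hi @cole, unfortunately this doesn't work. I'm trying to understand the second argument of agrep inside the sapply function. It looks like each element of df_tomatch$county is used as the pattern to match, but I don't understand how tomatch[x,'state'] can be used as a row index. Thanks.

@mcjudd The second argument is the set of strings in which the pattern is searched for. So it looks for the county in df_matchby, but I subset that to the county values where df_tomatch$state is the same as df_matchby$state. That way, each county string is searched for only within the subset of correct county names that share the same state.

@mcjudd I just tried it with your data and found my mistake. The following should work:
sapply(1:nrow(tomatch), function(x) agrep(tomatch$county[x], fips[fips$state == tomatch$state[x], 'countyname'], value = TRUE, max.distance = 0.3))
This gives a list of candidate counties for each row to be matched, and you can extract just the first one, which is the best match. You can fine-tune max.distance to get the best results.

Thanks for the follow-up correction and explanation!

Thank you so much for the thorough answer! It worked great on both my sample data and my full dataset! I can't believe the step I was stuck on was as simple as a left join on state. Thanks again.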
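The within-state agrep idea from the comments can be sketched in base R as follows. This is only an illustration under stated assumptions: tomatch_toy and fips_toy are hypothetical stand-ins for tomatch and fips, and match_in_state is a helper name made up here. Since agrep returns its hits in the order of the candidates rather than by closeness, the sketch ranks the hits with adist instead of relying on the first one.

```r
# Hypothetical toy data standing in for `tomatch` and `fips`
tomatch_toy <- data.frame(county = c("Dekalb", "Mclean"),
                          state  = c("GA", "KY"),
                          stringsAsFactors = FALSE)
fips_toy <- data.frame(state      = c("GA", "GA", "KY", "ND"),
                       countyname = c("DeKalb", "Decatur", "McLean", "McLean"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: fuzzy-match one county against reference names
# from the same state only
match_in_state <- function(county, state, ref, max.distance = 0.3) {
  candidates <- ref$countyname[ref$state == state]
  hits <- agrep(county, candidates, value = TRUE, max.distance = max.distance)
  if (length(hits) == 0) return(NA_character_)
  # agrep() keeps candidate order, so choose the hit with the smallest
  # edit distance explicitly
  hits[which.min(adist(county, hits))]
}

matched <- mapply(match_in_state, tomatch_toy$county, tomatch_toy$state,
                  MoreArgs = list(ref = fips_toy), USE.NAMES = FALSE)
matched
```

Returning NA_character_ when nothing is within max.distance keeps the result the same length as the input, so it can be attached back to the original rows.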
library(dplyr)

# Step-by-step version.
# Join every FIPS county name from the same state onto each row of tomatch.
df <- tomatch %>%
  left_join(fips, by = "state")

# The FIPS names use 'St.' where tomatch has 'Saint'.
df <- df %>%
  mutate(county_clean = gsub("Saint", "St.", county))

df <- df %>%
  group_by(county_clean) %>%  # Calculate the distance per county
  mutate(dist = diag(adist(county_clean, countyname, partial = TRUE))) %>%
  arrange(county, dist)       # Used this for visual inspection.

df <- df %>%
  rowwise() %>%               # 'group_by' a single row.
  mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup()                   # Always a good idea to remove 'groups' after you're done.

# Keep only the closest candidate(s) for each county.
df <- df %>%
  group_by(county_clean) %>%  # Causes it to calculate the 'min' per group
  filter(dist == min(dist)) %>%
  ungroup()
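The diag(adist(..., partial = TRUE)) step in the code above may look opaque, so here is a small base-R sketch of what it computes. The vectors x and y are hypothetical examples. With two vectors, adist returns the full matrix of all pairings, and diag keeps only the row-by-row pairs; partial = TRUE measures the distance to the best-matching substring, which is why a missing "Parish" suffix does not add to the distance.

```r
# Hypothetical paired inputs: x[i] should be compared with y[i] only
x <- c("Beauregard", "Dekalb")
y <- c("Beauregard Parish", "DeKalb")

# Full 2 x 2 matrix of partial (substring) edit distances
full <- adist(x, y, partial = TRUE)

# The diagonal holds the distance of each x[i] to its own y[i]
pairwise <- diag(full)

# Equivalent element-wise computation that avoids the full matrix
pairwise2 <- mapply(function(a, b) adist(a, b, partial = TRUE),
                    x, y, USE.NAMES = FALSE)
pairwise
```

On large groups the element-wise form avoids building an n x n matrix only to discard everything off the diagonal.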
# The same steps as a single chain.
df <- tomatch %>%
  # Join on all names in the relevant state and clean 'St.'
  left_join(fips, by = "state") %>%
  mutate(county_clean = gsub("Saint", "St.", county)) %>%
  # Calculate the distances, per original county name.
  group_by(county_clean) %>%
  mutate(dist = diag(adist(county_clean, countyname, partial = TRUE))) %>%
  # Append the agrepl result
  rowwise() %>%
  mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
  ungroup() %>%
  # Only retain minimum distances
  group_by(county_clean) %>%
  filter(dist == min(dist))