Fuzzy matching in R: returning the common string between near-duplicate user identifiers

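I am working with secondary user event data collected over the past few years. At some point the provider changed systems, which messed up the user identifier column. In the new system, a prefix (of undefined length) was added to the front of each user identifier, so one user now has two user identifiers. Here is a mock example: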


UserId<-c("+7450df38c6a2c2b18e06", "+6547e4e1645868458dcd", "+3fde2905308abe637fda", 
  "+0d7c1f693fbde98e5214", "+059a31bfea92fae4d292", "+58de3eee8b7b0afef0bf", 
  "+01cdee6d0425f3184b1b", "+2e35e45b40213031e320", "+89de669da4accdf77c14", 
  "+80327216548b4d95fe05", "+8a47ddaace37c5a5870d", "+5415d85869372f40b6f5", 
  "+5f2a35a157cc7c2d1b09", "+e0c57b9d284cf300b12f", "+dc9412a08dc9e321c4ca", 
  "+2127450df38c6a2c2b18e06", "+2126547e4e1645868458dcd", "+21433fde2905308abe637fda", 
  "+2150d7c1f693fbde98e5214", "+215059a31bfea92fae4d292", "+215458de3eee8b7b0afef0bf", 
  "+215401cdee6d0425f3184b1b", "+2182e35e45b40213031e320", "+21889de669da4accdf77c14", 
  "+218880327216548b4d95fe05", "+21118a47ddaace37c5a5870d", "+2115415d85869372f40b6f5", 
  "+2105f2a35a157cc7c2d1b09", "+2100e0c57b9d284cf300b12f", "+244dc9412a08dc9e321c4ca"
)
UserId

Try the following solution:

Set the base length of the Id, for example 21 characters (the leading "+" plus 20 characters):

basic_length<-21
to_compare<-substr(UserId,(nchar(UserId)-basic_length)+2,nchar(UserId))
dup<-duplicated(to_compare)
Here is the list of duplicated base Ids:

to_compare[dup]
 [1] "7450df38c6a2c2b18e06" "6547e4e1645868458dcd" "3fde2905308abe637fda"
 [4] "0d7c1f693fbde98e5214" "059a31bfea92fae4d292" "58de3eee8b7b0afef0bf"
 [7] "01cdee6d0425f3184b1b" "2e35e45b40213031e320" "89de669da4accdf77c14"
[10] "80327216548b4d95fe05" "8a47ddaace37c5a5870d" "5415d85869372f40b6f5"
[13] "5f2a35a157cc7c2d1b09" "e0c57b9d284cf300b12f" "dc9412a08dc9e321c4ca"
If the goal is simply to have one unique Id per user, you can just use:

out<-paste("+",to_compare,sep="")
out
 [1] "+7450df38c6a2c2b18e06" "+6547e4e1645868458dcd" "+3fde2905308abe637fda"
 [4] "+0d7c1f693fbde98e5214" "+059a31bfea92fae4d292" "+58de3eee8b7b0afef0bf"
 [7] "+01cdee6d0425f3184b1b" "+2e35e45b40213031e320" "+89de669da4accdf77c14"
[10] "+80327216548b4d95fe05" "+8a47ddaace37c5a5870d" "+5415d85869372f40b6f5"
[13] "+5f2a35a157cc7c2d1b09" "+e0c57b9d284cf300b12f" "+dc9412a08dc9e321c4ca"
[16] "+7450df38c6a2c2b18e06" "+6547e4e1645868458dcd" "+3fde2905308abe637fda"
[19] "+0d7c1f693fbde98e5214" "+059a31bfea92fae4d292" "+58de3eee8b7b0afef0bf"
[22] "+01cdee6d0425f3184b1b" "+2e35e45b40213031e320" "+89de669da4accdf77c14"
[25] "+80327216548b4d95fe05" "+8a47ddaace37c5a5870d" "+5415d85869372f40b6f5"
[28] "+5f2a35a157cc7c2d1b09" "+e0c57b9d284cf300b12f" "+dc9412a08dc9e321c4ca"
As you can see, positions [1] and [16] hold the same value:

out[1]==out[16]
[1] TRUE
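
The fixed-length approach above relies on every base Id having the same length. When that cannot be assumed, one possible variation is to map each identifier to the shortest identifier in the vector that it ends with. This is only a sketch: `common_id` is a hypothetical helper name, not part of the answer above, and it assumes that no base Id happens to be the tail of a different user's base Id (unlikely for 20-character hex strings, but not guaranteed). It compares every Id against every other, so it is quadratic in the number of identifiers.

```r
# Hypothetical helper (an assumption, not from the original answer):
# map every id to the shortest id in the vector that it ends with.
common_id <- function(ids) {
  core <- sub("^\\+", "", ids)            # drop the leading "+"
  vapply(core, function(id) {
    hits <- core[endsWith(id, core)]      # ids that are a suffix of this one (including itself)
    paste0("+", hits[which.min(nchar(hits))])
  }, character(1), USE.NAMES = FALSE)
}

# Small subset of the mock data from the question
ids <- c("+7450df38c6a2c2b18e06", "+2127450df38c6a2c2b18e06",
         "+6547e4e1645868458dcd", "+2126547e4e1645868458dcd")

mapped <- common_id(ids)

# Both identifiers of each user, grouped under the common id
groups <- split(ids, mapped)
```

Here `groups` is a named list with one entry per user, each holding the old and the prefixed identifier, which can then serve as a lookup table when recoding the event data.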

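Have you checked out the fuzzyjoin package? You can define the similarity/distance between strings; once you set the required distance between strings, this should be easy.
If the trailing user string stays the same in both datasets, the result should always be the same. You can also do this if you are matching on a prefix or suffix.
This works perfectly and gives me a common ID; exactly what I wanted, thanks!
You're welcome!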
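
The distance idea from the comments can be illustrated without any extra package. The sketch below uses base R's `adist` (generalized Levenshtein distance) rather than fuzzyjoin, so nothing here should be read as fuzzyjoin's API. The split into `old_ids` and `new_ids` and the threshold `max_dist` are assumptions made for the illustration; the threshold has to be at least as large as the longest prefix you expect.

```r
# Mock data: two old-style ids and their prefixed counterparts (in a different order)
old_ids <- c("+7450df38c6a2c2b18e06", "+6547e4e1645868458dcd")
new_ids <- c("+2126547e4e1645868458dcd", "+2127450df38c6a2c2b18e06")

# Edit-distance matrix: rows are new ids, columns are old ids
d <- adist(new_ids, old_ids)

# Assumed threshold: a prefix of k characters costs k insertions
max_dist <- 5

# For each new id, pick the closest old id and record its distance
nearest <- apply(d, 1, which.min)
matched <- data.frame(
  new_id = new_ids,
  old_id = old_ids[nearest],
  dist   = d[cbind(seq_along(new_ids), nearest)],
  stringsAsFactors = FALSE
)

# Keep only pairs that are close enough to count as the same user
matched <- matched[matched$dist <= max_dist, ]
```

Because a pure prefix only adds characters, the distance between a matching pair equals the prefix length, while unrelated identifiers differ in many more positions. For large data the full distance matrix becomes expensive, which is where a suffix-based comparison is the cheaper option.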