R: How do I merge on fuzzy field names?

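I have several datasets with inconsistent country names, and I want to do a fuzzy merge on the country names.

For example, I have "Iran (I.R.)" and "Iran, Islamic Rep.", and I want them to be treated as equal in a merge or join.

I can tolerate some errors in the matching. I just want an improvement without doing too much work.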

  [1] "Afghanistan"                                         
  [2] "Albania"                                             
  [3] "Algeria"                                             
  [4] "American Samoa"                                      
  [5] "Andorra"                                             
  [6] "Angola"                                              
  [7] "Anguilla"                                            
  [8] "Antigua and Barbuda"                                 
  [9] "Antigua & Barbuda"                                   
 [10] "Arab World"                                          
 [11] "Argentina"                                           
 [12] "Armenia"                                             
 [13] "Aruba"                                               
 [14] "Ascension"                                           
 [15] "Australia"                                           
 [16] "Austria"                                             
 [17] "Azerbaijan"                                          
 [18] "Bahamas"                                             
 [19] "Bahamas, The"                                        
 [20] "Bahrain"                                             
 [21] "Bangladesh"                                          
 [22] "Barbados"                                            
 [23] "Belarus"                                             
 [24] "Belgium"                                             
 [25] "Belize"                                              
 [26] "Benin"                                               
 [27] "Bermuda"                                             
 [28] "Bhutan"                                              
 [29] "Bolivia"                                             
 [30] "Bosnia and Herzegovina"                              
 [31] "Botswana"                                            
 [32] "Brazil"                                              
 [33] "British Virgin Islands"                              
 [34] "Brunei Darussalam"                                   
 [35] "Bulgaria"                                            
 [36] "Burkina Faso"                                        
 [37] "Burundi"                                             
 [38] "Cabo Verde"                                          
 [39] "Cambodia"                                            
 [40] "Cameroon"                                            
 [41] "Canada"                                              
 [42] "Cape Verde"                                          
 [43] "Caribbean small states"                              
 [44] "Cayman Islands"                                      
 [45] "Central African Rep."                                
 [46] "Central African Republic"                            
 [47] "Chad"                                                
 [48] "Channel Islands"                                     
 [49] "Chile"                                               
 [50] "China"                                               
 [51] "Cocos Keeling Islands"                               
 [52] "Colombia"                                            
 [53] "Comoros"                                             
 [54] "Congo"                                               
 [55] "Congo (Dem. Rep.)"                                   
 [56] "Congo, Dem. Rep."                                    
 [57] "Congo, Rep."                                         
 [58] "Costa Rica"                                          
 [59] "Cote d'Ivoire"                                       
 [60] "Côte d'Ivoire"                                       
 [61] "Croatia"                                             
 [62] "Cuba"                                                
 [63] "Curacao"                                             
 [64] "Cyprus"                                              
 [65] "Czech Republic"                                      
 [66] "Denmark"                                             
 [67] "Djibouti"                                            
 [68] "Dominica"                                            
 [69] "Dominican Rep."                                      
 [70] "Dominican Republic"                                  
 [71] "D.P.R. Korea"                                        
 [72] "East Asia and the Pacific (IFC classification)"      
 [73] "East Asia & Pacific (all income levels)"             
 [74] "East Asia & Pacific (developing only)"               
 [75] "Ecuador"                                             
 [76] "Egypt"                                               
 [77] "Egypt, Arab Rep."                                    
 [78] "El Salvador"                                         
 [79] "Equatorial Guinea"                                   
 [80] "Eritrea"                                             
 [81] "Estonia"                                             
 [82] "Ethiopia"                                            
 [83] "Euro area"                                           
 [84] "Europe and Central Asia (IFC classification)"        
 [85] "European Union"                                      
 [86] "Europe & Central Asia (all income levels)"           
 [87] "Europe & Central Asia (developing only)"             
 [88] "Faeroe Islands"                                      
 [89] "Falkland (Malvinas) Is."                             
 [90] "Faroe Islands"                                       
 [91] "Fiji"                                                
 [92] "Finland"                                             
 [93] "France"                                              
 [94] "French Polynesia"                                    
 [95] "Gabon"                                               
 [96] "Gambia"                                              
 [97] "Gambia, The"                                         
 [98] "Georgia"                                             
 [99] "Germany"                                             
[100] "Ghana"                                               
[101] "Gibraltar"                                           
[102] "Greece"                                              
[103] "Greenland"                                           
[104] "Grenada"                                             
[105] "Guam"                                                
[106] "Guatemala"                                           
[107] "Guernsey"                                            
[108] "Guinea"                                              
[109] "Guinea-Bissau"                                       
[110] "Guyana"                                              
[111] "Haiti"                                               
[112] "Heavily indebted poor countries (HIPC)"              
[113] "High income"                                         
[114] "High income: nonOECD"                                
[115] "High income: OECD"                                   
[116] "Honduras"                                            
[117] "Hong Kong, China"                                    
[118] "Hong Kong SAR, China"                                
[119] "Hungary"                                             
[120] "Iceland"                                             
[121] "India"                                               
[122] "Indonesia"                                           
[123] "Iran (I.R.)"                                         
[124] "Iran, Islamic Rep."                                  
[125] "Iraq"                                                
[126] "Ireland"                                             
[127] "Isle of Man"                                         
[128] "Israel"                                              
[129] "Italy"                                               
[130] "Jamaica"                                             
[131] "Japan"                                               
[132] "Jersey"                                              
[133] "Jordan"                                              
[134] "Kazakhstan"                                          
[135] "Kenya"                                               
[136] "Kiribati"                                            
[137] "Korea, Dem. Rep."                                    
[138] "Korea (Rep.)"                                        
[139] "Korea, Rep."                                         
[140] "Kosovo"                                              
[141] "Kuwait"                                              
[142] "Kyrgyz Republic"                                     
[143] "Kyrgyzstan"                                          
[144] "Lao PDR"                                             
[145] "Lao P.D.R."                                          
[146] "Latin America and the Caribbean (IFC classification)"
[147] "Latin America & Caribbean (all income levels)"       
[148] "Latin America & Caribbean (developing only)"         
[149] "Latvia"                                              
[150] "Least developed countries: UN classification"        
[151] "Lebanon"                                             
[152] "Lesotho"                                             
[153] "Liberia"                                             
[154] "Libya"                                               
[155] "Liechtenstein"                                       
[156] "Lithuania"                                           
[157] "Lower middle income"                                 
[158] "Low income"                                          
[159] "Low & middle income"                                 
[160] "Luxembourg"                                          
[161] "Macao, China"                                        
[162] "Macao SAR, China"                                    
[163] "Macedonia, FYR"                                      
[164] "Madagascar"                                          
[165] "Malawi"                                              
[166] "Malaysia"                                            
[167] "Maldives"                                            
[168] "Mali"                                                
[169] "Malta"                                               
[170] "Marshall Islands"                                    
[171] "Mauritania"                                          
[172] "Mauritius"                                           
[173] "Mayotte"                                             
[174] "Mexico"                                              
[175] "Micronesia"                                          
[176] "Micronesia, Fed. Sts."                               
[177] "Middle East and North Africa (IFC classification)"   
[178] "Middle East & North Africa (all income levels)"      
[179] "Middle East & North Africa (developing only)"        
[180] "Middle income"                                       
[181] "Moldova"                                             
[182] "Monaco"                                              
[183] "Mongolia"                                            
[184] "Montenegro"                                          
[185] "Montserrat"                                          
[186] "Morocco"                                             
[187] "Mozambique"                                          
[188] "Myanmar"                                             
[189] "Namibia"                                             
[190] "Nauru"                                               
[191] "Nepal"                                               
[192] "Neth. Antilles"                                      
[193] "Netherlands"                                         
[194] "New Caledonia"                                       
[195] "New Zealand"                                         
[196] "Nicaragua"                                           
[197] "Niger"                                               
[198] "Nigeria"                                             
[199] "Niue"                                                
[200] "Norfolk Islands"                                     
[201] "North America"                                       
[202] "Northern Mariana Islands"                            
[203] "Northern Marianas"                                   
[204] "Norway"                                              
[205] "Not classified"                                      
[206] "OECD members"                                        
[207] "Oman"                                                
[208] "Other small states"                                  
[209] "Pacific island small states"                         
[210] "Pakistan"                                            
[211] "Palau"                                               
[212] "Palestinian Authority"                               
[213] "Panama"                                              
[214] "Papua New Guinea"                                    
[215] "Paraguay"                                            
[216] "Peru"                                                
[217] "Philippines"                                         
[218] "Poland"                                              
[219] "Portugal"                                            
[220] "Puerto Rico"                                         
[221] "Qatar"                                               
[222] "Romania"                                             
[223] "Russia"                                              
[224] "Russian Federation"                                  
[225] "Rwanda"                                              
[226] "Samoa"                                               
[227] "San Marino"                                          
[228] "Sao Tome and Principe"                               
[229] "Saudi Arabia"                                        
[230] "Senegal"                                             
[231] "Serbia"                                              
[232] "Seychelles"                                          
[233] "Sierra Leone"                                        
[234] "Singapore"                                           
[235] "Sint Maarten (Dutch part)"                           
[236] "Slovak Republic"                                     
[237] "Slovenia"                                            
[238] "Small states"                                        
[239] "Solomon Islands"                                     
[240] "Somalia"                                             
[241] "South Africa"                                        
[242] "South Asia"                                          
[243] "South Asia (IFC classification)"                     
[244] "South Sudan"                                         
[245] "Spain"                                               
[246] "Sri Lanka"                                           
[247] "St. Helena"                                          
[248] "St. Kitts and Nevis"                                 
[249] "St. Lucia"                                           
[250] "St. Martin (French part)"                            
[251] "S. Tomé & Principe"                                  
[252] "St. Pierre & Miquelon"                               
[253] "St. Vincent and the Grenadines"                      
[254] "Sub-Saharan Africa (all income levels)"              
[255] "Sub-Saharan Africa (developing only)"                
[256] "Sub-Saharan Africa (IFC classification)"             
[257] "Sudan"                                               
[258] "Suriname"                                            
[259] "Swaziland"                                           
[260] "Sweden"                                              
[261] "Switzerland"                                         
[262] "Syria"                                               
[263] "Syrian Arab Republic"                                
[264] "Taiwan, Province of China"                           
[265] "Tajikistan"                                          
[266] "Tanzania"                                            
[267] "TFYR Macedonia"                                      
[268] "Thailand"                                            
[269] "Timor-Leste"                                         
[270] "Togo"                                                
[271] "Tokelau"                                             
[272] "Tonga"                                               
[273] "Trinidad and Tobago"                                 
[274] "Trinidad & Tobago"                                   
[275] "Tunisia"                                             
[276] "Turkey"                                              
[277] "Turkmenistan"                                        
[278] "Turks and Caicos Islands"                            
[279] "Turks & Caicos Is."                                  
[280] "Tuvalu"                                              
[281] "Uganda"                                              
[282] "Ukraine"                                             
[283] "United Arab Emirates"                                
[284] "United Kingdom"                                      
[285] "United States"                                       
[286] "Upper middle income"                                 
[287] "Uruguay"                                             
[288] "Uzbekistan"                                          
[289] "Vanuatu"                                             
[290] "Vatican"                                             
[291] "Venezuela"                                           
[292] "Venezuela, RB"                                       
[293] "Viet Nam"                                            
[294] "Vietnam"                                             
[295] "Virgin Islands (U.S.)"                               
[296] "Virgin Islands (US)"                                 
[297] "Wallis and Futuna"                                   
[298] "West Bank and Gaza"                                  
[299] "World"                                               
[300] "Yemen"                                               
[301] "Yemen, Rep."                                         
[302] "Zambia"                                              
[303] "Zimbabwe"                                            

Edit: The datasets come from two sources. The names are consistent within each source but inconsistent between sources.
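
Because the names agree within each source, one way to find the problem cases is to list the names that occur in only one source and then ask base R's `agrep` for approximate candidates in the other. The following is a minimal sketch, not a complete solution: `src_a` and `src_b` are hypothetical vectors holding a few names from the list above, and the `max.distance` value is an assumed tolerance that would need tuning.

```r
# Hypothetical name vectors, one per source (values taken from the list above)
src_a <- c("Albania", "Bahamas", "Cape Verde", "Viet Nam", "Yemen")
src_b <- c("Albania", "Bahamas, The", "Cabo Verde", "Vietnam", "Yemen, Rep.")

# Names in source A that have no exact counterpart in source B
only_a <- setdiff(src_a, src_b)

# For each such name, list approximate candidates from source B.
# max.distance = 0.2 is an assumed tolerance, not a recommended value.
candidates <- lapply(only_a, function(nm) {
  agrep(nm, src_b, max.distance = 0.2, ignore.case = TRUE, value = TRUE)
})
names(candidates) <- only_a

print(candidates)
```

The candidate lists are meant to be reviewed by hand: `agrep` does approximate substring matching, so short names can pick up unrelated longer ones.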

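I should say up front that this is not a fuzzy matching solution. It is a "do the work once and never think about it again" solution.

In general, especially when I have to do this type of operation, I use the following steps. The process also works very well for company names within a particular industry (I use it for Canadian/US/European financial product manufacturers).

  • Normalize the strings (lowercase, strip whitespace, strip special characters)
  • Replace in alphabetical order
  • Adjust for non-matches

Let m be your vector of country names: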

    m <- as.character(m)                  # convert to character
    m <- gsub(".", " ", m, fixed = TRUE)  # replace "." (fixed = TRUE, otherwise "." matches every character)
    m <- gsub(",", " ", m, fixed = TRUE)  # replace commas (and so on)
    m <- tolower(m)                       # might fail if you have lots of special characters
    m <- gsub("^\\s+|\\s+$", "", m)       # strip leading and trailing whitespace
    

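Can you give an example of all the country names? It might be easier to match and rename them before merging. Try the [PDF] package for approximate string matching.

@Gary, there are fewer than 12 problematic cases, so I can probably do it by hand.

@user492922 You don't have to do it by hand, but looking at the differing strings may give you insight into an automation strategy.

Controlling the names yourself is safest.

stringdist would be a good solution. Or, if you are using R 3.1.0, we now have built-in fuzzy matching: ?agrep :)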
    
    # grep(pattern, x) returns the positions in m that match the pattern
    m[grep("afghanist", m)] <- "Afghanistan"
    m[grep("alban", m)] <- "Albania"
    ...
    m[grep("iran", m)] <- "Islamic Republic of Iran"
    ...
    m[grep("usa", m)] <- "United States of America"
    m[grep("yemen", m)] <- "Yemen"
    
    verbatims <- m
    
    # Unmatched = anything that does not start with a capital letter
    unmatched <- which(!substr(m, 1, 1) %in% LETTERS)
    
    # Keep the unmatched strings for inspection, then recode them by position
    unmatched_names <- m[unmatched]
    verbatims[unmatched] <- "Other" # Or however you need to recode it
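
Once the names have been recoded, the merge itself is an ordinary exact join. A minimal sketch of that last step follows. It assumes two hypothetical data frames `a` and `b` with a `country` column, placeholder values in `gdp` and `pop`, and a hypothetical hand-maintained lookup vector `canon` that maps each variant spelling to one canonical name. Names missing from `canon` are left unchanged.

```r
# Hypothetical lookup: variant spelling -> canonical name (maintained by hand)
canon <- c(
  "Iran (I.R.)"        = "Islamic Republic of Iran",
  "Iran, Islamic Rep." = "Islamic Republic of Iran",
  "Viet Nam"           = "Vietnam",
  "Yemen, Rep."        = "Yemen"
)

# Replace known variants; leave every other name as it is
recode_country <- function(x) {
  x <- as.character(x)
  hit <- x %in% names(canon)
  x[hit] <- unname(canon[x[hit]])
  x
}

# Hypothetical data from the two sources (numbers are placeholders)
a <- data.frame(country = c("Iran (I.R.)", "Viet Nam", "Yemen"),
                gdp = c(1, 2, 3), stringsAsFactors = FALSE)
b <- data.frame(country = c("Iran, Islamic Rep.", "Vietnam", "Yemen, Rep."),
                pop = c(10, 20, 30), stringsAsFactors = FALSE)

a$country <- recode_country(a$country)
b$country <- recode_country(b$country)

# Exact merge on the harmonised names; all = TRUE keeps unmatched rows visible
merged <- merge(a, b, by = "country", all = TRUE)
print(merged)
```

Using `all = TRUE` is a deliberate choice here: any name that still fails to match shows up as a row with `NA` values instead of silently disappearing, which makes the remaining problem cases easy to spot.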