String: reading foreign characters (R, encoding, character encoding, string comparison)

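I have a database of Premier League player names that I am reading into R (3.02), but I run into trouble with players whose names contain foreign characters (umlauts, accents, etc.). The code below illustrates this: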

PlayerData<-read.table("C:\\Users\\Documents\\Players.csv",quote=NULL, dec = ".",sep=",", stringsAsFactors=FALSE,header=TRUE,fill=TRUE,blank.lines.skip = TRUE)
Test<-PlayerData[c(33655:33656),] #names of the players here are "Cazorla" "Özil"

Test[Test$Player=="Cazorla",] #Outputs correct details
Test[Test$Player=="Ozil",] # Cannot find the data: '<0 rows> (or 0-length row.names)'

#Example of how the foreign character is treated:
substr("Özil",1,1)
[1] "Ã"
substr("Özil",1,2)
[1] "Ö"
substr("Özil",2,2)
[1] "
substr("Özil",2,3)
[1] "z
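
One likely reading of the output above is that the two UTF-8 bytes of "Ö" are being treated as two separate single-byte characters. The following is a minimal base R sketch of that effect, not a diagnosis of the actual file; the variable names x and y are illustrative only.

```r
# "Özil" written with an escape so the example does not depend on this file's encoding
x <- "\u00d6zil"

nchar(x, type = "chars")   # 4 characters when the string is treated as UTF-8
nchar(x, type = "bytes")   # 5 bytes, because "Ö" occupies two of them (c3 96)
charToRaw(x)

# Re-interpret the same bytes as Latin-1: every byte becomes its own character
y <- iconv(x, from = "latin1", to = "UTF-8")
nchar(y, type = "chars")       # 5
utf8ToInt(substr(y, 1, 1))     # 195, i.e. U+00C3, displayed as "Ã"
```

If this is what happens, the first "character" is "Ã" and the second is an unprintable one, which would match the substr() results in the question.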

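EDIT: The file you provided appears to use an encoding different from your system's native one.
An (experimental) encoding detection performed by the stri_enc_detect function from the stringi package gives: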
library('stringi')
PlayerDataRaw <- stri_read_raw('~/Desktop/PLAYERS.csv')
stri_enc_detect(PlayerDataRaw)
## [[1]]
## [[1]]$Encoding
## [1] "ISO-8859-1" "ISO-8859-2" "ISO-8859-9" "IBM424_rtl"
## 
## [[1]]$Language
## [1] "en" "ro" "tr" "he"
## 
## [[1]]$Confidence
## [1] 0.25 0.14 0.09 0.02
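
Since ISO-8859-1 (Latin-1) is the top candidate, one plausible next step is to tell read.table which encoding to assume. The sketch below is an assumption about how that could look, not the exact call used in this thread; it builds its own two-row Latin-1 file in a temporary location so that it can be run as-is.

```r
# Build a tiny Latin-1 file so the sketch is self-contained; 0xD6 is "Ö" in ISO-8859-1
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Player,Year\nCazorla,2013\n"),
           as.raw(0xD6), charToRaw("zil,2013\n")), tmp)

# encoding = "latin1" marks the strings as Latin-1 instead of assuming the native encoding
players <- read.table(tmp, sep = ",", header = TRUE, quote = "",
                      stringsAsFactors = FALSE, encoding = "latin1")

# Convert to UTF-8 so that later string functions see one character per letter
players$Player <- enc2utf8(players$Player)
nchar(players$Player, type = "chars")
players$Player == "\u00d6zil"

unlink(tmp)
```

An alternative is the fileEncoding argument, which re-encodes the input to the native encoding while reading; whether that can represent "Ö" depends on the locale.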

You can now access individual characters correctly, e.g. with the stri_sub function:
Test<-PlayerData[c(33655:33656),]
Test
##           T          Away H.A    Home  Player Year
## 33655 33654 CrystalPalace   1 Arsenal Cazorla 2013
## 33656 33655 CrystalPalace   1 Arsenal    Özil 2013

stri_sub(Test$Player, 1, length=1)
## [1] "C" "Ö"
stri_sub(Test$Player, 2, length=1)
## [1] "a" "z"

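You can also get rid of accented characters with iconv's transliterator ('ASCII//TRANSLIT', though I am not sure whether it is available on Windows), or with the very powerful transliterator in the stringi package, stri_trans_general (stringi version >= 0.2-2).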
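Thanks, everyone, for your help with this.
The strings are correctly encoded as UTF-8 (I added the argument to read.table and used iconv as suggested). That does not seem to be the problem.
I also tried the stri_sub() function, but that does not seem to work either (it still treats the accent as a separate character: stri_sub("Özil", 1, 3) = "Ãz").
However, thank you for pointing me to the stringi documentation, which gave me an idea for a solution that I am happy to use: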
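remove.accents<-function(s){
  oldrefs<-c(214,225) # code points of Ö, á
  newrefs<-c(79,97)   # code points of O, a

  codes<-utf8ToInt(enc2utf8(s))  # one integer code point per character
  hit<-match(codes,oldrefs)      # index into oldrefs, NA where the character is not listed
  # Swap whole code points only; gsub on the numbers would also rewrite digits inside larger code points
  codes[!is.na(hit)]<-newrefs[hit[!is.na(hit)]]
  intToUtf8(codes)
}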
> (remove.accents("Özil"))
[1] "Ozil"
> (remove.accents("Suárez"))
[1] "Suarez"
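
The function above takes a single string, whereas a column such as Test$Player is a vector. A minimal sketch of a vectorised variant follows; the name strip_accents and its from/to arguments are hypothetical, and the mapping covers only the two characters mentioned here.

```r
# Hypothetical helper: replaces listed accented characters in every element of x
strip_accents <- function(x,
                          from = c("\u00d6", "\u00e1"),  # Ö, á
                          to   = c("O", "a")) {
  from_codes <- vapply(from, utf8ToInt, integer(1))
  to_codes   <- vapply(to, utf8ToInt, integer(1))
  vapply(enc2utf8(x), function(s) {
    codes <- utf8ToInt(s)                 # code points of one string
    hit <- match(codes, from_codes)       # NA where no replacement is defined
    codes[!is.na(hit)] <- to_codes[hit[!is.na(hit)]]
    intToUtf8(codes)
  }, character(1), USE.NAMES = FALSE)
}

strip_accents(c("Cazorla", "\u00d6zil", "Su\u00e1rez"))
# [1] "Cazorla" "Ozil"    "Suarez"
```

Extending from and to by hand does not scale to every accented letter, which is where a general transliterator is the more practical choice.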

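Could you put two lines of the CSV somewhere on the web? Maybe iconv can help.
This needs a longer answer (most of which is beyond my expertise), but try converting everything to UTF-8: Test$Player
Thanks for pointing out this interesting string-processing task; I will soon start adding an automatic transliteration mechanism to stringi.
If stri_sub does not work properly, I am sure your data was not read correctly. What is the result of calling Encoding(Test$Player)?
I had included the encoding='UTF-8' argument in read.table before importing. Encoding(Test$Player) now gives me this output: "unknown" "UTF-8" (the "unknown" here is Cazorla; the second player, "UTF-8", is Özil). In addition, passing the UTF-8 argument means that Özil is now displayed as zil.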
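
The "unknown" result is expected for plain ASCII strings, which R does not mark with an encoding. The sketch below illustrates that, and also shows one possible reason, stated here as an assumption, why "Ö" might vanish: a Latin-1 byte declared as UTF-8 is not valid UTF-8. The names players and bad are illustrative only.

```r
players <- c("Cazorla", "\u00d6zil")
Encoding(players)      # "unknown" "UTF-8": ASCII-only strings carry no encoding mark

# The Latin-1 bytes for "Özil" (0xD6 followed by "zil"), with no encoding declared
bad <- rawToChar(as.raw(c(0xD6, 0x7A, 0x69, 0x6C)))
validUTF8(bad)         # FALSE: 0xD6 on its own is not a valid UTF-8 sequence
validUTF8(players)     # TRUE TRUE
```

validUTF8() requires R 3.3.0 or later, which is newer than the version mentioned in the question.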
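Hmm... could you make the file publicly available somewhere and post a link here on SO? Please see my updated answer for a (hopefully) complete fix.
Great answer: iconv's transliteration sorts this out very effectively.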
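# Accent-insensitive comparison with the stringi collator (strength=1 ignores accents and case).
# Current stringi releases accept collator options in stri_cmp_equiv rather than stri_cmp_eq.
stri_cmp_equiv("Özil", "Ozil", opts_collator=stri_opts_collator(strength=1))
## [1] TRUE

# Transliteration with iconv (support for //TRANSLIT varies by platform)
iconv(Test$Player, 'latin1', 'ASCII//TRANSLIT')
## [1] "Cazorla" "Ozil"

# Transliteration with stringi (version >= 0.2-2)
stri_trans_general(Test$Player, 'Latin-ASCII')
## [1] "Cazorla" "Ozil"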
