Confusing character encoding in R
I have two data frames that I cannot fully share because of data confidentiality. I am supposed to merge them using the LABEL variable, which exists in both datasets and contains some Unicode characters such as č, ž, etc. However, the merge produces more rows than expected. On further inspection, I found that in the first data frame the values containing Unicode characters are shown literally (for example, you can see the label VŽ in the data frame), while in the second data frame the labels are displayed through their Unicode codes, so you see V\u008e instead of VŽ. I used the stri_enc_mark function on both data frames. Here is the code and output for data frame 1:
stri_enc_mark(unique(data1$Label)) %>% cbind(unique(data1$Label))
Output:
.
[1,] "ASCII" "ZD"
[2,] "ASCII" "RI"
[3,] "ASCII" "PU"
[4,] "ASCII" "ZG"
[5,] "ASCII" "DU"
[6,] NA NA
[7,] "ASCII" "KR"
[8,] "ASCII" "DA"
[9,] "ASCII" "MA"
[10,] "ASCII" "ST"
[11,] "UTF-8" "VŽ"
[12,] "ASCII" "KA"
[13,] "ASCII" "SB"
[14,] "ASCII" "BM"
[15,] "ASCII" "VT"
[16,] "ASCII" "BJ"
[17,] "ASCII" "DJ"
[18,] "ASCII" "OS"
[19,] "ASCII" "SK"
[20,] "ASCII" "GS"
[21,] "UTF-8" "PŽ"
[22,] "UTF-8" "ŠI"
[23,] "UTF-8" "KŽ"
[24,] "ASCII" "Vk"
[25,] "UTF-8" "ŽU"
[26,] "ASCII" "KC"
[27,] "ASCII" "DE"
[28,] "ASCII" "NA"
[29,] "UTF-8" "ČK"
[30,] "ASCII" "KT"
[31,] "ASCII" "IM"
[32,] "ASCII" "VU"
[33,] "ASCII" "NG"
[34,] "ASCII" "VK"
[35,] "ASCII" "OG"
[36,] "ASCII" "SL"
For data frame 2:
stri_enc_mark(unique(data2$Label)) %>% cbind(unique(data2$Label))
Output:
.
[1,] "ASCII" "BJ"
[2,] "ASCII" "BM"
[3,] "UTF-8" "\xc8K"
[4,] "ASCII" "DA"
[5,] "ASCII" "DE"
[6,] "ASCII" "DJ"
[7,] "ASCII" "DU"
[8,] "ASCII" "GS"
[9,] "ASCII" "IM"
[10,] "ASCII" "KA"
[11,] "ASCII" "KC"
[12,] "ASCII" "KR"
[13,] "ASCII" "KT"
[14,] "UTF-8" "K\u008e"
[15,] "ASCII" "MA"
[16,] "ASCII" "NA"
[17,] "ASCII" "NG"
[18,] "ASCII" "OG"
[19,] "ASCII" "OS"
[20,] "ASCII" "PU"
[21,] "UTF-8" "P\u008e"
[22,] "ASCII" "RI"
[23,] "ASCII" "SB"
[24,] "ASCII" "SK"
[25,] "ASCII" "ST"
[26,] "UTF-8" "\u008aI"
[27,] "ASCII" "VK"
[28,] "ASCII" "VU"
[29,] "UTF-8" "V\u008e"
[30,] "ASCII" "ZD"
[31,] "ASCII" "ZG"
[32,] "UTF-8" "\u008eU"
[33,] "ASCII" "VT"
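As far as I can see, both the "literal" labels and the Unicode-code labels are marked as UTF-8. That surprises me, because if that is the case I cannot understand why one data frame shows VŽ and the other shows V\u008e.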
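A minimal sketch of how this can happen, assuming the labels hold exactly the code points shown in the output above: both strings are valid UTF-8, but they contain different characters, so stri_enc_mark reports the same encoding for both. The variable names `literal` and `coded` are made up for illustration.

```r
# Both strings are marked UTF-8, yet they hold different code points.
literal <- "V\u017d"  # V followed by U+017D (Ž)
coded   <- "V\u008e"  # V followed by U+008E (a C1 control character)

Encoding(c(literal, coded))  # both "UTF-8"

utf8ToInt(literal)  # 86 381
utf8ToInt(coded)    # 86 142

charToRaw(literal)  # 56 c5 bd
charToRaw(coded)    # 56 c2 8e
```

Because the underlying code points differ, the two labels never compare equal, which would explain the extra rows in the merge.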
I want to convert the encoded labels to literal labels, and I have tried the following:
data2 %>%
  mutate(Label = recode(Label, "\xc8K" = "ČK",
                        "K\u008e" = "KŽ",
                        "P\u008e" = "PŽ",
                        "\u008aI" = "ŠI",
                        "V\u008e" = "VŽ",
                        "\u008eU" = "ŽU"))
But this did not work, and I received the following warnings:
Warning messages:
1: unable to translate 'K<U+008E>' to native encoding
2: unable to translate 'P<U+008E>' to native encoding
3: unable to translate '<U+008A>I' to native encoding
4: unable to translate 'V<U+008E>' to native encoding
5: unable to translate '<U+008E>U' to native encoding
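So, how do I re-encode these values correctly?

\u008E is SINGLE SHIFT TWO, whereas Ž (LATIN CAPITAL LETTER Z WITH CARON) is Unicode \u017D. However, in ACP 1252 (US and Western Europe) and in ACP 1250 (Central Europe), Ž is \x8E.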
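Building on that observation, one possible repair is to treat every code point below 256 as a single Windows-1250 byte and decode it again. This is only a sketch, under the assumption that the labels in data2 were CP1250 bytes read in as individual code points. CP1250 rather than CP1252 is assumed because \xc8 is Č only in CP1250. `fix_cp1250` is a hypothetical helper, not a function from stringi or dplyr, and the encoding name "CP1250" has to be known to the platform's iconv.

```r
# Hypothetical helper: re-decode labels whose CP1250 bytes were stored
# as individual code points (e.g. "V\u008e" instead of "VŽ").
fix_cp1250 <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    if (validUTF8(s)) {
      cp <- utf8ToInt(s)
      # Leave strings that already contain real non-Latin-1 characters.
      if (any(cp > 255L)) return(s)
      bytes <- as.raw(cp)
    } else {
      # Not valid UTF-8 (e.g. "\xc8K"): assume the bytes are raw CP1250.
      bytes <- charToRaw(s)
    }
    iconv(list(bytes), from = "CP1250", to = "UTF-8")
  }, character(1), USE.NAMES = FALSE)
}

labels <- c("V\u008e", "\u008aI", "\xc8K", "ZD", NA)
fixed  <- fix_cp1250(labels)
fixed
```

Applied to the second data frame, this would be something like `data2$Label <- fix_cp1250(data2$Label)` before the merge. Note that a correctly encoded label made up only of characters in the U+0080 to U+00FF range would also be re-decoded, so the helper should be used only on the column known to be affected.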