Confusing character encodings in R

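I have two data frames that I cannot share in full because of data confidentiality. I need to merge them on the Label variable, which is present in both datasets and contains some Unicode characters such as č, ž, etc. However, the merge produces more rows than expected. On closer inspection, I found that in the first data frame the values containing Unicode characters are shown literally (for example, you see the label VŽ), whereas in the second data frame the labels are displayed through their Unicode escape codes, so you see V\u008e instead of VŽ. I ran the stri_enc_mark function on both data frames. Here is the code and output for data frame 1: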

stri_enc_mark(unique(data1$Label)) %>% cbind(unique(data1$Label))
Output:

      .           
 [1,] "ASCII" "ZD"
 [2,] "ASCII" "RI"
 [3,] "ASCII" "PU"
 [4,] "ASCII" "ZG"
 [5,] "ASCII" "DU"
 [6,] NA      NA  
 [7,] "ASCII" "KR"
 [8,] "ASCII" "DA"
 [9,] "ASCII" "MA"
[10,] "ASCII" "ST"
[11,] "UTF-8" "VŽ"
[12,] "ASCII" "KA"
[13,] "ASCII" "SB"
[14,] "ASCII" "BM"
[15,] "ASCII" "VT"
[16,] "ASCII" "BJ"
[17,] "ASCII" "DJ"
[18,] "ASCII" "OS"
[19,] "ASCII" "SK"
[20,] "ASCII" "GS"
[21,] "UTF-8" "PŽ"
[22,] "UTF-8" "ŠI"
[23,] "UTF-8" "KŽ"
[24,] "ASCII" "Vk"
[25,] "UTF-8" "ŽU"
[26,] "ASCII" "KC"
[27,] "ASCII" "DE"
[28,] "ASCII" "NA"
[29,] "UTF-8" "ČK"
[30,] "ASCII" "KT"
[31,] "ASCII" "IM"
[32,] "ASCII" "VU"
[33,] "ASCII" "NG"
[34,] "ASCII" "VK"
[35,] "ASCII" "OG"
[36,] "ASCII" "SL"
For data frame 2:

stri_enc_mark(unique(data2$Label)) %>% cbind(unique(data2$Label))
Output:

      .                
 [1,] "ASCII" "BJ"     
 [2,] "ASCII" "BM"     
 [3,] "UTF-8" "\xc8K"  
 [4,] "ASCII" "DA"     
 [5,] "ASCII" "DE"     
 [6,] "ASCII" "DJ"     
 [7,] "ASCII" "DU"     
 [8,] "ASCII" "GS"     
 [9,] "ASCII" "IM"     
[10,] "ASCII" "KA"     
[11,] "ASCII" "KC"     
[12,] "ASCII" "KR"     
[13,] "ASCII" "KT"     
[14,] "UTF-8" "K\u008e"
[15,] "ASCII" "MA"     
[16,] "ASCII" "NA"     
[17,] "ASCII" "NG"     
[18,] "ASCII" "OG"     
[19,] "ASCII" "OS"     
[20,] "ASCII" "PU"     
[21,] "UTF-8" "P\u008e"
[22,] "ASCII" "RI"     
[23,] "ASCII" "SB"     
[24,] "ASCII" "SK"     
[25,] "ASCII" "ST"     
[26,] "UTF-8" "\u008aI"
[27,] "ASCII" "VK"     
[28,] "ASCII" "VU"     
[29,] "UTF-8" "V\u008e"
[30,] "ASCII" "ZD"     
[31,] "ASCII" "ZG"     
[32,] "UTF-8" "\u008eU"
[33,] "ASCII" "VT"  
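As far as I can see, both the "literal" labels and the escaped labels are marked as UTF-8. That surprises me, because if so I cannot understand why one data frame shows VŽ and the other shows V\u008e.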
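A quick way to see how two strings can both be marked UTF-8 and still differ is to compare their code points rather than their encoding marks. The sketch below uses only base R; the variable names are illustrative and the two strings are typed in by hand to mirror the VŽ / V\u008e pair above.

```r
literal <- "V\u017d"  # the label as data frame 1 shows it: VŽ
escaped <- "V\u008e"  # the label as data frame 2 shows it

# utf8ToInt() returns the Unicode code points of a single string
cp_literal <- utf8ToInt(literal)
cp_escaped <- utf8ToInt(escaped)

print(cp_literal)          # 86 381  (381 = 0x17D)
print(cp_escaped)          # 86 142  (142 = 0x8E)
print(literal == escaped)  # FALSE, so the merge keys do not match
```

Both strings are valid UTF-8, but the second one holds a different character (U+008E) in place of Ž (U+017D), which would explain the unmatched rows in the merge.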

I would like to convert the escaped labels into literal labels, and I have tried the following:

data2 %>%
  mutate(Label = recode(Label, "\xc8K" = "ČK",
                             "K\u008e" = "KŽ",
                             "P\u008e" = "PŽ",
                             "\u008aI" = "ŠI",
                             "V\u008e" = "VŽ",
                             "\u008eU" = "ŽU"))
But this did not work, and I got the following warnings:

Warning messages:
1: unable to translate 'K<U+008E>' to native encoding 
2: unable to translate 'P<U+008E>' to native encoding 
3: unable to translate '<U+008A>I' to native encoding 
4: unable to translate 'V<U+008E>' to native encoding 
5: unable to translate '<U+008E>U' to native encoding 
So how do I recode these values correctly?

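\u008E is SINGLE SHIFT TWO (a C1 control character), whereas Ž (LATIN CAPITAL LETTER Z WITH CARON) is Unicode \u017D. However, in the Windows ANSI code pages 1252 (US and Western Europe) and 1250 (Central Europe), Ž is the byte \x8E.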
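If that is what happened, the damaged labels can be repaired by turning each code point back into the byte it came from and decoding those bytes as Windows-1250. The following is a minimal sketch under stated assumptions, not a definitive fix: fix_cp1250 is a hypothetical helper name, the source bytes are assumed to be CP1250 (in which \xc8 is Č), and iconv() on the platform is assumed to recognise the name "CP1250".

```r
# Hypothetical helper, not from the original post. Assumes the damaged labels
# started life as Windows-1250 bytes and that iconv() knows the name "CP1250".
fix_cp1250 <- function(x) {
  out <- x
  for (i in seq_along(x)) {
    s <- x[i]
    if (is.na(s)) next
    if (validUTF8(s)) {
      cp <- utf8ToInt(s)
      # Only touch strings containing C1 control characters (U+0080..U+009F),
      # which should not occur in genuine labels; "VŽ" and ASCII are skipped.
      if (!any(cp >= 0x80L & cp <= 0x9FL) || any(cp > 0xFFL)) next
      bytes <- as.raw(cp)    # each code point becomes the byte it came from
    } else {
      bytes <- charToRaw(s)  # e.g. "\xc8K": raw bytes that are not valid UTF-8
    }
    # Decode the recovered bytes as CP1250 and return a UTF-8 string
    out[i] <- iconv(list(bytes), from = "CP1250", to = "UTF-8")
  }
  out
}

# Hand-typed sample mirroring the labels in the question
labels <- c("ZD", "V\u008e", "\u008aI", "\xc8K", "V\u017d", NA)
fixed <- fix_cp1250(labels)
print(fixed)
```

Applied as data2$Label <- fix_cp1250(data2$Label) before the merge, this should make the keys in both data frames comparable. If the original file is still available, re-reading it with the encoding declared (for example the fileEncoding argument of read.csv) is probably the cleaner route.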