Ruby Unicode string to bytes and back to Unicode string is not invariant

ruby unicode

In Ruby 2.3.3, I enter the following code:

require 'scanf'

def hex2str(x)
  if x !~ /\A([0-9a-fA-F]{2})+\z/ then return nil; end
  x.scan(/.{2}/).map{|k| k.scanf("%x")[0].chr}.join
end

def str2hex(s); s.bytes.map {|k| "%02x" % k}.join; end

s="ü"
t=hex2str(str2hex(s))

p s
p t
p s.bytes
p t.bytes

I get the following output:

"ü"
"\xC3\xBC"
[195, 188]
[195, 188]

Why is s != hex2str(str2hex(s)), even though s.bytes == hex2str(str2hex(s)).bytes?

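It seems that some form of automatic normalization is happening somewhere. Is there a way to avoid it? Could you provide versions of hex2str and str2hex that do not interfere with the bytes in any way and that satisfy s == hex2str(str2hex(s))?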

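In Ruby, an instance of String holds a sequence of bytes together with the encoding that Ruby believes those bytes are in. For two strings to be equal, they basically need to have the same bytes and the same encoding (there are some complications when dealing with "ASCII-compatible" strings, but that is the essence of it).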

You can change this encoding tag without changing any of the bytes by using force_encoding.

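For example, the byte 0xC0 interpreted in the ISO-8859-1 encoding is À, but in ISO-8859-2 it is Ŕ. Clearly the two strings are not the same, even though they contain the same byte: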

# Use the optional argument to chr to specify the encoding to
# use when creating the string.
i1 = 0xC0.chr("ISO-8859-1")
i2 = 0xC0.chr("ISO-8859-2")

puts i1.bytes # => 192
puts i2.bytes # => 192

puts i1.encoding # => ISO-8859-1
puts i2.encoding # => ISO-8859-2

puts i1 == i2 # => false
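The same behaviour can be seen with force_encoding, which only rewrites the encoding tag. The following is a minimal sketch, assuming the source file is UTF-8 (Ruby's default); the variable names are made up for illustration. It also shows the "ASCII-compatible" complication mentioned earlier.

```ruby
utf8   = "ü"                                    # UTF-8 literal, bytes C3 BC
binary = utf8.dup.force_encoding("ASCII-8BIT")  # same bytes, different tag

p utf8.bytes == binary.bytes  # true
p utf8 == binary              # false

# Retagging the copy as UTF-8 makes the two strings equal again.
p utf8 == binary.force_encoding("UTF-8")  # true

# The "ASCII-compatible" complication: strings that hold only 7-bit ASCII
# bytes compare equal across ASCII-compatible encodings.
a = "abc"
b = "abc".dup.force_encoding("ISO-8859-1")
p a == b  # true
```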

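In your case, since you do not specify an encoding when calling chr, Ruby defaults to ASCII-8BIT for byte values of 0x80 and above, which basically means a binary encoding. The resulting string therefore has a different encoding from the original, and Ruby considers the two strings different.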
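As a rough illustration of that default, the sketch below calls chr without an argument on one value in the ASCII range and one above it, then joins the two bytes of "ü". The variable names are hypothetical, and the UTF-8 literal assumes a UTF-8 source file.

```ruby
# Integer#chr without an argument picks the encoding from the byte value.
low  = 0x41.chr   # value in the ASCII range
high = 0xC3.chr   # value in 0x80..0xFF

puts low.encoding    # US-ASCII
puts high.encoding   # ASCII-8BIT

# Joining one-byte binary strings yields a binary string, so the result
# differs from the UTF-8 literal even though the bytes match.
joined = [0xC3, 0xBC].map(&:chr).join
puts joined.encoding            # ASCII-8BIT
puts joined.bytes == "ü".bytes  # true
puts joined == "ü"              # false
```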

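Since you know what encoding the string is supposed to be in, you can tell Ruby by adding a call to force_encoding after the join in hex2str (here I am assuming the original string's encoding is UTF-8):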
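A minimal sketch of that change might look like the following. Two details are assumptions made for illustration and are not part of the question's code: the optional encoding parameter, and the use of String#hex in place of scanf("%x") so that the example runs without the scanf library.

```ruby
# hex2str with the encoding tag restored after the bytes are joined.
# The `encoding` parameter is a hypothetical addition; it defaults to UTF-8.
def hex2str(x, encoding = Encoding::UTF_8)
  return nil if x !~ /\A([0-9a-fA-F]{2})+\z/
  # Each two-digit pair becomes a one-byte string; the joined result is
  # binary, so force_encoding retags it without touching the bytes.
  x.scan(/.{2}/).map { |k| k.hex.chr }.join.force_encoding(encoding)
end

def str2hex(s)
  s.bytes.map { |k| "%02x" % k }.join
end

s = "ü"
t = hex2str(str2hex(s))

p t              # "ü"
puts t.encoding  # UTF-8
p s == t         # true
```

If the original strings are not always UTF-8, the caller would need to pass the intended encoding explicitly, since the hex digits alone carry no encoding information.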