Ruby: pound sign £ causes PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xa3

Tags: ruby, postgresql, ruby-on-rails-4, encoding, utf-8

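When collecting information containing the pound sign "£" from an external source (such as my bank) via CSV files, and posting it to Postgres using ActiveRecord, I get the error:

PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xa3

0xa3 is the hex code for the £ sign. The usual advice is to explicitly specify UTF-8 on the string, while replacing invalid byte sequences:

string.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
This stops the error, but it is a lossy fix because it converts "£" to "?".


UTF-8 is capable of handling the "£" sign, so what can be done to fix the invalid byte sequence and keep the "£" sign?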

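I am answering my own question, thanks to Michael Fuhr, who explained that the UTF-8 byte sequence for the pound sign is 0xc2 0xa3. So all you have to do is find every occurrence of 0xa3 (163) and put 0xc2 (194) in front of it: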

array_bytes = string.bytes
fixed_bytes = []
# Look for £ signs stored as a bare 0xa3 (163)
array_bytes.each_with_index do |byte, i|
  # A 163 that is not preceded by 194 is an incorrectly sequenced £ sign,
  # so put the missing 0xc2 (194) in front of it
  if byte == 163 && (i == 0 || array_bytes[i - 1] != 194)
    fixed_bytes << 194
  end
  fixed_bytes << byte
end
# Convert bytes to 8-bit unsigned char, and UTF-8
string = fixed_bytes.pack('C*').force_encoding('UTF-8')
# Can now write string to model without the invalid byte sequence error
hash["description"] = string
Model.create!(hash)

I have had a lot of help on this Stack Overflow forum, and I hope I have helped someone else.
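
One caveat with patching bytes this way: 0xa3 also appears as the second byte of other valid UTF-8 characters (for example "ã" is 0xc3 0xa3), so inserting 0xc2 into text that is already valid UTF-8 would corrupt it. The following is only a sketch of a guarded version, under the assumption that the input is either valid UTF-8 already or single-byte text whose only problem is the pound sign. The helper name `repair_pound_bytes` and the sample strings are made up for illustration.

```ruby
# Hypothetical helper (the name is not from the original post).
# Prepends 0xC2 to every bare 0xA3, but only when the input is not
# already valid UTF-8, so existing multi-byte characters stay intact.
def repair_pound_bytes(raw)
  candidate = raw.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  bytes = raw.bytes
  fixed = []
  bytes.each_with_index do |byte, i|
    bare_pound = byte == 0xA3 && (i.zero? || bytes[i - 1] != 0xC2)
    fixed << 0xC2 if bare_pound
    fixed << byte
  end
  fixed.pack('C*').force_encoding('UTF-8')
end

# Made-up sample data: one line with bare 0xA3 bytes, one already valid.
latin1_line = "Card payment \xA312.50, refund \xA33.00".b
puts repair_pound_bytes(latin1_line)
puts repair_pound_bytes("S\u00E3o Paulo \u00A35")
```

The first call repairs both pound signs; the second returns the text unchanged because it was valid UTF-8 to begin with.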

0xa3 is the code point for the pound sign in Microsoft's cp1252 (and iso8859-1). Your data is probably not encoded as utf8.

You are right @wildplasser, the source file has Microsoft encoding; it is an HTML file downloaded with an .xls extension. Ruby treats it as UTF-8, and the pound sign is not preceded by the correct byte.
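
If the source really is cp1252, as the comments suggest, an alternative to patching individual bytes is to tell Ruby the real encoding and let it transcode, which handles every non-ASCII character and not only "£". This is a minimal sketch under that assumption. The helper name `to_utf8` and the sample line are hypothetical, and Ruby cannot detect the source encoding for you.

```ruby
# Hypothetical helper; the default source encoding is an assumption
# taken from the comments, not something detected from the data.
def to_utf8(raw, source_encoding = 'Windows-1252')
  candidate = raw.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  # Reinterpret the bytes in their real encoding, then transcode.
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Made-up sample line with a single-byte pound sign (0xA3).
description = to_utf8("Direct debit \xA345.00".b)
puts description
puts description.valid_encoding?
```

When the data comes straight from a file, the same conversion can probably be done at read time with something like `File.read(path, encoding: 'Windows-1252:UTF-8')`, where `path` stands in for your own file.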