"Raw" conversion from double UTF-8 to UTF-8 (or from UTF-8 to ANSI)
I am dealing with a legacy file that has been encoded twice using UTF-8. For example, the code point ε (U+03B5) should be encoded as ceb5, but is instead encoded as c38ec2b5 (c38e is the UTF-8 encoding of U+00CE, and c2b5 is the UTF-8 encoding of U+00B5).
The second encoding was performed on the assumption that the data was encoded in CP-1252.
To get back to the UTF-8 encoding, I used the following (seemingly wrong) command.
How can I tell iconv to perform only the mathematical UTF-8 conversion, without caring about the mapping?
echo -e -n '\xc3\x8f\xc2\x81' | iconv --from utf8 --to iso8859-1
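To make the double encoding from the question concrete, here is a minimal Ruby sketch that reproduces it for ε and then reverses it. The helper names `hex`, `double_encode` and `undo_double_encode` are made up for this illustration, and the sketch assumes the faulty step was "read UTF-8 bytes as Windows-1252, then encode to UTF-8 again".

```ruby
# Hypothetical helper: show a string's bytes as lowercase hex.
def hex(str)
  str.unpack1("H*")
end

# Simulates the faulty step: the UTF-8 bytes are misread as Windows-1252
# text and then encoded to UTF-8 a second time.
def double_encode(text)
  text.encode("UTF-8").b.force_encoding("Windows-1252").encode("UTF-8")
end

# Reverses it. This only works when every character of the doubled text
# maps back to a Windows-1252 byte; otherwise Ruby raises
# Encoding::UndefinedConversionError.
def undo_double_encode(text)
  text.encode("Windows-1252").force_encoding("UTF-8")
end

original = "ε"                       # U+03B5
doubled  = double_encode(original)

puts hex(original)                           # ceb5
puts hex(doubled)                            # c38ec2b5
puts undo_double_encode(doubled) == original # true
```

The reverse step is a character-set mapping, not a purely arithmetic one, which is why it can fail for bytes that Windows-1252 leaves unassigned.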
Windows-1252 differs from ISO-8859-1 in the 0x80-0x9F range. In your example, 0x81 is U+0081 in ISO 8859-1 but is invalid (unassigned) in Windows-1252.
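The difference can be checked with a short Ruby sketch on the four bytes from the echo command (c3 8f c2 81, which is ρ, U+03C1, encoded twice). The helper name `undo_via` is hypothetical; it tries to undo one layer of encoding for a given intermediate encoding and returns nil when a character has no byte in that encoding.

```ruby
# Hypothetical helper: undo one layer of double encoding, assuming
# `intermediate` was the single-byte encoding used for the second pass.
# Returns nil when some character has no byte in that encoding.
def undo_via(doubled, intermediate)
  doubled.encode(intermediate).force_encoding("UTF-8")
rescue Encoding::UndefinedConversionError
  nil
end

# The bytes from the echo command, tagged as UTF-8.
sample = [0xC3, 0x8F, 0xC2, 0x81].pack("C*").force_encoding("UTF-8")

latin1 = undo_via(sample, "ISO-8859-1")
cp1252 = undo_via(sample, "Windows-1252")

puts latin1.unpack1("H*")   # cf81, the UTF-8 encoding of ρ
puts cp1252.inspect         # nil, because U+0081 has no Windows-1252 byte
```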
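Check whether the rest of your data was misinterpreted as Windows-1252 or as ISO 8859-1. Normally, ISO 8859-1 is more common.
The following code uses Ruby's low-level encoding functions to force double-encoded UTF-8 (via CP1252) to be rewritten as normal UTF-8.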
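#!/usr/bin/env ruby
# Reads double-encoded UTF-8 (UTF-8 misread as CP1252, then encoded again)
# from STDIN and writes normal UTF-8 to STDOUT.
ec = Encoding::Converter.new(Encoding::UTF_8, Encoding::CP1252)
prev_b = nil
orig_bytes = STDIN.read.force_encoding(Encoding::BINARY).bytes.to_a
real_utf8_bytes = String.new  # empty binary (ASCII-8BIT) output buffer
orig_bytes.each do |b|
  b = b.chr
  # Feed one byte at a time; PARTIAL_INPUT lets the converter buffer an
  # incomplete multi-byte sequence until its remaining bytes arrive.
  situation = ec.primitive_convert(b.dup, real_utf8_bytes, nil, nil, Encoding::Converter::PARTIAL_INPUT)
  if situation == :undefined_conversion
    # The character has no CP1252 byte. Only accept it when the lead byte
    # was C2 (U+0080..U+00BF), where the trailing byte equals the raw byte.
    # "\xC2".b keeps the comparison binary under a UTF-8 source encoding.
    if prev_b != "\xC2".b
      $stderr.puts "ERROR found byte #{b.dump} in stream (prev #{(prev_b||'').dump})"
      exit 1
    end
    real_utf8_bytes.force_encoding(Encoding::BINARY)
    real_utf8_bytes << b
    real_utf8_bytes.force_encoding(Encoding::CP1252)
  end
  prev_b = b
end
real_utf8_bytes.force_encoding(Encoding::BINARY)
# write instead of puts, so no newline is appended to the data
$stdout.write(real_utf8_bytes)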
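This causes problems for other characters that cannot be mapped in Latin-1 but can be in cp1252, for example ‰ (U+2030). I am fairly sure the intermediate encoding is cp1252, but in this case it does not matter, because the UTF-8 conversion was blindly applied to some bytes.
Can you give an example in hex?
This cannot be converted with Latin-1: \xc3\xa1\xc2\xbc\xe2\x82\xac.
I see. Then the only thing I can suggest is writing some custom code to fix this.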
Usage of the Ruby script above:
./fix-double-utf8-encoding.rb < "$PROBLEMATIC_FILE" > "$CORRECTED_FILE"
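For the mixed case raised in the comments (bytes that only cp1252 can map, next to bytes that cp1252 leaves unassigned), a per-character fallback is one possible shape for the "custom code". The sketch below is an illustration, not a drop-in replacement for the script: `undo_double_utf8` is a hypothetical name, the input is assumed to be valid UTF-8, and the fallback treats any character below U+0100 that cp1252 cannot encode as its raw byte value (the ISO-8859-1 reading).

```ruby
# Hypothetical helper: undo double UTF-8 encoding one character at a time.
# 1. If Windows-1252 can encode the character, use that byte.
# 2. Otherwise, if the code point is below U+0100, use it as the raw byte.
# 3. Otherwise the text was not double-encoded this way, so raise.
def undo_double_utf8(doubled)
  bytes = doubled.each_char.map do |ch|
    begin
      ch.encode("Windows-1252").getbyte(0)
    rescue Encoding::UndefinedConversionError
      raise ArgumentError, "cannot undo #{ch.inspect}" if ch.ord > 0xFF
      ch.ord
    end
  end
  result = bytes.pack("C*").force_encoding("UTF-8")
  raise ArgumentError, "result is not valid UTF-8" unless result.valid_encoding?
  result
end

# The hex example from the comments: c3 a1 c2 bc e2 82 ac
mixed = [0xC3, 0xA1, 0xC2, 0xBC, 0xE2, 0x82, 0xAC].pack("C*").force_encoding("UTF-8")
puts undo_double_utf8(mixed).unpack1("H*")   # e1bc80, i.e. U+1F00

# The bytes from the echo command: c3 8f c2 81
rho = [0xC3, 0x8F, 0xC2, 0x81].pack("C*").force_encoding("UTF-8")
puts undo_double_utf8(rho).unpack1("H*")     # cf81, i.e. U+03C1
```

Working on whole characters keeps the code short, at the cost of reading the entire input into memory, as the script above also does.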