Encoding: "raw" conversion from double-encoded UTF-8 to UTF-8 (or from UTF-8 to ANSI)


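I am working with a legacy file that has been UTF-8 encoded twice. For example, the code point ε (U+03B5) should be encoded as ceb5, but is instead encoded as c38ec2b5 (c38e is the UTF-8 encoding of U+00CE, and c2b5 is the UTF-8 encoding of U+00B5).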
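To make the byte arithmetic concrete, here is a minimal sketch that reproduces the damage described in the question. The helper name `double_encode` is made up for this illustration; it assumes the second encoding step treated the UTF-8 bytes as a single-byte encoding (CP-1252 by default).

```ruby
# Simulate the accidental second encoding step.
# `intermediate` is the single-byte encoding the UTF-8 bytes were
# wrongly assumed to be in (an assumption of this sketch).
def double_encode(text, intermediate = Encoding::Windows_1252)
  # Copy the bytes, relabel them as the intermediate encoding (no bytes
  # change), then transcode that "text" to UTF-8 a second time.
  text.b.force_encoding(intermediate).encode(Encoding::UTF_8)
end

epsilon = "\u03B5"                                  # ε
puts epsilon.unpack("H*").first                     # ceb5
puts double_encode(epsilon).unpack("H*").first      # c38ec2b5
```

For ε the result is the same whether the intermediate encoding is CP-1252 or ISO-8859-1, because the two agree on the bytes 0xCE and 0xB5.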

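The second encoding was performed under the assumption that the data was encoded in CP-1252.

To get back to the UTF-8 encoding, I used the following (apparently wrong) command:

echo -e -n '\xc3\x8f\xc2\x81' | iconv --from utf8 --to iso8859-1

How can I tell iconv to perform only the mathematical UTF-8 conversion, without caring about the character mapping?
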
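Windows-1252 differs from ISO-8859-1 in the range 0x80-0x9F. For example, in your case 0x81 is U+0081 in ISO 8859-1, but it is invalid (unassigned) in Windows-1252.

Check whether the rest of your data was misinterpreted as Windows-1252 or as ISO 8859-1. In general, ISO 8859-1 is the more common of the two.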
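The difference can be checked directly in Ruby. The sketch below uses a hypothetical helper, `undo_layer`, that removes one UTF-8 layer through a given single-byte encoding and returns nil when some character has no mapping in that encoding. It relies on Ruby's Windows-1252 table leaving 0x81 unassigned.

```ruby
# Undo one layer of UTF-8: transcode to a single-byte encoding, then
# relabel the resulting bytes as UTF-8. Returns nil if a character has
# no mapping in `intermediate`.
def undo_layer(mojibake, intermediate)
  mojibake.encode(intermediate).force_encoding(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError
  nil
end

# The bytes from the question: U+00CF followed by U+0081.
sample = "\xC3\x8F\xC2\x81".b.force_encoding(Encoding::UTF_8)

p undo_layer(sample, Encoding::ISO_8859_1)    # bytes CF 81, i.e. "ρ" (U+03C1)
p undo_layer(sample, Encoding::Windows_1252)  # nil: U+0081 has no CP1252 byte
```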

The following code uses Ruby's low-level encoding functions to forcibly rewrite double-encoded UTF-8 (via CP1252) as normal UTF-8.

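#!/usr/bin/env ruby
# Reads double-encoded UTF-8 (UTF-8 -> CP1252 -> UTF-8) from STDIN and
# writes normal UTF-8 to STDOUT.

# Converter that undoes the outer UTF-8 layer.
ec = Encoding::Converter.new(Encoding::UTF_8, Encoding::CP1252)

# Lead byte of U+0080..U+00BF. It must be a binary string: Integer#chr
# returns binary strings for bytes >= 0x80, and strings with different
# encodings never compare equal unless both are pure ASCII.
C2 = "\xC2".b

prev_b = nil

orig_bytes = STDIN.read.force_encoding(Encoding::BINARY).bytes.to_a
real_utf8_bytes = String.new
real_utf8_bytes.force_encoding(Encoding::BINARY)

orig_bytes.each do |b|
    b = b.chr

    # Feed one byte at a time; converted output is appended to real_utf8_bytes.
    situation = ec.primitive_convert(b.dup, real_utf8_bytes, nil, nil, Encoding::Converter::PARTIAL_INPUT)

    if situation == :undefined_conversion
        # The character has no CP1252 mapping. This is only tolerated for
        # characters encoded as C2 xx; the raw byte xx is copied through.
        if prev_b != C2
            $stderr.puts "ERROR found byte #{b.dump} in stream (prev #{(prev_b||'').dump})"
            exit 1
        end

        real_utf8_bytes.force_encoding(Encoding::BINARY)
        real_utf8_bytes << b
        real_utf8_bytes.force_encoding(Encoding::CP1252)
    end

    prev_b = b
end

real_utf8_bytes.force_encoding(Encoding::BINARY)
# write instead of puts, so that no extra newline is appended to the data
$stdout.write(real_utf8_bytes)
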

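This causes problems with other characters that cannot be mapped in Latin-1 but can be in cp1252, for example ‰ (U+2030). I am fairly sure the intermediate encoding is cp1252, but in this case it does not matter, because the UTF-8 conversion was applied blindly to some of the bytes.

Can you give an example in hex?

This one cannot be converted with Latin-1: \xc3\xa1\xc2\xbc\xe2\x82\xac.

I see. Then the only thing I can suggest is writing some custom code to fix it.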
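One possible shape for such custom code is sketched below. It is an illustration rather than a tested tool: the function name `fix_double_utf8` is invented here, and the fallback rule (map U+0080..U+009F straight to the byte of the same value when CP1252 has no mapping) is an assumption about how the file was damaged.

```ruby
# Undo one UTF-8 layer character by character.
# Characters that exist in CP1252 become their CP1252 byte; characters in
# U+0080..U+009F without a CP1252 mapping become the byte of the same
# value (assumed to be how the "blind" conversion treated them).
def fix_double_utf8(mojibake)
  bytes = mojibake.each_char.map do |ch|
    begin
      ch.encode(Encoding::Windows_1252).getbyte(0)
    rescue Encoding::UndefinedConversionError
      cp = ch.ord
      unless (0x80..0x9F).cover?(cp)
        raise ArgumentError, format("U+%04X cannot come from a single byte", cp)
      end
      cp
    end
  end

  fixed = bytes.pack("C*").force_encoding(Encoding::UTF_8)
  raise ArgumentError, "result is not valid UTF-8" unless fixed.valid_encoding?
  fixed
end

# The hex example from the comments: U+00E1, U+00BC, U+20AC.
hard = "\xC3\xA1\xC2\xBC\xE2\x82\xAC".b.force_encoding(Encoding::UTF_8)
puts fix_double_utf8(hard).unpack("H*").first   # e1bc80, the encoding of U+1F00
```

The same function also handles the example from the question, where U+0081 has no CP1252 mapping and falls through to the raw-byte rule.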

Usage of the Ruby script above, saved as fix-double-utf8-encoding.rb:

cat "$PROBLEMATIC_FILE" | ./fix-double-utf8-encoding.rb > "$CORRECTED_FILE"