Perl: Fixing a file consisting of both UTF-8 and Windows-1252


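I have an application that produces a UTF-8 file, but some of its contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 (aka iso-latin-1) or cp1252 (aka Windows-1252). Is there a way of recovering the original text?

Yes. Obviously, it's better to fix the program that creates the file, but that's not always possible. What follows are two solutions.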

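A line can contain a mix of encodings

Encoding::FixLatin provides a function named fix_latin, which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.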

$ perl -e'
   use Encoding::FixLatin qw( fix_latin );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = fix_latin($bytes);
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
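
Heuristics are employed, but they are fairly reliable. Only the following cases will fail:

  • One of [À-ß] (bytes 0xC0-0xDF) encoded using iso-8859-1 or cp1252, followed by one character whose byte value is in the range 0x80-0xBF (€ through ¿ in cp1252), also encoded using iso-8859-1 or cp1252.

  • One of [à-ï] (bytes 0xE0-0xEF) encoded using iso-8859-1 or cp1252, followed by two characters from the 0x80-0xBF range encoded using iso-8859-1 or cp1252.

  • One of [ð-÷] (bytes 0xF0-0xF7) encoded using iso-8859-1 or cp1252, followed by three characters from the 0x80-0xBF range encoded using iso-8859-1 or cp1252.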
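
The failing cases listed above all come down to byte sequences that are valid in both encodings. As a minimal sketch using only the core Encode module, the following decodes the same two bytes both ways. The helper name both_readings is hypothetical and is not part of Encoding::FixLatin; a decoder that prefers valid UTF-8 would presumably pick the first reading even if the writer meant the second.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not part of Encoding::FixLatin): decode the same bytes
# as strict UTF-8 and as cp1252 so the two readings can be compared.
# Dies if the bytes are not valid UTF-8.
sub both_readings {
    my ($bytes) = @_;
    my $as_utf8   = decode("UTF-8",  $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    my $as_cp1252 = decode("cp1252", $bytes);
    return ($as_utf8, $as_cp1252);
}

# "Ã©" written in cp1252 is the byte pair C3 A9, which is also valid UTF-8.
my ($as_utf8, $as_cp1252) = both_readings("\xC3\xA9");
printf("as UTF-8:  U+%v04X\n", $as_utf8);    # U+00E9 (é)
printf("as cp1252: U+%v04X\n", $as_cp1252);  # U+00C3.00A9 (Ã©)
```

Nothing in the bytes themselves says which reading was intended, which is why such sequences cannot be repaired automatically.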

The same result can be produced using core modules, although I expect this to be somewhat slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.

$ perl -e'
   use Encode qw( decode_utf8 encode_utf8 decode );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
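
Each line uses only one encoding

fix_latin works at the character level. If it is known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you can make the process even more reliable by checking whether the line is valid UTF-8.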

$ perl -e'
   use Encode qw( decode );
   for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
      if (!eval {
         $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
         1  # No exception
      }) {
         $text = decode("cp1252", $bytes);
      }

      printf("U+%v04X\n", $text);
   }
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A

Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:

  • The line is encoded using iso-8859-1 or cp1252,

  • At least one character with a byte value in the range 0x80-0xFF (€ through ÿ in cp1252) is present in the line,

  • All instances of [À-ß] (bytes 0xC0-0xDF) are always followed by one character in the byte range 0x80-0xBF (€ through ¿ in cp1252),

  • All instances of [à-ï] (bytes 0xE0-0xEF) are always followed by two characters in the byte range 0x80-0xBF,

  • All instances of [ð-÷] (bytes 0xF0-0xF7) are always followed by three characters in the byte range 0x80-0xBF,

  • None of [øùúûüýþÿ] (bytes 0xF8-0xFF) are present in the line, and

  • No characters in the byte range 0x80-0xBF are present in the line except where previously mentioned.

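Notes:

  • Encoding::FixLatin installs the command-line tool fix_latin to convert files, and it would be trivial to write one using the second approach.
  • fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
  • The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.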
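
As a rough sketch of what a converter based on the second approach might look like, the code below decodes each line as strict UTF-8, falls back to cp1252 when that fails, and re-encodes the result as UTF-8. It uses only the core Encode module. The helper names decode_line and fix_lines are hypothetical, and the sketch assumes each line uses a single encoding.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: decode one line that is assumed to be entirely
# UTF-8 (or US-ASCII) or entirely cp1252.
sub decode_line {
    my ($bytes) = @_;
    # With FB_CROAK, decode() dies on malformed UTF-8; eval then yields undef.
    my $text = eval { decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    return defined $text ? $text : decode("cp1252", $bytes);
}

# Hypothetical helper: convert a whole byte string, line by line,
# into consistently encoded UTF-8 bytes.
sub fix_lines {
    my ($input) = @_;
    open(my $in, '<:raw', \$input) or die "Cannot open in-memory input: $!";
    my $output = '';
    while (my $line = <$in>) {
        $output .= encode("UTF-8", decode_line($line));
    }
    close($in);
    return $output;
}

# Possible use as a filter (left commented out so the sketch has no side effects):
#   binmode(STDIN);  binmode(STDOUT);
#   print encode("UTF-8", decode_line($_)) while <STDIN>;
```

Given the two sample lines used earlier, fix_lines leaves the valid UTF-8 line untouched and rewrites only the cp1252 line.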

    • This is one of the reasons I wrote Unicode::UTF8. With Unicode::UTF8, this is trivial using its fallback option.

      Output:

      UTF-8 mixed with Latin-1 (ISO-8859-1):
      U+00D0.0020.0092.0020.0412.000A
      U+0412.000A
      
      UTF-8 mixed with CP-1252 (Windows-1252):
      U+00D0.0020.2019.0020.0412.000A
      U+0412.000A
      

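      Unicode::UTF8 is written in C/XS and only invokes the callback/fallback when it encounters an ill-formed UTF-8 sequence.

      Recently I came across files containing a severe mix of UTF-8, CP1252, and UTF-8 that had been encoded, then interpreted as CP1252, then encoded as UTF-8 again, interpreted as CP1252 again, and so on.

      I wrote the code below, which worked well for me. It looks for typical UTF-8 byte sequences, even when some of the bytes are not UTF-8 bytes but the Unicode representation of the equivalent CP1252 byte.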

      my %cp1252Encoding = (
      # replacing the unicode code with the original CP1252 code
      # see e.g. http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
      "\x{20ac}" => "\x80",
      "\x{201a}" => "\x82",
      "\x{0192}" => "\x83",
      "\x{201e}" => "\x84",
      "\x{2026}" => "\x85",
      "\x{2020}" => "\x86",
      "\x{2021}" => "\x87",
      "\x{02c6}" => "\x88",
      "\x{2030}" => "\x89",
      "\x{0160}" => "\x8a",
      "\x{2039}" => "\x8b",
      "\x{0152}" => "\x8c",
      "\x{017d}" => "\x8e",
      
      "\x{2018}" => "\x91",
      "\x{2019}" => "\x92",
      "\x{201c}" => "\x93",
      "\x{201d}" => "\x94",
      "\x{2022}" => "\x95",
      "\x{2013}" => "\x96",
      "\x{2014}" => "\x97",
      "\x{02dc}" => "\x98",
      "\x{2122}" => "\x99",
      "\x{0161}" => "\x9a",
      "\x{203a}" => "\x9b",
      "\x{0153}" => "\x9c",
      "\x{017e}" => "\x9e",
      "\x{0178}" => "\x9f",
      );
      my $re = join "|", keys %cp1252Encoding;
      $re = qr/$re/;
      my %cp1252Decoding = reverse %cp1252Encoding;
      my $cp1252Characters = join "|", keys %cp1252Decoding;
      
      sub decodeUtf8
      {
          my ($str) = @_;
      
          $str =~ s/$re/ $cp1252Encoding{$&} /eg;
          utf8::decode($str);
          return $str;
      }
      
      sub fixString
      {
          my ($str) = @_;
      
          my $r = qr/[\x80-\xBF]|$re/;
      
          my $current;
          do {
              $current = $str;
      
              # If this matches, the string is likely double-encoded UTF-8. Try to decode
              $str =~ s/[\xF0-\xF7]$r$r$r|[\xE0-\xEF]$r$r|[\xC0-\xDF]$r/ decodeUtf8($&) /eg;
      
          } while ($str ne $current);
      
          # decodes any possible left-over cp1252 codes to Unicode
          $str =~ s/$cp1252Characters/ $cp1252Decoding{$&} /eg;
          return $str;
      }
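
To make the repeated misinterpretation described above concrete, here is a small sketch using only the core Encode module. It simulates rounds of "UTF-8 bytes read as cp1252" and then peels them off again. The helper names mangle_once, unmangle_once and peel_all are hypothetical. Unlike fixString, this sketch assumes the whole string was damaged uniformly, so it does not handle strings in which only some characters were double-encoded.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: simulate one round of damage, i.e. the UTF-8 bytes
# of the text are misread as cp1252.
sub mangle_once {
    my ($text) = @_;
    return decode("cp1252", encode("UTF-8", $text));
}

# Hypothetical helper: undo one round. Returns undef when the text cannot
# be mapped back to cp1252 bytes or those bytes are not valid UTF-8.
sub unmangle_once {
    my ($text) = @_;
    my $bytes = eval { encode("cp1252", $text, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    return undef unless defined $bytes;
    return eval { decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
}

# Hypothetical helper: peel off rounds of damage until nothing changes.
sub peel_all {
    my ($text) = @_;
    while (defined(my $next = unmangle_once($text))) {
        last if $next eq $text;    # e.g. pure ASCII: nothing left to undo
        $text = $next;
    }
    return $text;
}

my $damaged = mangle_once(mangle_once("\x{E9}"));    # "é" damaged twice
printf("damaged:  U+%v04X\n", $damaged);             # U+00C3.0192.00C2.00A9
printf("repaired: U+%v04X\n", peel_all($damaged));   # U+00E9
```

Note that the second round turns byte 0x83 into U+0192 (ƒ), which is the kind of "Unicode representation of a CP1252 byte" that the %cp1252Encoding table maps back.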
      

      This has limitations similar to those of ikegami's answer, except that the same limitations also apply to UTF-8 encoded strings.

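      (This is a common problem in Perl, since decoded text gets emitted without being encoded.)

      I don't think this is specific to Perl; Ruby and PHP have the same problem. Python 3 has different types for bytes and characters.

      Shouldn't the valid UTF-8 sequence be decoded as U+0412?

      @chansen, it shouldn't be decoded that way when you know the line is encoded using cp1252.

      +1. This is probably a little faster than Encode::decode_utf8 (since Encode::decode_utf8's callback must produce UTF-8), but it still uses a callback. Encoding::FixLatin doesn't, so it will surely be faster. (If it isn't, it can be made faster.) It's also simpler (fix_latin($bytes) vs decode_utf8($bytes, sub { decode('cp1252', $_[0]) })).

      @ikegami, Unicode::UTF8 thoroughly outperforms Encode's UTF-8 implementation. Nothing I said is specific to cp1252. The same holds for Latin-1.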