Perl: Fixing a file consisting of both UTF-8 and Windows-1252


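I have an application that produces a UTF-8 file, but some of its contents are incorrectly encoded: some characters are encoded as iso-8859-1 (aka iso-latin-1) or cp1252 (aka Windows-1252). Is there a way of recovering the original text?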

Yes.

Obviously, it is better to fix the program that creates the file, but that is not always possible. Two solutions follow.

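A line can contain a mix of encodings

Encoding::FixLatin provides a function named fix_latin, which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.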

$ perl -e'
   use Encoding::FixLatin qw( fix_latin );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = fix_latin($bytes);
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A

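Heuristics are employed, but they are fairly reliable. Only the following cases fail, because the misencoded bytes happen to form a valid UTF-8 sequence:

A byte in the range 0xC0 to 0xDF ([À-ß]) encoded using iso-8859-1 or cp1252, followed by one byte in the range 0x80 to 0xBF (in cp1252, the characters from € through ¿) encoded using iso-8859-1 or cp1252.
A byte in the range 0xE0 to 0xEF ([à-ï]) encoded using iso-8859-1 or cp1252, followed by two bytes in the range 0x80 to 0xBF encoded using iso-8859-1 or cp1252.
A byte in the range 0xF0 to 0xF7 ([ð-÷]) encoded using iso-8859-1 or cp1252, followed by three bytes in the range 0x80 to 0xBF encoded using iso-8859-1 or cp1252.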
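To make the first failure case concrete, here is a minimal sketch using only the core Encode module. It assumes a writer that meant the two cp1252 characters "Ã©"; the variable names are illustrative. The bytes are also valid UTF-8 for "é", so any heuristic that prefers UTF-8 picks the wrong reading.

```perl
use strict;
use warnings;
use Encode qw( decode );

# "Ã©" written as cp1252 is the byte pair C3 A9,
# which is also the valid UTF-8 encoding of "é" (U+00E9).
my $bytes = "\xC3\xA9";

# What the writer meant: two cp1252 characters.
my $as_cp1252 = decode("cp1252", $bytes);

# What a UTF-8-first heuristic sees: one character, and no error is raised.
my $as_utf8 = decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf("cp1252: U+%v04X\n", $as_cp1252);   # U+00C3.00A9
printf("UTF-8:  U+%v04X\n", $as_utf8);     # U+00E9
```

Such sequences are rare in natural text, which is why the heuristic works well in practice.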

The same result can be produced using the core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.

$ perl -e'
   use Encode qw( decode_utf8 encode_utf8 decode );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A

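Each line only uses one encoding

fix_latin works at the character level. If it is known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you can make the process even more reliable by checking whether the line is valid UTF-8.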

$ perl -e'
   use Encode qw( decode );
   for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
      if (!eval {
         $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
         1  # No exception
      }) {
         $text = decode("cp1252", $bytes);
      }

      printf("U+%v04X\n", $text);
   }
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A

Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:

The line is encoded using iso-8859-1 or cp1252.
At least one non-ASCII byte (0x80 or above) is present in the line.
All instances of a byte in the range 0xC0 to 0xDF ([À-ß]) are always immediately followed by one byte in the range 0x80 to 0xBF (in cp1252, the characters from € through ¿).
All instances of a byte in the range 0xE0 to 0xEF ([à-ï]) are always immediately followed by two bytes in the range 0x80 to 0xBF.
All instances of a byte in the range 0xF0 to 0xF7 ([ð-÷]) are always immediately followed by three bytes in the range 0x80 to 0xBF.
No byte in the range 0xF8 to 0xFF ([ø-ÿ]) is present in the line.
No byte in the range 0x80 to 0xBF is present in the line except where previously mentioned.

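Notes:

Encoding::FixLatin installs the command-line tool fix_latin to convert files, and it would be trivial to write one using the second approach.
fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.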
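As a rough sketch of what a converter built on the second (per-line) approach might look like, the following uses only the core Encode module. The names decode_line and fix_stream are hypothetical helpers introduced for illustration, not part of Encoding::FixLatin, and cp1252 is assumed as the fallback encoding.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: decode one line as UTF-8 if it is valid UTF-8,
# otherwise fall back to cp1252.
sub decode_line {
    my ($bytes) = @_;
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined($text) ? $text : decode("cp1252", $bytes);
}

# Hypothetical helper: repair every line read from $in and write UTF-8 to $out.
sub fix_stream {
    my ($in, $out) = @_;
    binmode($in);     # read raw bytes
    binmode($out);    # write the bytes produced by encode()
    while (my $line = <$in>) {
        print {$out} encode("UTF-8", decode_line($line));
    }
    return;
}

# Demo on in-memory handles: the first line is cp1252, the second is UTF-8.
my $input = "caf\xE9\n" . "caf\xC3\xA9\n";
open(my $in,  '<', \$input)     or die "Cannot open input: $!";
open(my $out, '>', \my $output) or die "Cannot open output: $!";
fix_stream($in, $out);
close($out);
print $output;    # both lines are now "café" encoded as UTF-8
```

A real filter would pass \*STDIN and \*STDOUT to fix_stream instead of the in-memory handles.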

UTF-8 mixed with Latin-1 (ISO-8859-1):
U+00D0.0020.0092.0020.0412.000A
U+0412.000A

UTF-8 mixed with CP-1252 (Windows-1252):
U+00D0.0020.2019.0020.0412.000A
U+0412.000A
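The two outputs above differ only in how the stray byte 0x92 is read: as U+0092 under iso-8859-1, or as U+2019 under cp1252. The sketch below reproduces both by making the fallback encoding a parameter of the character-level approach. The name decode_mixed is a hypothetical helper, and the code relies on Encode's documented behaviour that a code reference passed as CHECK to a decode call receives the ordinal of each malformed byte and returns octets in the encoding being decoded.

```perl
use strict;
use warnings;
use Encode qw( decode decode_utf8 encode_utf8 );

# Hypothetical helper: decode as UTF-8, reading each byte that is not
# part of a valid UTF-8 sequence with the given single-byte encoding.
sub decode_mixed {
    my ($bytes, $fallback) = @_;
    return decode_utf8($bytes, sub {
        # $_[0] is the ordinal of the invalid byte; return UTF-8 octets.
        encode_utf8(decode($fallback, chr($_[0])));
    });
}

for my $fallback ("iso-8859-1", "cp1252") {
    print "UTF-8 mixed with $fallback:\n";
    for my $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
        printf("U+%v04X\n", decode_mixed($bytes, $fallback));
    }
}
```

Choosing cp1252 as the fallback is usually the safer default when the data came from Windows software, since the two encodings only differ in the range 0x80 to 0x9F.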

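use strict;
use warnings;

# Repairs text in which UTF-8 was misread as cp1252 and re-encoded, possibly several times.
# This table maps each Unicode character that cp1252 assigns to a byte in 0x80..0x9F
# back to that original cp1252 byte.
# See e.g. http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
my %cp1252Encoding = (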
"\x{20ac}" => "\x80",
"\x{201a}" => "\x82",
"\x{0192}" => "\x83",
"\x{201e}" => "\x84",
"\x{2026}" => "\x85",
"\x{2020}" => "\x86",
"\x{2021}" => "\x87",
"\x{02c6}" => "\x88",
"\x{2030}" => "\x89",
"\x{0160}" => "\x8a",
"\x{2039}" => "\x8b",
"\x{0152}" => "\x8c",
"\x{017d}" => "\x8e",

"\x{2018}" => "\x91",
"\x{2019}" => "\x92",
"\x{201c}" => "\x93",
"\x{201d}" => "\x94",
"\x{2022}" => "\x95",
"\x{2013}" => "\x96",
"\x{2014}" => "\x97",
"\x{02dc}" => "\x98",
"\x{2122}" => "\x99",
"\x{0161}" => "\x9a",
"\x{203a}" => "\x9b",
"\x{0153}" => "\x9c",
"\x{017e}" => "\x9e",
"\x{0178}" => "\x9f",
);
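# Alternation matching any of the Unicode characters in the table above
my $re = join "|", keys %cp1252Encoding;
$re = qr/$re/;
# Inverse table: cp1252 byte => Unicode character
my %cp1252Decoding = reverse %cp1252Encoding;
my $cp1252Characters = join "|", keys %cp1252Decoding;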

sub decodeUtf8
{
    my ($str) = @_;

    $str =~ s/$re/ $cp1252Encoding{$&} /eg;
    utf8::decode($str);
    return $str;
}

sub fixString
{
    my ($str) = @_;

    my $r = qr/[\x80-\xBF]|$re/;

    my $current;
    do {
        $current = $str;

        # If this matches, the string is likely double-encoded UTF-8. Try to decode
        $str =~ s/[\xF0-\xF7]$r$r$r|[\xE0-\xEF]$r$r|[\xC0-\xDF]$r/ decodeUtf8($&) /eg;

    } while ($str ne $current);

    # decodes any possible left-over cp1252 codes to Unicode
    $str =~ s/$cp1252Characters/ $cp1252Decoding{$&} /eg;
    return $str;
}
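The fixString routine above undoes repeated UTF-8/cp1252 round trips with a hand-built table. For a single, clean layer of double encoding, the same idea can be sketched with the core Encode module instead. This is a different and simpler technique, not the code above, and undo_one_layer is a hypothetical name chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: undo one "UTF-8 misread as cp1252" layer.
# Returns the input unchanged when the round trip is not possible.
sub undo_one_layer {
    my ($text) = @_;
    my $fixed = eval {
        # Map the characters back to the bytes they were misread from...
        my $bytes = encode("cp1252", $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        # ...then decode those bytes as the UTF-8 they originally were.
        decode("UTF-8", $bytes, Encode::FB_CROAK());
    };
    return defined($fixed) ? $fixed : $text;
}

# U+2019 is E2 80 99 in UTF-8; misread as cp1252 it becomes U+00E2 U+20AC U+2122.
my $mojibake = "\x{E2}\x{20AC}\x{2122}";
printf("U+%v04X\n", undo_one_layer($mojibake));   # U+2019

# Text that is not double-encoded fails the strict UTF-8 check and is left alone.
printf("U+%v04X\n", undo_one_layer("\x{2019}"));  # U+2019
```

Unlike fixString, this sketch processes the whole string at once, so it only helps when the entire string went through the same misreading.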
