Non-determinism in encoding when using open() with scalar and I/O layers in Perl

utf-8 character-encoding perl

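I have been fighting a bug in my Perl program for several hours. I am not sure whether I am doing something wrong or the interpreter is, but as far as I can tell the code behaves non-deterministically when it should be deterministic. It also shows the same behavior on the ancient Debian Lenny (Perl 5.10.0) and on a server just upgraded to Debian Wheezy (Perl 5.14.2). It boils down to this piece of Perl code: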

#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";

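It initializes the Perl 5 interpreter in strict mode with warnings enabled, with character strings (as opposed to byte strings) and with the named standard streams encoded in UTF8 (Perl's internal notion of UTF-8, which is not identical but very close; switching to full UTF-8 makes no difference). It then opens a file handle to an "in-memory file" (a scalar variable), prints a single two-byte UTF-8 character into it, and examines the variable after closing the handle.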

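The scalar variable now always has its UTF8 bit switched off. However, it sometimes contains a byte string (which utf8::decode() converts to a character string) and sometimes a character string that only needs its UTF8 bit flipped on (Encode::_utf8_on()).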
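To make the two operations mentioned above concrete, here is a minimal sketch that does not involve file handles or the bug itself. It assumes a Perl with the core Encode module available; the variable names are only illustrative. For well-formed input, validating with utf8::decode() and merely flipping the flag with Encode::_utf8_on() end up with the same character string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;            # source literals below are UTF-8 encoded
use Encode ();       # for Encode::_utf8_on (an internals-level helper)

my $chars = "š";     # character string: one character, U+0161

# Byte string: the same text as two UTF-8 octets (0xC5 0xA1), UTF8 flag off.
my $bytes = $chars;
utf8::encode($bytes);

# Route 1: validate the octets and convert them in place.
my $decoded = $bytes;
my $ok = utf8::decode($decoded);

# Route 2: only switch the flag on; nothing is validated.
my $flipped = $bytes;
Encode::_utf8_on($flipped);

printf "bytes:   length=%d flag=%d\n",
    length($bytes), utf8::is_utf8($bytes) ? 1 : 0;
printf "decoded: length=%d flag=%d ok=%d\n",
    length($decoded), utf8::is_utf8($decoded) ? 1 : 0, $ok ? 1 : 0;
printf "flipped: length=%d flag=%d\n",
    length($flipped), utf8::is_utf8($flipped) ? 1 : 0;
```

Because Encode::_utf8_on() skips validation, it is only safe when the octets are already known to be well-formed UTF-8.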

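When I run the code repeatedly (1000 times, via Bash), it prints Undecoded and Decoded with roughly equal frequency. When I change the string written to the "file", for example by appending a newline, Undecoded disappears. When utf8::decode succeeds and I try it on the same original string in a loop, it keeps succeeding within the same interpreter instance; if it fails, however, it keeps failing.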
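The in-process loop is not shown in the question, so the following is only a guess at what such an experiment might look like: it repeats the write-then-decode step several times inside one interpreter and tallies the outcomes. The round count and the %tally hash are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

my %tally = (Decoded => 0, Undecoded => 0);

for my $round (1 .. 10) {
    # Fresh in-memory file each round, written through the :utf8 layer.
    my $c = "";
    open(my $fh, ">:utf8", \$c) or die "Cannot open: $!\n";
    print {$fh} "š"             or die "Cannot print: $!\n";
    close($fh)                  or die "Cannot close: $!\n";

    # utf8::decode works in place and returns false on failure.
    $tally{ utf8::decode($c) ? "Decoded" : "Undecoded" }++;
}

print "$_: $tally{$_}\n" for sort keys %tally;
```

On an affected Perl, the description above suggests that a single run lands entirely in one of the two buckets, while separate runs may differ from each other.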

What explains the observed behavior? How can I use a file handle to a scalar variable together with character strings?

Bash playground:

for i in {1..1000}; do perl -we 'use strict; use utf8; binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; my $c = ""; open C, ">:utf8", \$c; print C "š"; close C; die "Does not happen\n" if utf8::is_utf8($c); print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";'; done | grep Undecoded | wc -l

For reference, and to be absolutely sure, I also made a version with pedantic error handling; the results are the same.

#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8" or die "Cannot binmode STDOUT\n";
binmode STDERR, ":utf8" or die "Cannot binmode STDERR\n";
my $c = "";
open C, ">:utf8", \$c or die "Cannot open: $!\n";
print C "š" or die "Cannot print: $!\n";
close C or die "Cannot close: $!\n";
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";

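Examining $c in detail reveals that the problem has nothing to do with the contents of $c or its internals, and that the result of decode accurately reflects what it did or did not do.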
$ for i in {1..2}; do
     perl -MDevel::Peek -we'
        use strict; use utf8;
        binmode STDOUT, ":utf8";
        binmode STDERR, ":utf8";
        my $c = "";
        open C, ">:utf8", \$c;
        print C "š";
        close C;
        die "Does not happen\n" if utf8::is_utf8($c);
        Dump($c);
        print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
        Dump($c)
     '
     echo
  done



SV = PV(0x17c8470) at 0x17de990
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x17d7a40 "\305\241"
  CUR = 2
  LEN = 16
Decoded
SV = PV(0x17c8470) at 0x17de990
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x17d7a40 "\305\241" [UTF8 "\x{161}"]
  CUR = 2
  LEN = 16

SV = PV(0x2d0fee0) at 0x2d26400
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2d1f4b0 "\305\241"
  CUR = 2
  LEN = 16
Undecoded
SV = PV(0x2d0fee0) at 0x2d26400
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2d1f4b0 "\305\241"
  CUR = 2
  LEN = 16

This is a bug in utf8::decode, but it was fixed in 5.16.3 or earlier, probably in 5.16.0, since it is still present in 5.14.2.
A suitable workaround is to use Encode's decode_utf8 instead.

I managed to track down the bug. The fix was first included in v5.15.8. Thanks to Corion on PerlMonks.org for helping me look for Perl_sv_utf8_decode in the Perl source, and to GitHub for making it possible to work with the repository without downloading loads of data (notably to discover that I needed to read Perl_sv_utf8_decode and to look at its blame history).
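The decode_utf8 workaround from Encode could look roughly like the sketch below. It assumes the same in-memory file setup as in the question; the lexical file handle and the names $buffer and $text are illustrative choices, not part of the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(decode_utf8);

# Write one character through a :utf8 layer into an in-memory file.
my $buffer = "";
open(my $fh, ">:utf8", \$buffer) or die "Cannot open: $!\n";
print {$fh} "š"                  or die "Cannot print: $!\n";
close($fh)                       or die "Cannot close: $!\n";

# $buffer holds the UTF-8 encoded octets (0xC5 0xA1), not characters.
# decode_utf8 returns a new character string and leaves $buffer untouched
# when called without a CHECK argument.
my $text = decode_utf8($buffer);

printf "octets: %d, characters: %d, code point: U+%04X\n",
    length($buffer), length($text), ord($text);
```

Unlike utf8::decode, which converts in place and reports success as a boolean, decode_utf8 returns the decoded string, so the octets and the characters live in separate variables.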