Perl benchmark of utf8 file reading: explanation of the differences

Given the following code:

#!/usr/bin/env perl

use 5.016;
use warnings;
use autodie;
use Path::Tiny;
use Encode;
use Benchmark qw(:all);

my $cnt = 10_000;
my $utf = 'utf8.txt';

my $res = timethese($cnt, {
    'open-UTF-8' => sub {
        open my $fhu, '<:encoding(UTF-8)', $utf;
        my $stru = do { local $/; <$fhu>};
        close $fhu;
    },
    'open-utf8' => sub {
        open my $fhu, '<:utf8', $utf;
        my $stru = do { local $/; <$fhu>};
        close $fhu;
    },
    'decode-utf8' => sub {
        open my $fhu, '<', $utf;
        my $stru = decode('utf8', do { local $/; <$fhu>});
        close $fhu;
    },
    'decode-UTF-8' => sub {
        open my $fhu, '<', $utf;
        my $stru = decode('UTF-8', do { local $/; <$fhu>});
        close $fhu;
    },
    'ptiny' => sub {
        my $stru = path($utf)->slurp_utf8;
    },
});
cmpthese $res;

Running the above on my notebook gives:

Benchmark: timing 10000 iterations of decode-UTF-8, decode-utf8, open-UTF-8, open-utf8, ptiny...
decode-UTF-8: 47 wallclock secs (46.83 usr +  0.87 sys = 47.70 CPU) @ 209.64/s (n=10000)
 decode-utf8: 48 wallclock secs (46.62 usr +  0.90 sys = 47.52 CPU) @ 210.44/s (n=10000)
  open-UTF-8: 60 wallclock secs (57.82 usr +  1.20 sys = 59.02 CPU) @ 169.43/s (n=10000)
   open-utf8:  7 wallclock secs ( 6.57 usr +  0.70 sys =  7.27 CPU) @ 1375.52/s (n=10000)
       ptiny:  7 wallclock secs ( 5.98 usr +  0.52 sys =  6.50 CPU) @ 1538.46/s (n=10000)
               Rate  open-UTF-8 decode-UTF-8 decode-utf8   open-utf8       ptiny
open-UTF-8    169/s          --         -19%        -19%        -88%        -89%
decode-UTF-8  210/s         24%           --         -0%        -85%        -86%
decode-utf8   210/s         24%           0%          --        -85%        -86%
open-utf8    1376/s        712%         556%        554%          --        -11%
ptiny        1538/s        808%         634%        631%         12%          --
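
The results surprised me, so my questions are:

  • First, is anything wrong with the code above?

If the code is OK:

  • Why is there such a huge difference between the explicit UTF-8 and the relaxed utf8, but only at the I/O layer level?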
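
:utf8 PerlIO layer

The :utf8 PerlIO layer is a pseudo layer: it is just a flag on the PerlIO handle that the ops check. The behavior depends on the op used:

read(), sysread() and recv():

No validation of the UTF-8 sequences is performed. The sequences are only walked in order to count how many of them were read.

readline():

Validates that the octets read are valid in the "utf8" category and issues a warning if they contain malformed UTF-8. The validation procedure used is the same as the one used in […].

The ":utf8" flag/layer should not be used for reading unless you are willing to accept malformed UTF-X, which can lead to bugs or segfaults.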
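
The readline() behavior described above can be observed with a small sketch. It assumes an in-memory filehandle and a made-up malformed input; the variable names and the sample octets are illustrative and not part of the original benchmark. Warnings are captured rather than printed:

```perl
use strict;
use warnings;

# Hypothetical malformed input: a lone 0xE9 octet is Latin-1, not UTF-8.
my $octets = "caf\xE9\n";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

open my $fh, '<:utf8', \$octets
    or die "Could not open an in-memory :utf8 handle: $!";
my $line = readline $fh;    # readline() checks the octets it has read
close $fh;

printf "line returned: %s\n", defined $line ? 'yes' : 'no';
printf "warnings seen: %d\n", scalar @warnings;
```

If this behaves as described, the line is still returned and only a warning signals the malformed sequence, which is why the flag is risky for input that has not been validated elsewhere.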

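:encoding PerlIO layer

The :encoding layer is provided by PerlIO::encoding. It implements an incremental decoder framework for subclasses of […]. The Perl/XS subclass is invoked through a method call for each incremental decode, and buffers are copied between the layer and the subclass.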
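
One way to see the difference between a flag and a real layer is to list what sits on each handle with PerlIO::get_layers. This is a minimal sketch that assumes in-memory handles; the sample data and variable names are illustrative:

```perl
use strict;
use warnings;

my $octets = "plain ASCII\n";    # hypothetical sample data

open my $fh_flag, '<:utf8', \$octets
    or die "Could not open a :utf8 handle: $!";
open my $fh_layer, '<:encoding(UTF-8)', \$octets
    or die "Could not open an :encoding handle: $!";

# get_layers() lists real layers and reports the utf8 flag as a final entry.
my @flag_layers  = PerlIO::get_layers($fh_flag);
my @layer_layers = PerlIO::get_layers($fh_layer);

print ":utf8            => @flag_layers\n";
print ":encoding(UTF-8) => @layer_layers\n";
```

The handle opened with :utf8 should show only the utf8 flag on top of its base layer, while the :encoding(UTF-8) handle should additionally list an encoding(...) entry, which is the layer that does the per-buffer method calls and copying.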

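utf8 vs UTF-8

The utf8 encoding form is a superset of the UTF-8 encoding form specified by the Unicode Consortium. It accepts code points that are ill-formed in UTF-8, such as surrogates and code points above U+10FFFF. It should be avoided even for data that is believed to be Unicode. The utf8 encoding should not be used for interchange; it is Perl's internal encoding. Use the UTF-8 encoding form instead.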
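
The difference between the two encoding forms can be checked directly with Encode. The following sketch assumes the octet sequence ED A0 80 (the surrogate U+D800 encoded as if it were UTF-8) as sample input; try_decode is a hypothetical helper, not part of any module:

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Octets that would encode the surrogate U+D800 (sample input for illustration).
my $octets = "\xED\xA0\x80";

# Hypothetical helper: report whether the named encoding accepts the octets.
sub try_decode {
    my ($encoding) = @_;
    my $copy = $octets;    # decode() may modify its input when CHECK is set
    my $ok = eval { decode($encoding, $copy, Encode::FB_CROAK); 1 };
    return $ok ? 'accepted' : 'rejected';
}

for my $encoding (qw(utf8 UTF-8)) {
    printf "%-5s resolves to %-12s and the surrogate is %s\n",
        $encoding, find_encoding($encoding)->name, try_decode($encoding);
}
```

The names printed by find_encoding show that Encode treats the two spellings as distinct encodings, and only the strict one should refuse the encoded surrogate.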

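Benchmarks for slurping UTF-8 encoded files

Modules used in the benchmark (as loaded by the script below): Encode, Unicode::UTF8, PerlIO::encoding and PerlIO::utf8_strict.

The code below is also available online.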


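Here are my results (Ubuntu 14.10, Lenovo Edge 540, Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz): […]. Input file: […]

@HåkonHægland similar. There is also a huge difference between lazy/strict at the layer level and lazy/strict decode. Path::Tiny is surprising.

@jm666, the reason is probably that Håkon Hægland doesn't have Unicode::UTF8 installed.

Just use warnings qw(FATAL utf8) and the regular :utf8 layer is perfectly fine. You may also want to control the subwarnings surrogate, non_unicode and nonchar separately, which isn't even possible with the slow module.

@tchrist, no, the subwarnings only apply to output; they have no effect on input read with readline(). The utf8 encoding of […] accepts ill-formed UTF-8, such as encoded surrogates. Try the following: $ perl -E 'open my $fh, "…

@tchrist, by the way, […] is not slow, and it implements support for all of the subwarnings surrogate, non_unicode and nonchar.

Your numbers are impressive. For what it's worth, I meant :encoding(UTF-8), which is notoriously slow and doesn't respond to the subwarnings either.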
    
    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use Benchmark     qw[];
    use Config        qw[%Config];
    use IO::Dir       qw[];
    use IO::File      qw[SEEK_SET];
    
    use Encode              qw[];
    use Unicode::UTF8       qw[];
    use PerlIO::encoding    qw[];
    use PerlIO::utf8_strict qw[];
    
    # https://github.com/chansen/p5-unicode-utf8/tree/master/benchmarks/data
    my $dir  = 'benchmarks/data';
    my @docs = do {
        my $d = IO::Dir->new($dir)
          or die qq/Could not open directory '$dir': $!/;
        sort grep { /^[a-z]{2}\.txt/ } $d->read;
    };
    
    printf "perl:                %s (%s %s)\n", $], @Config{qw[osname osvers]};
    printf "Encode:              %s\n", Encode->VERSION;
    printf "Unicode::UTF8:       %s\n", Unicode::UTF8->VERSION;
    printf "PerlIO::encoding:    %s\n", PerlIO::encoding->VERSION;
    printf "PerlIO::utf8_strict: %s\n", PerlIO::utf8_strict->VERSION;
    
    foreach my $doc (@docs) {
    
        my $octets = do {
            open my $fh, '<:raw', "$dir/$doc" or die $!;
            local $/; <$fh>;
        };
    
        my $string = Unicode::UTF8::decode_utf8($octets);
    
        my @ranges = (
            [    0x00,     0x7F, qr/[\x{00}-\x{7F}]/        ],
            [    0x80,    0x7FF, qr/[\x{80}-\x{7FF}]/       ],
            [   0x800,   0xFFFF, qr/[\x{800}-\x{FFFF}]/     ],
            [ 0x10000, 0x10FFFF, qr/[\x{10000}-\x{10FFFF}]/ ],
        );
    
        my @out;
        foreach my $r (@ranges) {
            my ($start, $end, $regexp) = @$r;
            my $count = () = $string =~ m/$regexp/g;
            push @out, sprintf "U+%.4X..U+%.4X: %d", $start, $end, $count
              if $count;
        }
    
        printf "\n\n%s: Size: %d Code points: %d (%s)\n",
          $doc, length $octets, length $string, join ' ', @out;
    
        open my $fh_raw, '<:raw', \$octets 
          or die qq/Could not open a :raw fh: '$!'/;
        open my $fh_encoding, '<:encoding(UTF-8)', \$octets
          or die qq/Could not open a :encoding fh: '$!'/;
        open my $fh_utf8_strict, '<:utf8_strict', \$octets 
          or die qq/Could not open a :utf8_strict fh: '$!'/;
    
        Benchmark::cmpthese( -10, {
            ':encoding(UTF-8)' => sub {
                my $data = do { local $/; <$fh_encoding> };
                seek($fh_encoding, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
            ':utf8_strict' => sub {
                my $data = do { local $/; <$fh_utf8_strict> };
                seek($fh_utf8_strict, 0, SEEK_SET)
                  or die qq/Could not rewind fh: '$!'/;
            },
            'Encode' => sub {
                my $data = Encode::decode('UTF-8', do { local $/; scalar <$fh_raw> }, Encode::FB_CROAK|Encode::LEAVE_SRC);
                seek($fh_raw, 0, SEEK_SET)
                 or die qq/Could not rewind fh: '$!'/;
            },        
            'Unicode::UTF8' => sub {
                my $data = Unicode::UTF8::decode_utf8(do { local $/; scalar <$fh_raw> });
                seek($fh_raw, 0, SEEK_SET)
                 or die qq/Could not rewind fh: '$!'/;
            },
        });
    }
    
    $ perl benchmarks/slurp.pl 
    perl:                5.023001 (darwin 14.4.0)
    Encode:              2.75
    Unicode::UTF8:       0.60
    PerlIO::encoding:    0.21
    PerlIO::utf8_strict: 0.006
    
    
    ar.txt: Size: 25918 Code points: 14308 (U+0000..U+007F: 2698 U+0080..U+07FF: 11610)
                        Rate :encoding(UTF-8)      Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  3058/s               --        -19%         -73%          -87%
    Encode            3754/s              23%          --         -67%          -84%
    :utf8_strict     11361/s             272%        203%           --          -52%
    Unicode::UTF8    23620/s             672%        529%         108%            --
    
    
    el.txt: Size: 103974 Code points: 58748 (U+0000..U+007F: 13560 U+0080..U+07FF: 45150 U+0800..U+FFFF: 38)
                       Rate :encoding(UTF-8)       Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  780/s               --         -19%         -73%          -86%
    Encode            958/s              23%           --         -66%          -83%
    :utf8_strict     2855/s             266%         198%           --          -48%
    Unicode::UTF8    5498/s             605%         474%          93%            --
    
    
    en.txt: Size: 82171 Code points: 82055 (U+0000..U+007F: 81988 U+0080..U+07FF: 18 U+0800..U+FFFF: 49)
                        Rate :encoding(UTF-8)      Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  1111/s               --        -16%         -90%          -96%
    Encode            1327/s              19%          --         -88%          -95%
    :utf8_strict     11446/s             931%        763%           --          -60%
    Unicode::UTF8    28635/s            2478%       2058%         150%            --
    
    
    ja.txt: Size: 180109 Code points: 64655 (U+0000..U+007F: 6913 U+0080..U+07FF: 30 U+0800..U+FFFF: 57712)
                       Rate :encoding(UTF-8)       Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  553/s               --         -27%         -72%          -91%
    Encode            757/s              37%           --         -61%          -87%
    :utf8_strict     1960/s             254%         159%           --          -67%
    Unicode::UTF8    5915/s             970%         682%         202%            --
    
    
    lv.txt: Size: 138397 Code points: 127160 (U+0000..U+007F: 117031 U+0080..U+07FF: 9021 U+0800..U+FFFF: 1108)
                       Rate :encoding(UTF-8)       Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  605/s               --         -19%         -80%          -91%
    Encode            746/s              23%           --         -75%          -88%
    :utf8_strict     3043/s             403%         308%           --          -53%
    Unicode::UTF8    6453/s             967%         765%         112%            --
    
    
    ru.txt: Size: 151633 Code points: 85266 (U+0000..U+007F: 19263 U+0080..U+07FF: 65639 U+0800..U+FFFF: 364)
                       Rate :encoding(UTF-8)       Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  542/s               --         -19%         -73%          -86%
    Encode            673/s              24%           --         -66%          -83%
    :utf8_strict     2001/s             269%         197%           --          -50%
    Unicode::UTF8    4010/s             640%         496%         100%            --
    
    
    sv.txt: Size: 96449 Code points: 92894 (U+0000..U+007F: 89510 U+0080..U+07FF: 3213 U+0800..U+FFFF: 171)
                        Rate :encoding(UTF-8)      Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)   923/s               --        -17%         -85%          -93%
    Encode            1109/s              20%          --         -82%          -92%
    :utf8_strict      5998/s             550%        441%           --          -56%
    Unicode::UTF8    13604/s            1374%       1127%         127%            --
    
    
    zh.txt: Size: 62891 Code points: 24519 (U+0000..U+007F: 5317 U+0080..U+07FF: 32 U+0800..U+FFFF: 19170)
                        Rate :encoding(UTF-8)      Encode :utf8_strict Unicode::UTF8
    :encoding(UTF-8)  1630/s               --        -23%         -75%          -87%
    Encode            2104/s              29%          --         -68%          -83%
    :utf8_strict      6549/s             302%        211%           --          -48%
    Unicode::UTF8    12630/s             675%        500%          93%            --