Perl benchmark of reading UTF-8 files: explaining the differences
Tags: perl, utf-8, io, encode
I have the following code:
#!/usr/bin/env perl
use 5.016;
use warnings;
use autodie;
use Path::Tiny;
use Encode;
use Benchmark qw(:all);

my $cnt = 10_000;
my $utf = 'utf8.txt';

my $res = timethese($cnt, {
    'open-UTF-8' => sub {
        open my $fhu, '<:encoding(UTF-8)', $utf;
        my $stru = do { local $/; <$fhu> };
        close $fhu;
    },
    'open-utf8' => sub {
        open my $fhu, '<:utf8', $utf;
        my $stru = do { local $/; <$fhu> };
        close $fhu;
    },
    'decode-utf8' => sub {
        open my $fhu, '<', $utf;
        my $stru = decode('utf8', do { local $/; <$fhu> });
        close $fhu;
    },
    'decode-UTF-8' => sub {
        open my $fhu, '<', $utf;
        my $stru = decode('UTF-8', do { local $/; <$fhu> });
        close $fhu;
    },
    'ptiny' => sub {
        my $stru = path($utf)->slurp_utf8;
    },
});
cmpthese $res;
Running the above on my notebook gives:
Benchmark: timing 10000 iterations of decode-UTF-8, decode-utf8, open-UTF-8, open-utf8, ptiny...
decode-UTF-8: 47 wallclock secs (46.83 usr + 0.87 sys = 47.70 CPU) @ 209.64/s (n=10000)
decode-utf8: 48 wallclock secs (46.62 usr + 0.90 sys = 47.52 CPU) @ 210.44/s (n=10000)
open-UTF-8: 60 wallclock secs (57.82 usr + 1.20 sys = 59.02 CPU) @ 169.43/s (n=10000)
open-utf8: 7 wallclock secs ( 6.57 usr + 0.70 sys = 7.27 CPU) @ 1375.52/s (n=10000)
ptiny: 7 wallclock secs ( 5.98 usr + 0.52 sys = 6.50 CPU) @ 1538.46/s (n=10000)
               Rate open-UTF-8 decode-UTF-8 decode-utf8 open-utf8 ptiny
open-UTF-8    169/s         --         -19%        -19%      -88%  -89%
decode-UTF-8  210/s        24%           --         -0%      -85%  -86%
decode-utf8   210/s        24%           0%          --      -85%  -86%
open-utf8    1376/s       712%         556%        554%        --  -11%
ptiny        1538/s       808%         634%        631%       12%    --
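
For readers unfamiliar with the comparison table: each cell gives the row's rate relative to the column's rate, as a percentage. The sketch below recomputes a few cells from the rates printed in the timing lines. The helper name relative_percent is hypothetical, and the formula is what the numbers in the table imply, not a copy of Benchmark's source.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Rates (iterations per second) copied from the timing lines above.
my %rate = (
    'open-UTF-8'   => 169.43,
    'decode-UTF-8' => 209.64,
    'decode-utf8'  => 210.44,
    'open-utf8'    => 1375.52,
    'ptiny'        => 1538.46,
);

# Hypothetical helper: how much faster (positive) or slower (negative)
# $row is than $col, rounded to a whole percentage.
sub relative_percent {
    my ($row, $col) = @_;
    return sprintf '%.0f%%', 100 * $rate{$row} / $rate{$col} - 100;
}

for my $pair (['open-utf8', 'open-UTF-8'], ['open-UTF-8', 'open-utf8']) {
    my ($row, $col) = @$pair;
    printf "%-10s vs %-10s %s\n", $row, $col, relative_percent($row, $col);
}
```

Read this way, 712% in the open-utf8 row means the :utf8 layer handled roughly eight times as many iterations per second as :encoding(UTF-8) on this file.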
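The results surprised me a bit, so my questions are:
- First, is anything wrong with the code above?
- Why is there such a huge difference between the strict UTF-8 and the lax utf8, but only at the IO-layer level (with decode the two run at the same rate)?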
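:utf8
The :utf8 PerlIO layer is a pseudo-layer: it is only a flag on the PerlIO handle that the ops reading from it check. The behavior depends on the op used:
- read(), sysread() and recv(): perform no validation of the UTF-8 sequences. They do walk the sequences in order to count how many were read.
- readline(): validates that the octets read are valid in the utf8 category and emits a warning if they contain malformed UTF-8. The validation procedure is the same as the one used in [...].
The :utf8 flag/layer should not be used for reading unless you are willing to accept malformed UTF-X, which can lead to bugs or segmentation faults.

:encoding
The PerlIO :encoding layer is provided by PerlIO::encoding. The layer implements an incremental decoder framework for subclasses of Encode::Encoding. The Perl/XS subclass is invoked through a method call for each incremental decode, and buffers are copied between the layer and the subclass.

utf8 vs UTF-8
The utf8 encoding form is a superset of the UTF-8 encoding form specified by the Unicode Consortium. The utf8 encoding form accepts encoded code points that are ill-formed in the UTF-8 encoding form, such as surrogates and code points above U+10FFFF. These should be avoided even where they are treated as Unicode. The utf8 encoding should not be used for interchange; it is Perl's internal encoding. Use the UTF-8 encoding form instead.

Benchmarks of slurping UTF-8 encoded files
Modules used in the benchmarks: Encode, Unicode::UTF8, PerlIO::encoding and PerlIO::utf8_strict. The benchmark script and its results appear further down.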
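
To make the utf8 vs UTF-8 distinction concrete, here is a minimal sketch using only the core Encode module. It feeds the three octets that encode the surrogate U+D800 to both encoding names and reports whether decoding croaks. The helper name decodes_ok is hypothetical. As far as I know, the strict form rejects this input and the lax form accepts it, but the exact behavior may vary with the Encode version.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# The octets ED A0 80 encode the surrogate U+D800, which is ill-formed
# in UTF-8 as defined by Unicode.
my $surrogate_octets = "\xED\xA0\x80";

# Hypothetical helper: returns 1 if $octets decode without error under
# $encoding, 0 if decode() croaks.
sub decodes_ok {
    my ($encoding, $octets) = @_;
    my $ok = eval { decode($encoding, $octets, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}

for my $encoding ('UTF-8', 'utf8') {
    printf "%-5s %s\n", $encoding,
        decodes_ok($encoding, $surrogate_octets) ? 'accepted' : 'rejected';
}
```

FB_CROAK turns malformed input into an exception, and LEAVE_SRC keeps decode() from modifying the input buffer, which makes the check safe to repeat on the same data.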
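Here are my results (Ubuntu 14.10, Lenovo Edge 540, Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz): [...]. Input file: [...]

@HåkonHægland Similar. There is also a huge difference between lazy/strict on the layer and lazy/strict with decode. Path::Tiny is a surprise.

@jm666, the reason is probably that HåkonHægland does not have Unicode::UTF8 installed.

Just the :utf8 layer plus a regular use warnings qw(FATAL utf8) works very well. You may also want to control the sub-warnings surrogate, nonchar and non_unicode separately, which is not even possible when using the slow module.

@tchrist, no, the sub-warnings are applied only on output; for input read with readline() they have no effect. The utf8 encoding accepts malformed UTF-8, such as encoded surrogates. Try the following:
$ perl -E 'open my $fh, "...

@tchrist, by the way, [...] is not slow, and it implements support for all of the sub-warnings: surrogate, nonchar and non_unicode.

Your numbers are impressive. For what it's worth, I meant :encoding(UTF-8), which is notoriously slow and also does not respond to the sub-warnings.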
#!/usr/bin/perl
use strict;
use warnings;

use Benchmark           qw[];
use Config              qw[%Config];
use IO::Dir             qw[];
use IO::File            qw[SEEK_SET];
use Encode              qw[];
use Unicode::UTF8       qw[];
use PerlIO::encoding    qw[];
use PerlIO::utf8_strict qw[];

# https://github.com/chansen/p5-unicode-utf8/tree/master/benchmarks/data
my $dir  = 'benchmarks/data';
my @docs = do {
    my $d = IO::Dir->new($dir)
      or die qq/Could not open directory '$dir': $!/;
    sort grep { /^[a-z]{2}\.txt/ } $d->read;
};

printf "perl:                %s (%s %s)\n", $], @Config{qw[osname osvers]};
printf "Encode:              %s\n", Encode->VERSION;
printf "Unicode::UTF8:       %s\n", Unicode::UTF8->VERSION;
printf "PerlIO::encoding:    %s\n", PerlIO::encoding->VERSION;
printf "PerlIO::utf8_strict: %s\n", PerlIO::utf8_strict->VERSION;

foreach my $doc (@docs) {

    # Slurp the document as raw octets and decode it once for the statistics.
    my $octets = do {
        open my $fh, '<:raw', "$dir/$doc" or die $!;
        local $/;
        <$fh>;
    };

    my $string = Unicode::UTF8::decode_utf8($octets);

    # Count code points per UTF-8 sequence length (1 to 4 octets).
    my @ranges = (
        [    0x00,     0x7F, qr/[\x{00}-\x{7F}]/        ],
        [    0x80,    0x7FF, qr/[\x{80}-\x{7FF}]/       ],
        [   0x800,   0xFFFF, qr/[\x{800}-\x{FFFF}]/     ],
        [ 0x10000, 0x10FFFF, qr/[\x{10000}-\x{10FFFF}]/ ],
    );

    my @out;
    foreach my $r (@ranges) {
        my ($start, $end, $regexp) = @$r;
        my $count = () = $string =~ m/$regexp/g;
        push @out, sprintf "U+%.4X..U+%.4X: %d", $start, $end, $count
          if $count;
    }

    printf "\n\n%s: Size: %d Code points: %d (%s)\n",
      $doc, length $octets, length $string, join ' ', @out;

    # In-memory handles over the same octets, one per layer under test.
    open my $fh_raw, '<:raw', \$octets
      or die qq/Could not open a :raw fh: '$!'/;

    open my $fh_encoding, '<:encoding(UTF-8)', \$octets
      or die qq/Could not open a :encoding fh: '$!'/;

    open my $fh_utf8_strict, '<:utf8_strict', \$octets
      or die qq/Could not open a :utf8_strict fh: '$!'/;

    # Each case slurps the whole handle, then rewinds it for the next run.
    Benchmark::cmpthese( -10, {
        ':encoding(UTF-8)' => sub {
            my $data = do { local $/; <$fh_encoding> };
            seek($fh_encoding, 0, SEEK_SET)
              or die qq/Could not rewind fh: '$!'/;
        },
        ':utf8_strict' => sub {
            my $data = do { local $/; <$fh_utf8_strict> };
            seek($fh_utf8_strict, 0, SEEK_SET)
              or die qq/Could not rewind fh: '$!'/;
        },
        'Encode' => sub {
            my $data = Encode::decode('UTF-8', do { local $/; scalar <$fh_raw> },
                                      Encode::FB_CROAK|Encode::LEAVE_SRC);
            seek($fh_raw, 0, SEEK_SET)
              or die qq/Could not rewind fh: '$!'/;
        },
        'Unicode::UTF8' => sub {
            my $data = Unicode::UTF8::decode_utf8(do { local $/; scalar <$fh_raw> });
            seek($fh_raw, 0, SEEK_SET)
              or die qq/Could not rewind fh: '$!'/;
        },
    });
}
$ perl benchmarks/slurp.pl
perl: 5.023001 (darwin 14.4.0)
Encode: 2.75
Unicode::UTF8: 0.60
PerlIO::encoding: 0.21
PerlIO::utf8_strict: 0.006

ar.txt: Size: 25918 Code points: 14308 (U+0000..U+007F: 2698 U+0080..U+07FF: 11610)
                    Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
:encoding(UTF-8)  3058/s               --   -19%         -73%          -87%
Encode            3754/s              23%     --         -67%          -84%
:utf8_strict     11361/s             272%   203%           --          -52%
Unicode::UTF8    23620/s             672%   529%         108%            --

el.txt: Size: 103974 Code points: 58748 (U+0000..U+007F: 13560 U+0080..U+07FF: 45150 U+0800..U+FFFF: 38)
                    Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
:encoding(UTF-8)   780/s               --   -19%         -73%          -86%
Encode             958/s              23%     --         -66%          -83%
:utf8_strict      2855/s             266%   198%           --          -48%
Unicode::UTF8     5498/s             605%   474%          93%            --

en.txt: Size: 82171 Code points: 82055 (U+0000..U+007F: 81988 U+0080..U+07FF: 18 U+0800..U+FFFF: 49)
                    Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
:encoding(UTF-8)  1111/s               --   -16%         -90%          -96%
Encode            1327/s              19%     --         -88%          -95%
:utf8_strict     11446/s             931%   763%           --          -60%
Unicode::UTF8    28635/s            2478%  2058%         150%            --

ja.txt: Size: 180109 Code points: 64655 (U+0000..U+007F: 6913 U+0080..U+07FF: 30 U+0800..U+FFFF: 57712)
                    Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
:encoding(UTF-8)   553/s               --   -27%         -72%          -91%
Encode             757/s              37%     --         -61%          -87%
:utf8_strict      1960/s             254%   159%           --          -67%
Unicode::UTF8     5915/s             970%   682%         202%            --

lv.txt: Size: 138397 Code points: 127160 (U+0000..U+007F: 117031 U+0080..U+07FF: 9021 U+0800..U+FFFF: 1108)
                    Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
:encoding(UTF-8)   605/s               --   -19%         -80%          -91%
Encode             746/s              23%     --         -75%          -88%
:utf8_strict      3043/s             403%   308%           --          -53%
Unicode::UTF8     6453/s             967%   765%         112%            --

ru.txt: Size: 151633 Code points: 85266 (U+0000..U+007F: 19263 U+0080..U+07FF: 65639 U+0800..U+FFFF: 364)
                    Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
:encoding(UTF-8)   542/s               --   -19%         -73%          -86%
Encode             673/s              24%     --         -66%          -83%
:utf8_strict      2001/s             269%   197%           --          -50%
Unicode::UTF8     4010/s             640%   496%         100%            --

sv.txt: Size: 96449 Code points: 92894 (U+0000..U+007F: 89510 U+0080..U+07FF: 3213 U+0800..U+FFFF: 171)
                    Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
:encoding(UTF-8)   923/s               --   -17%         -85%          -93%
Encode            1109/s              20%     --         -82%          -92%
:utf8_strict      5998/s             550%   441%           --          -56%
Unicode::UTF8    13604/s            1374%  1127%         127%            --

zh.txt: Size: 62891 Code points: 24519 (U+0000..U+007F: 5317 U+0080..U+07FF: 32 U+0800..U+FFFF: 19170)
                    Rate :encoding(UTF-8) Encode :utf8_strict Unicode::UTF8
:encoding(UTF-8)  1630/s               --   -23%         -75%          -87%
Encode            2104/s              29%     --         -68%          -83%
:utf8_strict      6549/s             302%   211%           --          -48%
Unicode::UTF8    12630/s             675%   500%          93%            --
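
The 'Encode' case in the benchmark reads raw octets and then decodes them strictly. For readers who want that pattern as a reusable routine without installing anything beyond core Perl, a minimal sketch might look like the following. The helper name slurp_utf8_strict is hypothetical, and per the numbers above this is the slower of the two strict decode-after-read options; Unicode::UTF8 was several times faster in these runs.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: read a whole file as raw octets, then decode it
# with the strict UTF-8 encoding. Dies on malformed input (FB_CROAK).
sub slurp_utf8_strict {
    my ($path) = @_;
    open my $fh, '<:raw', $path
        or die "Could not open '$path': $!";
    my $octets = do { local $/; <$fh> };
    close $fh;
    $octets = '' unless defined $octets;
    return decode('UTF-8', $octets, FB_CROAK);
}

# Usage: pass a file name on the command line to see its size in code points.
printf "%d code points\n", length slurp_utf8_strict($ARGV[0]) if @ARGV;
```

Opening with :raw keeps the read itself free of any decoding layer, so validation happens in one place and a malformed file fails loudly instead of producing a questionable string.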