有没有办法加速这个Perl脚本？_Perl

有没有办法加速这个Perl脚本？

perl

有没有办法加速这个Perl脚本？,perl,Perl,我有这个脚本，它结合了两个基于相同部分的文本 use warnings; use strict; use utf8; use open ':encoding(utf8)'; binmode(STDOUT, ":utf8"); my $f1 = 'input.txt'; my $f2 = 'add.txt'; my $f3 = 'output.txt'; my %ids; my $fh; open $fh, '<', $f2 or die "Can't read the file wi

我有这个脚本，它结合了两个基于相同部分的文本

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';
binmode(STDOUT, ":utf8");

my $f1 = 'input.txt';
my $f2 = 'add.txt';
my $f3 = 'output.txt';

my %ids;
my $fh;

open $fh, '<', $f2 or die "Can't read the file with replacements: $!";
while (<$fh>) {
    chomp;
    my ($name, $id) = split /=/;
    $ids{$name} = $id;
}
close $fh;

open my $fho, '>', $f3 or die "Can't write output file: $!";
open $fh, '<', $f1 or die "Can't read input file: $!";
while (<$fh>) {
    for my $name (keys %ids) {
        s/$name/${name} $ids{$name}/;
    }
    print $fho $_;
}

close $fh;
close $fho;

add.txt-要添加的文本列表

TARGET TEXT 1=ADDITIONAL TEXT 1
TARGET TEXT 2=ADDITIONAL TEXT 2
TARGET TEXT 3=ADDITIONAL TEXT 3
TARGET TEXT 4=ADDITIONAL TEXT 4

output.txt将是：

random text random text, TARGET TEXT 1 ADDITIONAL TEXT 1 — random
textTARGET TEXT 2 ADDITIONAL TEXT 2! random text random text
random text random text random text
TARGET TEXT 3 ADDITIONAL TEXT 3 random text random text TARGET TEXT 4 ADDITIONAL TEXT 4

我有一个相当大的文本文件要组合（~40Mb），脚本的运行速度非常慢。有没有办法加快速度？或者可能有人知道一种可以做完全相同事情的工具。

将模式（键）连接到一个大型正则表达式中：

/(a|b|c|d|...|zzz)/

编译一次大型正则表达式，并将组

$1

用作查找中的键

s/$big_re/$1 . $addtext{$1}/ge;

（

/e

标志使替换成为一个表达式，而不是文本。您正在编写

$1.$text

，但可能需要在表达式中执行其他操作（调用函数、使其小写、添加更多格式等）。请参阅文档，在示例中查找

/e

标志。

这一点您可以非常方便地加快速度：

for my $name (keys %ids) {
    s/$name/${name} $ids{$name}/;
}

将其编译为正则表达式：

my $search = join "|", map {quotemeta} keys %ids; 
   $search = qr/\b($search)\b/;

然后在循环中：

s/$search/$1 $ids{$1}/g;

注意-我已经为分词匹配添加了

\b

，因为这不太可能让您在子字符串和排序顺序上出错。显然，您不需要这样做

但这意味着您没有在每次迭代中进行正则表达式匹配的循环。

带有in循环的循环总是可疑的，尤其是当涉及IO时

while (<$fh>) {
    for my $name (keys %ids) {
        s/$name/${name} $ids{$name}/;
    }
    print $fho $_;
}

如果每个文件包含的密钥占总可能密钥的百分比较低，这将是一个胜利。正则表达式将大大加快速度，因为它将使用比暴力重新扫描每个密钥的文本更高效的算法。它还将在正则表达式引擎中执行，通常比Perl字节码更高效

通过使用其他答案中的建议，并在一个

s//

中完成所有操作，可以提高效率

my $text = do { local $/; <> };

$text =~ s{($all_keys_re)}{$1 $ids{$1}}g;

print $text;

my$text=do{local$/；}；
$text=~s{（$all{u key}}{$1$ids{$1}}}g；
打印$text；

尽管看起来似乎有道理，但上述答案假设替换模式的应用与add.txt中定义的模式顺序无关

原问题应进一步澄清，以便正确回答

例如，

input.txt只能更改一次

，等等

如果add.txt中的一个模式更改了一些行，而add.txt中的另一个模式更改了以前更改的行，该怎么办？

您有一个微妙的错误。如果一行包含相同的键两次，则只会替换第一个键。您的

s//

需要

/g

来修复此问题。

拆分允许“限制”参数允许您在找到拆分字符后停止搜索该字符。这将加快第一个循环的速度。组合正则表达式时有许多细微的边缘情况。我强烈建议您这样做。组合正则表达式时有许多细微的边缘情况。我强烈建议您这样做。感谢您提供了如此全面的解决方案wer.my$text=do{local$/；}；完成了它的工作。我在一个示例文本上测试了该脚本，它的运行速度比以前快了21倍（14s vs 304s）！我还尝试使用Regex:：Assemble，但由于某些原因，该脚本的运行速度较慢（39s）。如果你发现“目标文本1”的替代品也会吃掉“目标文本10”或更多，那么速度可能会提高7倍而不是21倍。举个例子。Schwern在组合正则表达式方面一直是一个明智的声音，请注意。
# Or use File::Slurp or Path::Tiny
my $text = do { local $/; <$fh> };

for my $name (keys %ids) {
    # The /g is important to replace all instances of each key
    $text =~ s/$name/${name} $ids{$name}/g;
}
print $fho $text;

my $all_keys = Regexp::Assemble->new;
$all_keys->add( keys %ids );
my $all_keys_re = $all_keys->re;

# Get all the matched keys at once, the /g is important.
my @matches = $text =~ /($all_keys_re)/g;

# Replace all the matched keys. Use uniq to avoid doing the replacement twice.
for my $match (uniq @matches) {
    # Use /g to replace multiple copies of the same key on a line.
    $text =~ s/$match/$match $ids{$match}/g;
}
print $fho $text;

my $text = do { local $/; <> };

$text =~ s{($all_keys_re)}{$1 $ids{$1}}g;

print $text;