PERL-搜索文本中的字符串，并获取带有子字符串的结果_Perl

PERL-搜索文本中的字符串，并获取带有子字符串的结果

perl

PERL-搜索文本中的字符串，并获取带有子字符串的结果,perl,Perl,我正在txt文件中搜索，以查找包含一些字符的文本，并将它们移动到另一个文件夹中我正在搜索以下两个关键字： 95-B/A 95-ASB/A 我的代码是这样的第一次编辑：放入整个代码 use warnings; use File::Copy; use File::Basename; my (%count,%countNegative,%countPositive,$i,$j,$key,@keys,@keysNegative,@keysPositive,$token,$tokenNegative

我正在txt文件中搜索，以查找包含一些字符的文本，并将它们移动到另一个文件夹中

我正在搜索以下两个关键字：

95-B/A
95-ASB/A

我的代码是这样的

第一次编辑：放入整个代码

use warnings;
use File::Copy;
use File::Basename;
my (%count,%countNegative,%countPositive,$i,$j,$key,@keys,@keysNegative,@keysPositive,$token,$tokenNegative,$tokenPositive,@tokens,@tokensNegative,@tokensPositive,$totalCount,$negativeCount,$positiveCount,$totalCountNegativeInText,$totalCountPositiveInText);



@files = <*.txt>;
foreach $fileToProcess (@files) {
open(INFILE,"<$fileToProcess") or die("cannot open file");
while (<INFILE>) { 
@tokens = &tokenize($_);
   foreach $token (@tokens) {
      if ($token =~ /[a-zA-Z]/) { 
         $count{$token} = $count{$token} ? $count{$token}+1 : 1;
      }
   }
}

@keys = keys %count;
@keys = sort { $count{$b} <=> $count{$a} } @keys;

for ($i=0;$i<=$#keys;$i++) { 
   if ((lc $keys[$i] eq lc '95-B/A') || (lc $keys[$i] eq lc '95-ASB/A')) {
   $oldlocation = $fileToProcess;
   $newlocation = '95BA';
   File::Copy::move($oldlocation, $newlocation);
   }
}

close(INFILE);
}
exit(0);


use strict;

my $true = 1;
my $false = 0;
my $text = "";
my $word;
# read text
while (<>) { $text .= $_; }
foreach $word (&tokenize($text)) {
   &printText(&rule3(&rule2(&rule1(&makeUnits(&cleanUp($word))))));
}
print "\n";
exit(0);

sub tokenize {
   $_ = $_[0];
   s/\s+/\n/g;
   s/^\n//;
   s/$/\n/;
   s/([.,!?:;,])\n/\n$1\n/g;
   s/\n(["'`])([^\n])/\n$1\n$2/g;
   s/([^\n])(["'`])\n/$1\n$2\n/g;
   s/([^\n])([.,])\n/$1\n$2\n/g;
   s/\n([A-Z])\n\./\n$1./g;
   s/\n\.\n([^"A-Z])/\.\n$1/g;
   s/(\.[A-Z]+)\n\.\n/$1.\n/g;
   s/([^\n])'s\n/$1\n's\n/g;
   s/([^\n])n't\n/$1\nn't\n/g;
   s/([^\n])'re\n/$1\n're\n/g;
   s/\n\$([^\n])/\n\$\n$1/g;
   s/([^\n])%\n/$1\n%\n/g;
   s/Mr\n\.\n/Mr.\n/g;
   return(split(/\n/,$_));
}

sub printText {
   my $i;
   for ($i=0;$i<@_;$i++) {
      print join('',reverse(split(//,&breakUnits($_[$i]))));
   }
   print " ";
}

使用警告；
使用文件：：复制；
使用File:：Basename；
我的（%count，%countNegative，%countNegative，$i，$j，$key，@keys，@keysNegative，@keys，$tokenNegative，$tokenNegative，@tokensNegative，@tokensPositive，$totalCount，$negativeCount，$positiveCount，$totalCountnegativeContext，$totalCountPositiveContext）；
@文件=；
foreach$fileToProcess（@files）{
open（infle，“正则表达式模式匹配将基于子字符串进行匹配。为避免这种情况，请使用\b
匹配“单词边界”
if (($keys[$i] =~ m/\b$tenka\b/) or ($keys[$i] =~ m/\b$tenksba\b/ )) {

我删除了代码中不需要的部分，并对其进行了修改，使其更具可读性。现在问题已经清楚了：%count
几乎是全局的，但每个文件都需要一个新的%count

#!/usr/bin/perl
use warnings;
use strict;

my @files = glob '*.txt';

for my $fileToProcess (@files) {
    my %count;    #   <---- HERE. Declare %count in the loop.
    open my $IN, '<', $fileToProcess or die "Cannot open $fileToProcess: $!";
    while (<$IN>) {
        for my $token (tokenize($_)) {
            if ($token =~ /[a-zA-Z]/) {
                ++$count{$token};     # Ternary ? : not needed.
            }
        }
    }

    my @keys = sort { $count{$b} <=> $count{$a} } keys %count;

    for my $key (@keys) {
        if (lc $key eq lc '95-B/A' or lc $key eq lc '95-ASB/A') {
            print "move $fileToProcess because of $key.\n"
        }
    }
}

sub tokenize {
   $_ = $_[0];
   s/\s+/\n/g;
   s/^\n//;
   s/$/\n/;
   s/([.,!?:;,])\n/\n$1\n/g;
   s/\n(["'`])([^\n])/\n$1\n$2/g;
   s/([^\n])(["'`])\n/$1\n$2\n/g;
   s/([^\n])([.,])\n/$1\n$2\n/g;
   s/\n([A-Z])\n\./\n$1./g;
   s/\n\.\n([^"A-Z])/\.\n$1/g;
   s/(\.[A-Z]+)\n\.\n/$1.\n/g;
   s/([^\n])'s\n/$1\n's\n/g;
   s/([^\n])n't\n/$1\nn't\n/g;
   s/([^\n])'re\n/$1\n're\n/g;
   s/\n\$([^\n])/\n\$\n$1/g;
   s/([^\n])%\n/$1\n%\n/g;
   s/Mr\n\.\n/Mr.\n/g;
   return (split /\n/);
}

！/usr/bin/perl
使用警告；
严格使用；
my@files=glob'*.txt'；
对于我的$fileToProcess（@files）{
我的%count；#您需要显示真实的代码。我确信这不是真的，因为$oldlocation
和$newlocation
在循环内不会更改，并且与哈希键无关。如果这是真实的，那么每次在任意哈希中找到匹配项时，您都会移动相同的文件。这会导致编译错误：在move10K1996.pl第38行的“m/\b$tenka\b”附近找到运算符预期的反斜杠，或在move10K1996.pl第38行的“$tenksba\”附近找到运算符预期的（$keys[$i]=~m/\”反斜杠（在\？之前缺少运算符）。如果我在第一个b之后添加a/它仍然会得到没有/a的。你的解释站不住脚。正则表达式95-b/a
不会匹配一个只有95-b
、有或没有单词边界的字符串。好的，我编辑了我的初始帖子并放了我的全部代码，请看我的回复我放了我的全部代码..我不放它们rds会像你一样搜索@keys，如果你看我编辑的代码，你会发现…太棒了！非常感谢你的解决方案！所以我一直将关键字保持在%计数中，所以每次在第一次找到关键字后，关键字仍然在计数中，所以每个文件都会通过过滤器，对吗？如果我在e令牌循环速度较慢，因为另一个方法使用了更快的哈希值？@AdrFinance:您对问题的解释是正确的。如果操作正确，令牌循环中的比较不应该更慢（甚至可能更快，因为您可以提前退出循环）。如果速度是一个问题，您还可以删除sort，该服务器没有任何用途。
#!/usr/bin/perl
use warnings;
use strict;

my @files = glob '*.txt';

for my $fileToProcess (@files) {
    my %count;    #   <---- HERE. Declare %count in the loop.
    open my $IN, '<', $fileToProcess or die "Cannot open $fileToProcess: $!";
    while (<$IN>) {
        for my $token (tokenize($_)) {
            if ($token =~ /[a-zA-Z]/) {
                ++$count{$token};     # Ternary ? : not needed.
            }
        }
    }

    my @keys = sort { $count{$b} <=> $count{$a} } keys %count;

    for my $key (@keys) {
        if (lc $key eq lc '95-B/A' or lc $key eq lc '95-ASB/A') {
            print "move $fileToProcess because of $key.\n"
        }
    }
}

sub tokenize {
   $_ = $_[0];
   s/\s+/\n/g;
   s/^\n//;
   s/$/\n/;
   s/([.,!?:;,])\n/\n$1\n/g;
   s/\n(["'`])([^\n])/\n$1\n$2/g;
   s/([^\n])(["'`])\n/$1\n$2\n/g;
   s/([^\n])([.,])\n/$1\n$2\n/g;
   s/\n([A-Z])\n\./\n$1./g;
   s/\n\.\n([^"A-Z])/\.\n$1/g;
   s/(\.[A-Z]+)\n\.\n/$1.\n/g;
   s/([^\n])'s\n/$1\n's\n/g;
   s/([^\n])n't\n/$1\nn't\n/g;
   s/([^\n])'re\n/$1\n're\n/g;
   s/\n\$([^\n])/\n\$\n$1/g;
   s/([^\n])%\n/$1\n%\n/g;
   s/Mr\n\.\n/Mr.\n/g;
   return (split /\n/);
}