Regex: optimizing a while loop that reads a dictionary

regex perl dictionary

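Hello everyone, this is my first question. I am using an open-source program called MElt, which lemmatizes words (it gives the lemma, for example giving --> give). MElt runs on Linux and is written in Perl and Python. It works well so far, but it takes too much time to produce its results. I looked at the code and found the loop responsible for this: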

while (<LEFFF>) { 
  chomp;
  s/ /_/g;
#  s/(\S)-(\S)/\1_-_\2/g;
  /^(.*?)\t(.*?)\t(.*?)(\t|$)/ || next;
  $form = $1; $cats = $2; $lemma = $3;
  #print "$form \n";
  #print "$cats \n";
  #print "$lemma \n";
  if ($lower_case_lemmas) {
    $lemma = lc($lemma);
  }
  if ($it_mapping) {
    next if ($form =~ /^.+'$/);
    next if ($form eq "dato" && $lemma eq "datare"); # bourrin
    next if ($form eq "stato" && $lemma eq "stare"); # bourrin
    next if ($form eq "stata" && $lemma eq "stare"); # bourrin
    next if ($form eq "parti" && $lemma eq "parto"); # bourrin
    if ($cats =~ /^(parentf|parento|poncts|ponctw)$/) {$cats = "PUNCT"}
    if ($cats =~ /^(PRO)$/) {$cats = "PRON"}
    if ($cats =~ /^(ARTPRE)$/) {$cats = "PREDET"}
    if ($cats =~ /^(VER|ASP|AUX|CAU)$/) {$cats = "VERB"}
    if ($cats =~ /^(CON)$/) {$cats = "CONJ"}
    if ($cats =~ /^(PRE)$/) {$cats = "PREP"}
    if ($cats =~ /^(DET)$/) {$cats = "ADJ"}
    if ($cats =~ /^(WH)$/) {$cats = "PRON|CONJ"}
    next if ($form =~ /^(una|la|le|gli|agli|ai|al|alla|alle|col|dagli|dai|dal|dalla|dalle|degli|dei|del|della|delle|dello|nei|nel|nella|nelle|nello|sul|sulla)$/ && $cats eq "ART");
    next if ($form =~ /^quest[aei]$/ && $cats eq "ADJ");
    next if ($form =~ /^quest[aei]$/ && $cats eq "PRON");
    next if ($form =~ /^quell[aei]$/ && $cats eq "ADJ");
    next if ($form =~ /^quell[aei]$/ && $cats eq "PRON");
    next if ($form =~ /^ad$/ && $cats eq "PREP");
    next if ($form =~ /^[oe]d$/ && $cats eq "CONJ");
  }
  $qmlemma = quotemeta ($lemma);
  for $cat (split /\|/, $cats) {
    if (defined ($cat_form2lemma{$cat}) && defined ($cat_form2lemma{$cat}{$form}) && $cat_form2lemma{$cat}{$form} !~ /(^|\|)$qmlemma(\||$)/) {
      $cat_form2lemma{$cat}{$form} .= "|$lemma";
    } else {
      $cat_form2lemma{$cat}{$form} = "$lemma";
      $form_lemma_suffs = "@".$form."###@".$lemma;
      while ($form_lemma_suffs =~ s/^(.)(.+)###\1(.+)/\2###\3/) {
        if (length($2) <= 8) {
          $cat_formsuff_lemmasuff2count{$cat}{$2}{$3}++;
          if ($multiple_lemmas) {
            $cat_formsuff_lemmasuff2count{$cat}{$2}{__ALL__}++;
          }
        }
      }
    }
  }
}

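Try changing this line:
/^(.*?)\t(.*?)\t(.*?)(\t|$)/ || next;

to:
/^([^\t]++)\t([^\t]++)\t([^\t]++)(\t|$)/ || next;

(The possessive quantifier ++ needs Perl 5.10 or later. Unlike the original pattern, this one also skips lines in which one of the three fields is empty.)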
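To see what the two field-extraction patterns do on concrete input, here is a minimal sketch that runs both of them, plus a regex-free split on tabs, over a few made-up lines. The sample lines and the helper names parse_with and parse_with_split are assumptions for illustration only; they are not taken from MElt or from the real Lefff file.

```perl
use strict;
use warnings;

# Hypothetical entries in the form<TAB>categories<TAB>lemma layout the loop expects.
my @lines = (
    "giving\tVER\tgive\textra",   # four fields
    "gli\tART\til",               # exactly three fields
    "no tabs here",               # malformed: should be skipped
);

# The original (lazy) pattern and the suggested (possessive) one, compiled once.
my $lazy       = qr/^(.*?)\t(.*?)\t(.*?)(\t|$)/;
my $possessive = qr/^([^\t]++)\t([^\t]++)\t([^\t]++)(\t|$)/;

# Return the first three fields as an array ref, or undef when the line does not match.
sub parse_with {
    my ($re, $line) = @_;
    return $line =~ $re ? [$1, $2, $3] : undef;
}

# Regex-free variant: split on tabs, keeping at most four pieces.
sub parse_with_split {
    my ($line) = @_;
    my @fields = split /\t/, $line, 4;
    return @fields >= 3 ? [@fields[0 .. 2]] : undef;
}

# Render a parse result for printing.
sub show {
    my ($result) = @_;
    return defined $result ? join("|", @$result) : "(no match)";
}

for my $line (@lines) {
    printf "lazy=%s  possessive=%s  split=%s\n",
        show(parse_with($lazy, $line)),
        show(parse_with($possessive, $line)),
        show(parse_with_split($line));
}
```

On well-formed lines all three approaches return the same fields. They only differ on lines with an empty field, which the possessive pattern rejects and the other two accept.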

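For the next regexes, remove all unneeded capturing parentheses. Keep a non-capturing or atomic group around the alternation, though, so that ^ and $ still apply to every alternative. Change
/^(parentf|parento|poncts|ponctw)$/

to
/^(?:parent[fo]|ponct[sw])$/   or why not   /^p(?>arent[fo]|onct[sw])$/

and change
/^(una|la|le|gli|agli|ai|al|alla|alle|col|dagli|dai|dal|dalla|dalle|degli|dei|del|della|delle|dello|nei|nel|nella|nelle|nello|sul|sulla)$/

to
/^(?>una|l[ae]|gli|a(?>gli|i|l(?>l[ae])?)|col|d(?>agli|egli|ello|[ae](?>i|l(?>l[ae])?))|ne(?>i|l(?>l[aeo])?)|sul(?>la)?)$/

(Note: you can improve this line by reordering it so that the most frequent determiners/articles come first.)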
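When rewriting these alternations, the anchoring is easy to get wrong: in /^a|b$/ the ^ belongs only to the first branch and the $ only to the last. The sketch below shows the difference on the punctuation tags, and also shows a plain hash lookup for the article stop-list. The hash lookup is a different technique from the regex rewrite suggested above; %skip_article and is_skipped_article are hypothetical names introduced here for illustration.

```perl
use strict;
use warnings;

# Without a group, ^ binds only to the first branch and $ only to the last one.
my $unanchored = qr/^parent[fo]|ponct[sw]$/;
# With a non-capturing group, the whole string must be one of the four tags.
my $grouped    = qr/^(?:parent[fo]|ponct[sw])$/;

for my $tag (qw(parentf ponctw parentfoo xponcts VER)) {
    printf "%-10s unanchored=%d grouped=%d\n",
        $tag,
        ($tag =~ $unanchored ? 1 : 0),
        ($tag =~ $grouped    ? 1 : 0);
}

# Exact-match word lists can also be tested with a hash instead of a regex.
# The words are the ones from the stop-list in the question's loop.
my %skip_article = map { $_ => 1 } qw(
    una la le gli agli ai al alla alle col dagli dai dal dalla dalle
    degli dei del della delle dello nei nel nella nelle nello sul sulla
);

# True when the entry is one of the listed forms tagged as an article.
sub is_skipped_article {
    my ($form, $cats) = @_;
    return $cats eq "ART" && exists $skip_article{$form};
}

print "della/ART skipped\n"  if is_skipped_article("della", "ART");
print "della/PREP kept\n"    if !is_skipped_article("della", "PREP");
```

Inside the loop, the hash test would stand in for the long `next if ($form =~ /^(una|la|...)$/ && $cats eq "ART")` line. Whether it is measurably faster than the regex on this data would have to be benchmarked.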
Try changing this line:
while ($form_lemma_suffs =~ s/^(.)(.+)###\1(.+)/\2###\3/)

to
[...]

You can change:
next if ($form =~ /^quest[aei]$/ && $cats eq "ADJ");

to
[...]
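(Experimental) You can replace these two lines:
next if ($form eq "stato" && $lemma eq "stare"); # bourrin
next if ($form eq "stata" && $lemma eq "stare"); # bourrin

with
next if ($lemma eq "stare" && ($form eq "stato" || $form eq "stata"));

Important: Perl lets you compile a regex ahead of time with qr//, which is useful in your case because you apply the same regexes on every pass through the while loop. If you do this, don't forget to put the regex definitions outside the loop! Example:
my $regex = qr/^(?:parent[fo]|ponct[sw])$/;
while (<LEFFF>) {
...
if ($cats =~ $regex) {$cats = "PUNCT"}

You would be better off posting this here:

The OP has now.

I tried it, but it is still the same (I mean, the regex no longer recognizes the words). I think the problem is that the program compares 490489 words against each word in the sentence (490489 * 5 words = about 2.5 million iterations). The longer the sentence, the more time it takes.

Re: /^una|l[ae]|a(?>i|l(?>l[a.. This will actually slow down matching. A simple alternation of constant strings is optimized into a trie data structure, which allows extremely fast lookups. The result is similar to the regex you wrote, but with far less runtime overhead. The only correct optimization is to remove the (…) capture groups (by changing them to non-capturing (?:…) groups).

Thanks everyone for your answers... I tried the changes you suggested, but the execution time is still the same :(. Even with precompiled regexes, the program still reads all 500000 entries of the dictionary (Lefff). Here is what I am thinking now, and I would like your opinion on it: the LEFFF file (as shown before) is not even sorted (if it were, we could apply a binary search to it). I have the script that generates the LEFFF, or I could write a script to sort it. My idea is to find a hash function and apply it to the LEFFF to generate som...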
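Since the remaining cost appears to be reading and parsing every dictionary entry on each run, one option is to parse the file once and cache the resulting hash on disk. The following is a minimal sketch using the core Storable module, not MElt's actual code: build_table, load_table, the toy file names, and the simplified {category}{form} table of lemma lists are all assumptions made for illustration, and the suffix-count table from the original loop is left out.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

# Parse a tab-separated dictionary into $table->{cat}{form} = [lemma, ...].
sub build_table {
    my ($path) = @_;
    my %table;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($form, $cats, $lemma) = split /\t/, $line, 4;
        next unless defined $lemma;                  # skip malformed lines
        for my $cat (split /\|/, $cats) {
            my $lemmas = $table{$cat}{$form} //= [];
            push @$lemmas, $lemma unless grep { $_ eq $lemma } @$lemmas;
        }
    }
    close $fh;
    return \%table;
}

# Reuse the cache when it is at least as recent as the dictionary,
# otherwise parse the dictionary and write a fresh cache.
sub load_table {
    my ($dict, $cache) = @_;
    if (-e $cache && -M($cache) <= -M($dict)) {
        return retrieve($cache);
    }
    my $table = build_table($dict);
    nstore($table, $cache);
    return $table;
}

# Toy demonstration with a three-line dictionary in a temporary directory.
my $dir   = tempdir(CLEANUP => 1);
my $dict  = "$dir/toy_lefff.txt";
my $cache = "$dir/toy_lefff.storable";

open my $out, '>', $dict or die "Cannot write $dict: $!";
print {$out} "giving\tVER\tgive\n", "stato\tVER|NOM\tessere\n", "stato\tVER\tstare\n";
close $out;

my $first  = load_table($dict, $cache);   # parses the text file and writes the cache
my $second = load_table($dict, $cache);   # served from the cache file
print join("|", @{ $second->{VER}{stato} }), "\n";
```

Once the table is in memory, looking up one word is a single hash access, so the per-sentence cost no longer depends on the dictionary size. Loading a cache with several hundred thousand entries still takes some time, so the gain over re-parsing should be measured on the real file.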