How to extract the text between two patterns using a Perl regex


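In the lines below, how can I use a Perl regex to store the lines between "Description:" and "Tag:" in a variable, and which data type should I use: a string, a list, or something else?

(I am trying to write a Perl program that extracts information from a text file of Debian package information and converts it into an RDF (OWL) file, i.e. an ontology.)

Description: library for decoding ATSC A/52 streams (development) liba52 is a free library for decoding ATSC A/52 streams. The A/52 standard is used in a variety of applications, including digital television and DVD. It is also known as AC-3.

This package contains the development files. Homepage:

Tag: devel::library, role::devel-lib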

The code I have written so far is:

#!/usr/bin/perl
open(DEB,"Packages");
open(ONT,">>debianmodelling.txt");

$i=0;
while(my $line = <DEB>)
{

    if($line =~ /Package/)
    {
        $line =~ s/Package: //;
        print ONT '  <package rdf:ID="instance'.$i.'">';
        print ONT    '    <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</name>'."\n";
    }
elsif($line =~ /Priority/)
{
    $line =~ s/Priority: //;
    print ONT '    <priority rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</priority>'."\n";
}

elsif($line =~ /Section/)
{
    $line =~ s/Section: //;
    print ONT '    <Section rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Section>'."\n";
}

elsif($line =~ /Maintainer/)
{
    $line =~ s/Maintainer: //;
    print ONT '    <maintainer rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</maintainer>'."\n";
}

elsif($line =~ /Architecture/)
{
    $line =~ s/Architecture: //;
    print ONT '    <architecture rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</architecture>'."\n";
}
elsif($line =~ /Version/)
{
    $line =~ s/Version: //;
    print ONT '    <version rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</version>'."\n";
}
elsif($line =~ /Provides/)
{
    $line =~ s/Provides: //;
    print ONT '    <provides rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</provides>'."\n";
}
elsif($line =~ /Depends/)
{
    $line =~ s/Depends: //;
    print ONT '    <depends rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</depends>'."\n";
}
elsif($line =~ /Suggests/)
{
    $line =~ s/Suggests: //;
    print ONT '    <suggests rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</suggests>'."\n";
}

elsif($line =~ /Description/)
{
    $line =~ s/Description: //;
    print ONT '    <Description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Description>'."\n";
}
elsif($line =~ /Tag/)
{
    $line =~ s/Tag: //;
    print ONT '    <Tag rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Tag>'."\n";
    print ONT '  </Package>'."\n\n";
}
$i=$i+1;
}


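If your description and tag may be on separate lines, you may need the /s modifier, which lets . match newlines as well, so that a \n between the two markers does not break the match. For example: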

$_=qq{Description:foo 
      more description on 
      new line Tag: some
      tag};
# notice the trailing /s modifier; the .* on both sides drops the text outside the markers
s/.*?Description:(.*?)Tag:.*/$1/s;
print;
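On the data type question: if the text is only going to be written back out, a plain scalar string is probably enough. The following is a minimal sketch, assuming the whole stanza is already held in one string. The names $stanza and $description and the package name example-dev are made up for illustration. A match in list context returns the captured group directly, so no substitution is needed.

```perl
use strict;
use warnings;

# Hypothetical sample stanza; real input would come from the Packages file.
my $stanza = <<'END';
Package: example-dev
Description: library for decoding ATSC A/52 streams (development)
 liba52 is a free library for decoding ATSC A/52 streams.
Tag: devel::library, role::devel-lib
END

# List-context match: the capture lands in $description, or undef if no match.
# /s lets . cross newlines, /m lets ^ anchor at the start of any line.
my ($description) = $stanza =~ /^Description:\s*(.*?)^Tag:/ms;

print defined $description ? $description : "no description found\n";
```

Anchoring both markers with ^ under /m keeps a stray "Tag:" inside the description text from ending the capture early.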

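Given:

my $example; # holds the example text above

you can do:

(my $result=$example)=~s/^.*?\n(Description:)/$1/s; # strip up to first marker

$result=~s/(\nTag:[^\n]*\n).+$/$1/s; # strip everything after second marker line

or, in a single substitution:

(my $result=$example)=~s/^.*?\n(Description:.+?Tag:[^\n]*\n).*$/$1/s;

Both assume that the Tag: value fits on a single line.

If that is not the case, you could try: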

(my $result=$example)=~s/
    ^.*?                     # drop everything before the capture
    (                        # start capture
        Description:         # literal 'Description:'
        .+?                  # any chars (non-greedy) up to
        Tag:                 # literal 'Tag:'
        .+?                  # any chars up to
    )
    (?:                      # either
      \n[A-Z][\w-]*:         #  another tagged value name
    |                        # or
      $                      #  end of string
    )
    .*                       # drop everything after the capture
/$1/sx;

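I believe the problem comes from using a line-reading loop on data that is organized in paragraphs. If you can afford to slurp the file into memory and apply split with a capturing delimiter, processing becomes much smoother: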

#!/usr/bin/perl -w

use strict;
use diagnostics;
use warnings;

use English;

# simple sample sub
my $printhead = sub {
  printf "%5s got the tag '%s ...'\n", '', substr( shift, 0, 30 );
};
# map keys/tags? to functions
my %tagsoups = (
    'PackageName' => sub {printf "%5s got the name '%s'\n", '', shift;}
  , 'Description' => sub {printf "%5s got the description:\n---------\n%s\n----------\n", '', shift;}
  , 'Tag'         => $printhead
);
# slurp Packages (fallback: parse using $INPUT_RECORD_SEPARATOR = "Package:")
open my $fh, "<", './Packages-00.txt' or die $!;
local $/; # enable localized slurp mode
my $all = <$fh>;
my @pks = split /^(Package):\s+/ms, $all;
close $fh;
# outer loop: Packages
for (my $p = 1, my $n = 0; $p < scalar @pks; $p +=2) {
  my $blk = "PackageName: " . $pks[$p + 1];
  my @inf = split /\s*^([\w-]+):\s+/ms, $blk;
  printf "%3d %s named %s\n", ++$n, $pks[$p], $inf[ 2 ];
  # inner loop: key/value pairs of one package
  for (my $x = 1; $x < scalar @inf; $x += 2) {
      if (exists($tagsoups{$inf[ $x ]})) {
          $tagsoups{$inf[ $x ]}($inf[$x + 1]);
      }
  }
}

Using a hash for the functions that are applied to the extracted parts keeps the details of generating the XML out of the parser loop.
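To illustrate the dispatch-table idea, here is a minimal sketch in which each handler returns an XML fragment instead of printing a debug line. The helper names xml_escape and element_writer, the element names, and the sample field values are assumptions made for this example; only the datatype URI is taken from the code in the question.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build a handler that wraps a field value in the given element.
# The element names are illustrative, not a fixed RDF vocabulary.
sub element_writer {
    my ($element) = @_;
    return sub {
        my ($value) = @_;
        return sprintf qq{    <%s rdf:datatype="http://www.w3.org/2001/XMLSchema#string">%s</%s>\n},
            $element, xml_escape($value), $element;
    };
}

# Dispatch table: field name => code ref producing the XML fragment.
my %emit = (
    PackageName => element_writer('name'),
    Description => element_writer('description'),
    Tag         => element_writer('tag'),
);

# Hypothetical field/value pairs, in the order a parser might find them.
my @inf = (
    PackageName => 'casper',
    Description => 'Run a "live" preinstalled system from read-only media',
    Tag         => 'admin::boot, role::plugin',
    Priority    => 'optional',    # no handler registered, so it is skipped
);

my $xml = '';
for (my $x = 0; $x < @inf; $x += 2) {
    my ($key, $value) = @inf[ $x, $x + 1 ];
    $xml .= $emit{$key}->($value) if exists $emit{$key};
}
print $xml;
```

Adding a new field then only means adding one entry to %emit; the loop itself does not change.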

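Since the best data type depends entirely on how you intend to use the data, you need to explain your goal a bit.

@Rob Raisch: I apologize for not putting the question at the beginning. Is it OK like this?

@Rob I only need to store it in a variable and copy it to a file.

OK, I just realized that I have been asking questions about the start of a large project, so it is hard to test all the answers in a short time. How exactly would an HTML parser reduce the workload? I am sorry for asking a newbie question; I am completely new to the world of the semantic web.

Sorry, I thought you were parsing an XML document. I have removed that from my answer. See above. Still, it looks like you are building an XML document, so perhaps you can use an HTML/XML package after all.

+1. However, the third instance ( (my $matched=$line)=~s/$desc(.*?)$tag/$1/; ) did not work for me; it only stripped the contents of $desc and $tag from $line, so $matched contained the rest of the line.

@user001 Yes, I don't know what I was thinking; most of these examples came off the top of my noggin while I was traveling. s/$desc(.*?)$tag/$1/ performs a substitution. For it to have the intended effect, the pattern needs to cover the rest of the line as well: s/.*$desc(.*?)$tag.*/$1/ (I think that will work)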
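The difference between the two substitutions in the last comment can be checked with a small sketch. The sample $line is made up, and the \Q...\E quoting and the /s modifier are additions to the pattern from the comment, used here so that the marker strings are taken literally and . can cross newlines.

```perl
use strict;
use warnings;

my $desc = 'Description:';
my $tag  = 'Tag:';
my $line = "Package: foo Description: some text Tag: devel::library";

# Markers only: the match is replaced by the capture, so the text before
# and after the markers stays in the result.
(my $stripped = $line) =~ s/\Q$desc\E(.*?)\Q$tag\E/$1/s;

# Whole string: .* on both sides is part of the match, so only the capture remains.
(my $matched = $line) =~ s/.*\Q$desc\E(.*?)\Q$tag\E.*/$1/s;

print "stripped: [$stripped]\n";
print "matched:  [$matched]\n";
```

Note that the leading .* is greedy, so if the input contains several $desc markers, the capture starts after the last one that is still followed by $tag.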


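Excerpt of the output produced by the split-based script (in this run every handler printed its value between dashed lines):

  3 Package named abrowser-3.5-branding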
      got the PackageName:
---------
abrowser-3.5-branding
----------
      got the Description:
---------
dummy upgrade package for firefox-3.5 -> firefox
 This is a transitional package so firefox-3.5 users get firefox on
 upgrades. It can be safely removed.
----------
  4 Package named casper
      got the PackageName:
---------
casper
----------
      got the Description:
---------
Run a "live" preinstalled system from read-only media
----------
      got the Tag:
---------
admin::boot, admin::filesystem, implemented-in::shell, protocol::smb, role::plugin, scope::utility, special::completely-tagged, works-with-format::iso9660
----------