Regex to extract a block of text where the end expression depends on the start expression

regex perl


I have a text string structured as follows:

= Some Heading (1)

Some text

== Some Sub-Heading (2)

Some more text

=== Some Sub-sub-heading (3)

Some details here

= Some other Heading (4)

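I want to extract the content of the second heading, including any subsections. I do not know in advance how deep the second heading is, so I need to match from there to the next heading of the same depth, or a shallower one, or the end of the string.

In the example above, this would yield: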

== Some Sub-Heading (2)

Some more text

=== Some Sub-sub-heading (3)

Some details here

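This is where I am stuck. How can I use the subexpression that matched the opening of the second heading as part of the subexpression that closes the section?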

#!/usr/bin/perl

my $all_lines = join "", <>;

# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\1 matches the 1st matched group)
if ( $all_lines =~ /(=+ Heading )\([2]\)(.*?)\1/s ) {
    print "$2";
}

This splits the file into sections:

my @all = split /(?=^= )/m, join "", <$filehandle>;
shift @all;
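
To see what that split actually returns, here is a minimal sketch run against the sample from the question. The variable names `$sample` and `@sections` are assumptions made for illustration. The lookahead is zero-width, so each heading stays attached to its own chunk, and because the pattern only matches lines starting with a single `= `, the deeper `==` and `===` headings remain inside their parent chunk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input copied from the question.
my $sample = <<'END';
= Some Heading (1)

Some text

== Some Sub-Heading (2)

Some more text

=== Some Sub-sub-heading (3)

Some details here

= Some other Heading (4)
END

# Split before every line that starts with "= " (a top-level heading).
# (?=...) consumes nothing, so the heading text is kept in its chunk.
my @sections = split /(?=^= )/m, $sample;

for my $i ( 0 .. $#sections ) {
    my ($first_line) = $sections[$i] =~ /\A([^\n]*)/;
    printf "chunk %d starts with: %s\n", $i, $first_line;
}
```

Note that Perl's split never produces an empty leading field for a zero-width match at the start of the string, so when the input begins with a heading, as it does here, the first element is already the first section. The `shift @all` in the snippet above therefore presumably assumes some preamble text before the first heading.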


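I wouldn't try to use a complex regex for this. Instead, write a simple parser and build a tree.

Here is a rough-and-ready implementation. It has only been optimized for lazy coding. You may want to use libraries from CPAN to build your parser and your tree nodes.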

#!/usr/bin/perl

use strict;
use warnings;

my $document = Node->new();
my $current = $document;

while ( my $line = <DATA> ) {

    if ( $line =~ /^=+\s/ ) {

        my $current_depth = $current->depth;
        my $line_depth = Node->Heading_Depth( $line );

        if ( $line_depth > $current_depth ) {
            # child node.
            my $line_node = Node->new();
            $line_node->heading( $line );
            $line_node->parent( $current );
            $current->add_children( $line_node );
            $current = $line_node;
        }
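        else {
            # Same depth or shallower: climb until $current is the first
            # node shallower than the new heading. The root node has
            # depth 0, so the climb always stops there at the latest.
            while ( $current->depth >= $line_depth ) {
                $current = $current->parent;
            }

            # Attach the new heading under the node we stopped at.
            my $line_node = Node->new();
            $line_node->heading( $line );
            $line_node->parent( $current );
            $current->add_children( $line_node );
            $current = $line_node;
        }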

    }
    else {
        $current->add_children( $line );
    }


}

use Data::Dumper;
print Dumper $document;

BEGIN {
    package Node;
    use Scalar::Util qw(weaken blessed );

    sub new {
        my $class = shift;

        my $self = {
            children => [],
            parent   => undef,
            heading  => undef,
        };

        bless $self, $class;
    }

    sub heading {
        my $self = shift;
        if ( @_ ) {
            $self->{heading} = shift;
        }
        return $self->{heading};
    }

    sub depth {
        my $self = shift;

        return $self->Heading_Depth( $self->heading );
    }

    sub parent {
        my $self = shift;
        if ( @_ ) {
            $self->{parent} = shift;
            weaken $self->{parent};
        }
        return $self->{parent};
    }

    sub children {
        my $self = shift;
        return @{ $self->{children} || [] };
    }

    sub add_children {
        my $self = shift;
        push @{$self->{children}}, @_;
    }

    sub stringify {
        my $self = shift;

        my $text = $self->heading;
        foreach my $child ( $self->children ) {
            no warnings 'uninitialized';
            $text .= blessed($child) ? $child->stringify : $child;
        }

        return $text;
    }

    sub Heading_Depth {
        my $class  = shift;
        my $heading = shift || '';

        $heading =~ /^(=*)/;
        my $depth = length $1;


        return $depth;
    }

}

__DATA__
= Heading (1)

Some text

= Heading (2)

Some more text

== Subheading (3)

Some details here

== Subheading (3)

Some details here

= Heading (4)


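daotoad and jrockway are absolutely right. If you are trying to parse a tree-like data structure, bending regexes to your will only results in a fragile, incomprehensible, and still not general enough block of complex code.

That said, if you insist, here is a revised snippet that works. Matching up to a separator of the same depth or the end of the string is one complication. Matching up to a heading whose depth is less than or equal to the current depth is more challenging and requires two steps.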

#!/usr/bin/perl

my $all_lines = join "", <>;
# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\2 matches the 2nd parenthesized group)
if ( $all_lines =~ m/((=+) [^\n]*\(2\)(.*?))(\n\2 |\z)/s ) {
    # then trim it down to just the point before any heading at lesser depth
    my $some_lines = $1;
    my $depth = length($2);
    if ($some_lines =~ m/(.*?)(\n={1,$depth} |\z)/s) {
        print "$1\n";
    }
}


But my recommendation is to avoid this route and parse it with something readable and maintainable.
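
As one possible reading of "readable and maintainable", here is a minimal sketch of a line-by-line scanner that tracks heading depth instead of relying on backreferences. The helper name `extract_section` and its arguments are hypothetical and do not come from any of the answers. It assumes headings are lines that start with one or more `=` followed by whitespace, as in the question's sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_section is a hypothetical helper written for illustration.
# It returns the first heading line matching $pattern together with all
# following lines, stopping at the next heading of equal or lesser depth
# or at the end of the text.
sub extract_section {
    my ( $text, $pattern ) = @_;

    my $depth;       # depth of the wanted heading, undef until it is found
    my @captured;

    for my $line ( split /^/m, $text ) {    # split into lines, keeping "\n"
        if ( $line =~ /^(=+)\s/ ) {
            my $line_depth = length $1;
            if ( defined $depth ) {
                # Already capturing: an equal or shallower heading ends it.
                last if $line_depth <= $depth;
            }
            elsif ( $line =~ $pattern ) {
                # Found the heading we want; remember how deep it is.
                $depth = $line_depth;
            }
        }
        push @captured, $line if defined $depth;
    }

    return join '', @captured;
}

# Sample input copied from the question.
my $sample = <<'END';
= Some Heading (1)

Some text

== Some Sub-Heading (2)

Some more text

=== Some Sub-sub-heading (3)

Some details here

= Some other Heading (4)
END

print extract_section( $sample, qr/\(2\)/ );
```

Because the depth is stored in an ordinary variable, the "same depth or shallower" rule becomes a single numeric comparison, which is the part that is awkward to express inside one regex.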

Just for giggles:

/^(?>(=+).*\(2\))(?>[\r\n]+(?=\1=|[^=]).*)*/m

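The lookahead ensures that, if a line starts with an equals sign, it has at least one more equals sign than the prefix of the original line. Note that the second part of the lookahead matches any character other than an equals sign, including a linefeed or carriage return. That lets it match an empty line, but not the end of the string.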
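
To try that pattern out, here is a small sketch that applies it unchanged to the sample from the question. The wrapper `match_section` is a hypothetical name added for illustration; only the regex itself comes from the answer. The matched text is recovered through the `@-` and `@+` offset arrays so that no extra capture group has to be added, which would renumber `\1`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# match_section is a hypothetical wrapper; the pattern is the one above.
# Returns the matched section, or undef when no heading ends in "(2)".
sub match_section {
    my ($text) = @_;
    return $text =~ /^(?>(=+).*\(2\))(?>[\r\n]+(?=\1=|[^=]).*)*/m
        ? substr( $text, $-[0], $+[0] - $-[0] )    # whole match, by offsets
        : undef;
}

# Sample input copied from the question.
my $sample = <<'END';
= Some Heading (1)

Some text

== Some Sub-Heading (2)

Some more text

=== Some Sub-sub-heading (3)

Some details here

= Some other Heading (4)
END

my $section = match_section($sample);
print defined $section ? $section : "no match\n";
```

On this input the match starts at the `==` heading and stops before `= Some other Heading (4)`, because a line with a single `=` satisfies neither `\1=` (which needs at least `===` here) nor `[^=]`.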

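I'm not quite sure what you are trying to do, but this sounds like a case where you don't want to use regexen. I'd suggest Parse::RecDescent (or Parse::Yapp if you prefer LALR(1) parsers to recursive descent parsers).

Your update clarifies it somewhat, but it is still vague. I think the easiest approach is to parse it into a tree and then extract the data you want from that.

Thanks, but this doesn't handle the case where the block is terminated by a heading that is shallower than the starting one. I will edit the question to emphasize that.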