Extracting innerHTML from multiple tags

html regex perl text

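My task is to extract the inner HTML text from HTML links using Perl.

For example,

<a href="www.stackoverflow.com">Regex Question</a>

I want to extract the string: Regex Question

Note that the inner text might be empty, as shown below. This example yields an empty string.

<a href="www.stackoverflow.com"></a>

The inner text might also be enclosed in multiple tags, like this:

<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>

I have been trying to write a Perl regex for a while without success. In particular, I do not know how to handle the multiple tags.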

How about something like this:

(?<=>)[^<>\/]*(?=<\/)

It will match the string: Regex Question
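A minimal sketch of how the lookaround pattern above could be tried against the three inputs from the question, using core Perl only. The helper name `first_inner_text` is hypothetical and not part of the answer. Two things are worth noticing: the pattern never mentions the `a` tag, so it matches text before any closing tag, and it returns only the first run of text that is directly followed by a closing tag.

```perl
use strict;
use warnings;

# The lookaround pattern from the answer above (only the delimiters differ).
my $inner = qr{(?<=>)[^<>/]*(?=</)};

# Hypothetical helper: return the first match, or undef if nothing matches.
sub first_inner_text {
    my ($html) = @_;
    return $html =~ /($inner)/ ? $1 : undef;
}

my @samples = (
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>',
);

for my $html (@samples) {
    my $text = first_inner_text($html);
    print defined $text ? "'$text'" : 'no match', "\n";
}
```

Because `[^<>\/]*` excludes the slash, link text that itself contains a `/` would be cut short, which is one reason the parser-based answers are the safer route.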

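You should use an HTML parser, but it is probably possible with a regex. This finds open-to-close A-tag pairs with no nested A-tags, and it allows other tags within the content. If you want the A-tag content to contain no other tags at all, the regex would be slightly different (not shown).

Since you are using Perl, this might work: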

 # =~ /(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!\/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|<\/a\s*>)).)*)<\/a\s*>/

 (?s)
 <a                            # Begin A-tag, must (should) contain attrib/val's
 (?>
      \s+                      # (?!\s) add this if you think malformed '<a  >' could slip by
      (?: " .*? " | ' .*? ' | [^>]*? )+
      >
 )
 (?<! /> )                     # Lookbehind, ensure this is not a self-closed A-tag such as '<a href="x"/>'
 (                             # (1 start), Capture Content between open/close A-tags
      (?:                           # Cluster, match content
           (?!                           # Negative assertion
                (?>
                     <a                            # Not Start A-tag
                     (?>
                          \s+  
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                          >
                     )
                  |  </a \s* >                     #  and Not End A-tag
                )
           )
           .                             # Assert passed, consume a content character 
      )*                            # End Cluster, do 0 to many times
 )                             # (1 end)
 </a \s* >                     # End A-tag
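
As a rough sketch of how the pattern above might be applied, the following uses its one-line form in a global match. Capture group 1 holds the raw inner HTML, so nested tags such as `<b><h2>` are still present in it; the tag-stripping substitution is an added assumption, not part of the answer, and the helper name `link_texts` is hypothetical.

```perl
use strict;
use warnings;

# One-line form of the pattern above; capture group 1 is the raw inner HTML.
my $a_tag = qr{(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|</a\s*>)).)*)</a\s*>};

# Hypothetical helper: inner text of every <a ...>...</a> pair in $html.
sub link_texts {
    my ($html) = @_;
    my @texts;
    while ($html =~ /$a_tag/g) {
        my $content = $1;
        $content =~ s/<[^>]*>//g;    # naive tag stripping (assumption, not in the answer)
        push @texts, $content;
    }
    return @texts;
}

my $html = join "\n",
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>';

print "'$_'\n" for link_texts($html);
```

Note that the pattern requires whitespace after `<a`, so a bare `<a>` with no attributes is skipped, as the comment on the first line of the expanded form indicates.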

Try this. See the demo. Grab the capture or the match.

Parsing HTML with regular expressions is a bad idea; you are not Chuck Norris. You can use the Mojo::DOM module, which makes the task very simple.

Sample:

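use strict;
use warnings;
use feature 'say';   # say is not enabled by default
use Mojo::DOM;

# Parse
my $dom = Mojo::DOM->new('<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>');

# Find: all_text includes text from nested tags, whereas text covers only the element's own direct text
say $dom->at('a')->all_text;
say $dom->find('a')->map('all_text')->join("\n");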

To install Mojo::DOM, just type the following command:

$ cpan Mojo::DOM

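Use an HTML parser to parse HTML.

I'd recommend taking a look at Mojo::DOM, which the example below uses; the same distribution, Mojolicious, also covers downloading the content from the web if you need that.

The following pulls all links whose href contains stackoverflow.com and displays the text inside them: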

use strict;
use warnings;

use Mojo::DOM;
use Data::Dump;

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

for my $link ($dom->find('a[href*="stackoverflow.com"]')->each) {
    dd $link->all_text;
}

__DATA__
<html>
<body>
<a href="www.stackoverflow.com">Regex Question</a>
I want to extract the string: Regex Question

<a href="www.notme.com">Don't want this link</a>
Note that, the inner text might be empty like this. This example get an empty string.

<a href="www.stackoverflow.com"></a>
and the inner text might be enclosed with multiple tags like this.

<a href="www.stackoverflow.com"><b><h2>Regex Question with tags</h2></b></a>
</body>
</html>

There is also a helpful 8-minute introductory video on the subject.

Why use a regex rather than a parser? And what do you actually mean by "handle" them? If whatever is between the A-tags matches, they will be matched, right?

Perl has some very good HTML parser modules available.

This one looks simple, but it does not match the empty string.

It works, except that it also matches the outer tag KBJHKB

I don't know much about HTML tags, but it still does not match helloworld.