Mojo::DOM HTML extraction


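I am trying to extract quite a lot of data from a perfectly structured web page, and I am struggling with the various approaches. I would be grateful if someone could point me in the right direction.

The truncated HTML containing the interesting data looks like this: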

 <div class="post" data-story-id="3964117" data-visited="false">//extracting story-id
  <h2 class="post_title page_title"><a href="http://example.com/story/some_url" class="to-comments">header.</a></h2>
  //useless data and tags

<a href="http://example.com/story/some_url" class="b-story__show-all">
  <span>useless data</span>
</a>

<div class="post_tags">
  <ul>
    <li class="post_tag post_tag_strawberry hidden"><a href="http://example.com/search.php?n=32&r=3">&nbsp;</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag1</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag2/hot">tag2</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag3</a></li>
  </ul>
</div>

<div class="post_actions_box">

  <div class="post_rating_box">
    <ul data-story-id="3964117" data-vote="0" data-can-vote="true">
      <li><span class="post_rating post_rating_up control">&nbsp;</span></li>
      <li><span class="post_rating_count control label">1956</span></li> //1956 - interesting value
      <li><span class="post_rating post_rating_down control">&nbsp;</span></li>
    </ul>
  </div>

  <div class="post_more_box">
    <ul>
      <li>
        <span class="post_more control">&nbsp;</span>
      </li>
      <li>
        <a class="post_comments_count label to-comments" href="http://example.com/story/some_url#comments">132&nbsp;<i>&nbsp;</i></a>
      </li>
    </ul>
  </div>

</div>
</div>

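The value in "post_rating_count control label" is not extracted. By searching for a.to-comments and returning attr('href') I can get the first href value, but for some reason the search also returns the value of the link at the end of the block, the one with class="post_comments_count label to-comments". The same thing happens when extracting the header value.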
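
The double match has a plain cause: in the sample markup both the title link and the comments-count link carry the class to-comments, so a bare a.to-comments selector matches both. A minimal sketch, assuming the abbreviated markup in $html stands in for one post (it is not the real page), compares the unscoped selector with one scoped to the heading:

```perl
use strict;
use warnings;

use Mojo::DOM;

# Abbreviated stand-in for one post from the sample above (an assumption).
my $html = <<'HTML';
<div class="post" data-story-id="3964117">
  <h2 class="post_title page_title"><a href="http://example.com/story/some_url" class="to-comments">header.</a></h2>
  <a class="post_comments_count label to-comments" href="http://example.com/story/some_url#comments">132</a>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# Unscoped: every anchor carrying the class, so both links match.
my @all = $dom->find('a.to-comments')->map( attr => 'href' )->each;

# Scoped to the heading: only the title link matches.
my @title = $dom->find('h2.post_title > a.to-comments')->map( attr => 'href' )->each;

printf "unscoped: %d match(es)\n", scalar @all;
printf "scoped:   %d match(es)\n", scalar @title;
```

Scoping by the parent element is only one option; selecting the first match with at instead of find would also avoid the second link.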

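In the end, I am looking for an array of data structures with the following fields:

story id (this one works)
href (somehow it matches more than it should)
header (somehow it matches more than it should)
tag list as a string (no idea how to do this)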
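
One possible shape for that result, sketched under assumptions: the key names id, href, header and tags are made up for illustration, the markup in $html is an abbreviated stand-in for the page, and the hidden placeholder tag is skipped with :not(.hidden), which may or may not be wanted.

```perl
use strict;
use warnings;

use Mojo::DOM;

# Abbreviated stand-in for the page (an assumption, not the real markup).
my $html = <<'HTML';
<div class="post" data-story-id="3964117">
  <h2 class="post_title page_title"><a href="http://example.com/story/some_url" class="to-comments">header.</a></h2>
  <div class="post_tags"><ul>
    <li class="post_tag post_tag_strawberry hidden"><a href="http://example.com/search.php?n=32&amp;r=3">&nbsp;</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag1</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag2/hot">tag2</a></li>
  </ul></div>
</div>
HTML

my @posts;
for my $post ( Mojo::DOM->new($html)->find('div.post')->each ) {

    # Skip any post that has no title link instead of dying on undef.
    my $anchor = $post->at('h2.post_title > a') or next;

    # Visible tags only; the hidden placeholder item is excluded.
    my @tags = $post->find('li.post_tag:not(.hidden) > a')->map('text')->each;

    push @posts, {
        id     => $post->attr('data-story-id'),
        href   => $anchor->attr('href'),
        header => $anchor->text,
        tags   => join( ', ', @tags ),    # tag list as a single string
    };
}

printf "%s | %s | %s | %s\n", @{$_}{qw( id href header tags )} for @posts;
```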

What is more, I feel the code could be tidied up to look a bit nicer, but my kung fu is not that strong.

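As I said in my comment, your HTML is malformed. I have guessed where the missing tag might go, but I may be wrong. I have assumed that the final </div> in your data corresponds to the first <div class="post">, so that the whole block forms a single post.

The main problem you are having is that you are trying to do everything in method calls on the Mojo::Collection object. It is much easier to iterate over each collection using Perl, like this:
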
use strict;
use warnings;

use Mojo::DOM;

use constant HTML_FILE => 'index2.html';

my $html = do {
    open my $fh, '<', HTML_FILE or die $!;
    local $/;
    <$fh>;
};

my $dom = Mojo::DOM->new($html);

for my $post ( $dom->find('div.post')->each ) {

    printf "Post ID:     %s\n", $post->attr('data-story-id');

    my $anchor = $post->at('h2.post_title > a');
    printf "Post href:   %s\n", $anchor->attr('href');
    printf "Post header: %s\n", $anchor->text;

    my @tags = $post->find('li.post_tag > a')->map('text')->each;

    printf "Tags:        %s\n", join ', ', @tags;

    print "\n";
}
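
The rating count mentioned in the question can be read in a similar way. The following is a sketch, not part of the code above: the markup in $html is an abbreviated stand-in for one post, and the selectors span.post_rating_count and a.post_comments_count are taken from the class names in the sample HTML.

```perl
use strict;
use warnings;

use Mojo::DOM;

# Abbreviated stand-in for one post (an assumption, not the real page).
my $html = <<'HTML';
<div class="post" data-story-id="3964117">
  <ul data-story-id="3964117" data-vote="0" data-can-vote="true">
    <li><span class="post_rating_count control label">1956</span></li>
  </ul>
  <a class="post_comments_count label to-comments" href="http://example.com/story/some_url#comments">132&nbsp;<i>&nbsp;</i></a>
</div>
HTML

my $post = Mojo::DOM->new($html)->at('div.post');

# at() returns undef when nothing matches, so check before calling text.
my $rating_el   = $post->at('span.post_rating_count');
my $comments_el = $post->at('a.post_comments_count');

my $rating = $rating_el ? $rating_el->text : undef;

# The link text is "132" followed by a non-breaking space; keep the digits only.
my ($comments) = $comments_el ? $comments_el->text =~ /(\d+)/ : ();

printf "Rating:      %s\n", $rating   // 'n/a';
printf "Comments:    %s\n", $comments // 'n/a';
```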

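Your HTML is malformed. You have a superfluous closing tag, which makes me wonder whether you removed the opening tag together with the "useless content and html structure". Can you fix it?
Yes, indeed, sorry. I have updated the code, thanks.
Isn't there a single element that contains all of the data for each post? In other words, does the next one really follow immediately after?
Please see the test I have just run. May I ask you a few questions in the answer section?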
3964117
Header
132
