Perl: parsing HTML to get the elements between two given elements

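I am trying to parse a page such as http://en.wikipedia.org/wiki/Wayne_Rooney, and I only want the paragraphs that follow the title, which I take to be the introduction.

I want everything between the infobox table (class "infobox vcard") and the table of contents (the table with id "toc"), including the paragraph tags. Even getting just the first paragraph with a simple CSS selector:

div#bodyContent div#mw-content-text.mw-content-ltr p

does not always work, because sometimes something inside the infobox table contains a paragraph. The number of introductory paragraphs also varies. If anyone has a better approach than the one I am attempting here, I would accept that too.

--

Code added on request, kept as short as possible: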

use strict;
use warnings;

use URI;
use HTTP::Request;
use LWP::UserAgent;
use HTML::Query 'Query';

# Fetch the page.
my $pageurl = "http://en.wikipedia.org/wiki/Wayne_Rooney";
my $wikiurl = URI->new($pageurl);
my $wikirequest = HTTP::Request->new(GET => $wikiurl);
my $wikiua = LWP::UserAgent->new;
my $wikiresponse = $wikiua->request($wikirequest);
my $pagetoparse = $wikiresponse->decoded_content;

# Select every paragraph in the article body.
my $q2 = Query(text => $pagetoparse);
my @wikiintro = $q2->query('div#bodyContent div#mw-content-text.mw-content-ltr p')->get_elements();
my $pageintro = "unavailable";
if(@wikiintro) {
    # The first match is sometimes a paragraph inside the infobox table; skip it.
    if(index($wikiintro[0]->as_text(), "Appearances (Goals)") != -1){
        $pageintro = $wikiintro[1]->as_text() if @wikiintro > 1;
    } else {
        $pageintro = $wikiintro[0]->as_text();
    }
}

One way, using the non-core module HTML::TreeBuilder.

Contents of script.pl:

#!/usr/bin/env perl

use warnings;
use strict;
use HTML::TreeBuilder;

my (@p);

## Read the web page.
my $root = HTML::TreeBuilder->new_from_url( shift ) or die qq|ERROR: Malformed URL\n|;

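## Get the table tag with id = 'toc'.
my $table_toc = $root->look_down(
    id => 'toc'
) or die qq|ERROR: No element with id 'toc'\n|;

## Get immediate previous sibling <p> tags.
for my $node ( reverse $table_toc->left ) {
    ## left() may also return plain text, which has no tag method.
    if ( ref $node && $node->tag eq 'p' ) {
        unshift @p, $node;
    }
    else {
        last;
    }
}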

## Print the content without the HTML tags.
for my $p ( @p ) { 
    printf qq|%s\n|, $p->as_text;
}
The output should, I hope, be close to what you expected.

Edit: to get the result with its HTML markup, use printf qq|%s\n|, $p->as_HTML instead of $p->as_text.
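To try the sibling walk without fetching anything, here is a minimal sketch that runs against an inline HTML snippet. The snippet and the helper name intro_paragraphs are made up for illustration and only imitate the layout described in the question (infobox table, intro paragraphs, then the table with id "toc"). It prints each collected paragraph once with as_text and once with as_HTML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the page: the intro paragraphs sit between
# the infobox table and the table of contents.
my $html = <<'HTML';
<html><body><div id="content">
<table class="infobox vcard"><tr><td><p>Appearances (Goals)</p></td></tr></table>
<p>First <b>intro</b> paragraph.</p>
<p>Second intro paragraph.</p>
<table id="toc"><tr><td>Contents</td></tr></table>
<p>Body paragraph.</p>
</div></body></html>
HTML

# Collect the run of <p> elements immediately before the element with id "toc".
sub intro_paragraphs {
    my ($tree) = @_;
    my $toc = $tree->look_down( id => 'toc' ) or return;
    my @p;
    for my $node ( reverse $toc->left ) {
        # left() can return plain text as well as elements;
        # ignore whitespace-only text between elements.
        next if !ref $node && $node !~ /\S/;
        last unless ref $node && $node->tag eq 'p';
        unshift @p, $node;
    }
    return @p;
}

my $tree  = HTML::TreeBuilder->new_from_content($html);
my @intro = intro_paragraphs($tree);
print $_->as_text, "\n" for @intro;    # text only
print $_->as_HTML, "\n" for @intro;    # markup preserved
$tree->delete;
```

The paragraph nested inside the infobox is never considered, because only siblings of the "toc" table are examined.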

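HTML::TreeBuilder is probably the best tool for this. The trick is learning, and choosing from, the many methods it offers for navigating the HTML tree.

This program seems to do what you need. It calls look_down to find the table with the given class that precedes the required output. From there, a call to right returns all the elements that follow that table at the same level of the HTML hierarchy. The loop simply prints each of those elements until it reaches one whose tag is not p.

I wrote this using LWP::UserAgent, but if you update your copy of HTML::TreeBuilder so that the new_from_url constructor is available, the code can be more concise.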

use strict;
use warnings;

use LWP;
use HTML::TreeBuilder;

binmode STDOUT, ':utf8';

my $url = 'http://en.wikipedia.org/wiki/Wayne_Rooney';
my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;

my $tree = HTML::TreeBuilder->new_from_content($resp->decoded_content);

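# Find the infobox table that precedes the introduction.
my $start = $tree->look_down(_tag => 'table', class => 'infobox vcard')
  or die "No infobox table found\n";

# Print the following siblings until the first one that is not a <p>.
for ($start->right) {
  last unless ref $_ && $_->tag eq 'p';
  print $_->as_trimmed_text . "\n\n";
}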
Output

Wayne Mark Rooney (born 24 October 1985) is an English footballer who plays as a forward for Premier League club Manchester United and the England national team.

Rooney made his senior international debut in 2003 becoming the youngest player to represent England, until he got beaten by Theo Walcott. He is England's youngest ever goalscorer.[4] He played at UEFA Euro 2004 and scored four goals, briefly becoming the competition's youngest goalscorer. Rooney featured at the 2006 and 2010 World Cups and is widely regarded as his country's best player.[5][6][7][8][9][10] He has won the England Player of the Year award twice, in 2008 and 2009. As of October 2012, he has won 78 international caps and scored 32 goals, making him England's fifth highest goalscorer in history.[11] Along with David Beckham, Rooney is the most red carded player for England, having been sent off twice.[12]

Aged nine, Rooney joined the youth team of Everton, for whom he made his professional debut in 2002. He spent two seasons at the Merseyside club, before moving to Manchester United for £25.6 million in the 2004 summer transfer window. The same year, Rooney acquired the nickname "Wazza".[13] Since then, with Rooney in the team, United have won the Premier League four times, the 2007–08 UEFA Champions League and two League Cups. He also holds two runner-up medals from the Champions League and has twice finished second in the Premier League. In April 2012, Rooney scored his 180th goal, making him United's fourth-highest goal-scorer of all time.[14]

In 2009–10, Rooney was awarded the PFA Players' Player of the Year and the FWA Footballer of the Year. He has won the Premier League Player of the Month award five times, a record he shares with Steven Gerrard. He came fifth in the vote for the 2011 FIFA Ballon d'Or and was named in the FIFPro World 11 for 2011. Rooney has won the 'Goal of the Season' award by the BBC's Match of the Day poll on three occasions, with his bicycle kick against rivals Manchester City winning the 'Premier League Goal of the 20 Seasons' award.[15] Rooney is the third highest-paid footballer in the world after Lionel Messi and Cristiano Ronaldo, with an annual income of €20.7m (£18m) including sponsorship deals.[16]

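I did say what I have already tried: "using a simple CSS selector to get the first paragraph". That is how I tried to get the intro paragraphs, and I explained why it does not work and why, even if it did, I would still want the other paragraphs. As for how to get what lies in between, I have googled many different queries and could not find anything.

Can you show us some code? Which modules are you using? Depending on the module, there may be some tricks to get the right output.

I will post the code I have now, but I should repeat: all it does is get the first paragraph. I have found a way around the problem where it sometimes picks a paragraph inside the table with class "infobox", as you can see in the code. However, I still do not think this is the right approach, because I do not see how I could get the other paragraphs or know when to stop (when the table with id "toc" begins). In other words, it has almost nothing to do with what I am trying to do. It is more of a placeholder until I work out how to get what lies between the elements.

I tried merging it into my existing script but got some errors. Anyway, I made it a script of its own and tried running it. I keep getting an error saying the object method "new_from_url" cannot be found via package "HTML::TreeBuilder". Also: I do have LWP::UserAgent.

Never mind, I am now using new_from_content() with the page content I already fetch in the code from my original post. I will accept the answer once I have this working in my intended application.

@MarkLyons: The new_from_url constructor is available in recent versions of HTML::TreeBuilder. You should consider updating your module library.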
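If updating is not an option right away, one possible workaround is to test for the constructor at run time and fall back to LWP::UserAgent. This is only a sketch: the helper name tree_from_url is hypothetical, and it assumes LWP::UserAgent is installed, as stated in the comments above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Build a tree from a URL. Uses new_from_url when the installed
# HTML::TreeBuilder provides it, and fetches with LWP::UserAgent otherwise.
# (tree_from_url is a hypothetical helper name, not part of any module.)
sub tree_from_url {
    my ($url) = @_;
    return HTML::TreeBuilder->new_from_url($url)
        if HTML::TreeBuilder->can('new_from_url');

    require LWP::UserAgent;
    my $resp = LWP::UserAgent->new->get($url);
    die $resp->status_line unless $resp->is_success;
    return HTML::TreeBuilder->new_from_content( $resp->decoded_content );
}

# Report which path this installation would take.
printf "HTML::TreeBuilder %s %s new_from_url\n",
    HTML::TreeBuilder->VERSION,
    HTML::TreeBuilder->can('new_from_url') ? 'provides' : 'lacks';
```

Calling tree_from_url('http://en.wikipedia.org/wiki/Wayne_Rooney') would then return a tree on either kind of installation, and the rest of the script stays the same.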