Python - Problem parsing website search results with LXML
I am trying to use LXML to parse the search results returned from this search URL:
http://www.rte.ie/player/ie/search/?q=news
The article markup returned in the HTML looks like this:
<article class="search-result clearfix"><a
href="/player/ie/show/10117771/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/0005d4bf-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117771/">elev8</a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117771/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">Ivan and Sean talk to future basketball sensation Julian Newman and the <span class="search-highlight">News</span> Dudes are in the loft with some crazy <span class="search-highlight">news</span> stories.</p>
<span
class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10118015/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000716b2-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10118015/">One <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10118015/">Wed 06 Mar 2013</a></p>
<!-- p class="search-programme-date">06/03/2013</p -->
<p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117836/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/00071614-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117836/"><span class="search-highlight">News</span> on Two and World Forecast</a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117836/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">All the <span class="search-highlight">news</span> and sport from home and abroad.</p>
<span
class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117816/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000715f2-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117816/">Nine <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117816/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">The Nine <span class="search-highlight">News</span> followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117789/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000715ae-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117789/">Six One <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117789/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">The Six One <span class="search-highlight">News</span> and Sport followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117784/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/000715a0-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117784/">Nuacht and <span class="search-highlight">News</span> with Signing</a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117784/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">Nuacht and <span class="search-highlight">News</span> with Signing.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117770/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/0007158d-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117770/"><span class="search-highlight">News</span>2Day</a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117770/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">Domestic and international <span class="search-highlight">news</span> items of interest to younger viewers.</p>
<span
class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
</article>
<article class="search-result clearfix"><a
href="/player/ie/show/10117728/" class="thumbnail-programme-link"><span
class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
src="http://img.rasset.ie/0007154e-261.jpg"></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10117728/">One <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10117728/">Tue 05 Mar 2013</a></p>
<!-- p class="search-programme-date">05/03/2013</p -->
<p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
<span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
</article>
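The problem fields are name_tmp and short_tmp: because of the search-highlight span tags, they are cut short and lose the full text. Can anyone think of a way to parse the full text, or to ignore the span tags?
Sorry for the long post…
I think you can use the itertext() method on a node to get the content of all of its descendant text nodes. Join that text for both problem fields, name_tmp (the title element) and short_tmp (the description element). For the description, you are looking for:
    short_tmp = ''.join(elem[4].itertext())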
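To see why the span tags cut the text short, here is a minimal offline sketch rather than the asker's actual code. It assumes a trimmed, well-formed copy of one result from the markup above, looks elements up by tag instead of by index, and falls back to the standard library's ElementTree (which offers the same `.text` and `itertext()` calls) if lxml is not installed. The names `SAMPLE` and `truncated_name` are made up for the illustration.

```python
try:
    from lxml import etree as ET  # the library used in this thread
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with the same interface

# Trimmed, well-formed copy of one result from the markup in the question.
SAMPLE = (
    '<article class="search-result clearfix">'
    '<h3 class="search-programme-title"><a href="/player/ie/show/10118015/">'
    'One <span class="search-highlight">News</span></a></h3>'
    '<p class="search-programme-description">The One O\'Clock '
    '<span class="search-highlight">News</span> followed by Weather.</p>'
    '</article>'
)

article = ET.fromstring(SAMPLE)
link = article.find("h3/a")
desc = article.find("p")

# .text only covers the text before the first child element.
truncated_name = link.text
# itertext() yields every descendant text node, in document order.
name_tmp = "".join(link.itertext())
short_tmp = "".join(desc.itertext())

print(repr(truncated_name))
print(name_tmp)
print(short_tmp)
```

The first print shows the title stopping right before the highlighted word, which matches the symptom described in the question; the joined versions contain the whole text.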
With these fixes in place, your code prints:
icon_url http://img.rasset.ie/0005d4bf-261.jpg
name_tmp elev8
stream /player/ie/show/10117771/
date_tmp Tue 05 Mar 2013
short_tmp Ivan and Sean talk to future basketball sensation Julian Newman and the News Dudes are in the loft with some crazy news stories.
channel RTÉ 2
icon_url http://img.rasset.ie/000716b2-261.jpg
name_tmp One News
stream /player/ie/show/10118015/
date_tmp Wed 06 Mar 2013
short_tmp The One O'Clock News followed by Weather.
channel RTÉ 1
and so on. You can make it more readable and more robust with something like this:
from lxml import html

# Fetch and parse the search results page.
tree = html.parse("http://www.rte.ie/player/ie/search/?q=news")

for article in tree.xpath('//article[@class="search-result clearfix"]'):
    # Return the first element inside this article that matches a CSS selector
    # (cssselect() needs the separate "cssselect" package).
    select = lambda expr: article.cssselect(expr)[0]
    title = select(".search-programme-title")
    info = dict(
        icon_url=select("img.thumbnail").get('src'),
        # text_content() also includes text inside nested <span> tags.
        name=title.text_content(),
        stream=title.find('a').get('href'),
        date=select(".search-programme-episodes").text_content(),
        short=select(".search-programme-description").text_content(),
        channel=select(".search-channel-icon").text_content())
    print(info)
Output:
{'short': 'Ivan and Sean talk to future basketball sensation Julian Newman and the News Dudes are in the loft with some crazy news stories.', 'stream': '/player/ie/show/10117771/', 'name': 'elev8', 'date': 'Tue 05 Mar 2013', 'icon_url': 'http://img.rasset.ie/0005d4bf-261.jpg', 'channel': 'RTÉ 2'}
{'short': "The One O'Clock News followed by Weather.", 'stream': '/player/ie/show/10118015/', 'name': 'One News', 'date': 'Wed 06 Mar 2013', 'icon_url': 'http://img.rasset.ie/000716b2-261.jpg', 'channel': 'RTÉ 1'}
...
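One thing the dict-based version above does not cover is a result with a missing element: indexing `[0]` on an empty match raises IndexError. The following is only a sketch of one way to tolerate that, not part of the original answer. It assumes a trimmed, well-formed copy of the markup so it can run offline (on the live page you would still parse with `html.parse` as above), and `BASE_URL`, `full_text` and `extract` are hypothetical names introduced for the example.

```python
from urllib.parse import urljoin

try:
    from lxml import etree as ET  # the library used in this thread
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback, same calls used here

BASE_URL = "http://www.rte.ie"  # assumption: site root for the relative hrefs

# Trimmed, well-formed copy of the markup from the question. The second
# article deliberately lacks most fields to exercise the missing-field path.
SAMPLE = """<div>
<article class="search-result clearfix">
<a href="/player/ie/show/10118015/" class="thumbnail-programme-link"><img class="thumbnail" alt="Watch Now" src="http://img.rasset.ie/000716b2-261.jpg"/></a>
<h3 class="search-programme-title"><a href="/player/ie/show/10118015/">One <span class="search-highlight">News</span></a></h3>
<p class="search-programme-episodes"><a href="/player/ie/show/10118015/">Wed 06 Mar 2013</a></p>
<p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
</article>
<article class="search-result clearfix">
<h3 class="search-programme-title"><a href="/player/ie/show/10117770/"><span class="search-highlight">News</span>2Day</a></h3>
</article>
</div>"""


def full_text(element):
    """Join all descendant text of element, or return None if it is missing."""
    if element is None:
        return None
    return "".join(element.itertext()).strip()


def extract(article):
    """Build one result dict; absent fields come back as None."""
    link = article.find("h3[@class='search-programme-title']/a")
    img = article.find(".//img[@class='thumbnail']")
    return {
        "name": full_text(link),
        # Compare with None explicitly: childless elements are falsy.
        "stream": urljoin(BASE_URL, link.get("href")) if link is not None else None,
        "icon_url": img.get("src") if img is not None else None,
        "date": full_text(article.find("p[@class='search-programme-episodes']")),
        "short": full_text(article.find("p[@class='search-programme-description']")),
    }


results = [extract(a) for a in ET.fromstring(SAMPLE).iter("article")]
for info in results:
    print(info)
```

Returning None for absent fields keeps one incomplete result from aborting the whole loop, and `urljoin` turns the relative `stream` path into an absolute URL in case you need to fetch it later.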