Python: parsing HTML data with lxml
The code below gives me the q values, but it also gives me other details from the page that carry the class mod-content. How can I explicitly get only the q values? I am using lxml.
Tags: python, css-selectors, lxml
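import lxml.html as lh

# resp is the HTTP response object holding the page
doc = lh.fromstring(resp.read())
# Matches every <div class="mod-content"> on the page, not only the wanted one
for el in doc.cssselect('div.mod-content'):
    print(el.text_content())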
People
-
Assignee:
Multiple Assignees:
,
(1)
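First: if you are parsing HTML, chances are that a human wrote it by hand and that it will not validate. That is the case with the sample you posted, for example (two tags are missing…). Consider passing the markup to a parser that is specifically designed to tolerate such errors instead.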
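As a rough illustration of what a forgiving parser buys you, here is a minimal sketch using lxml.html, the module the question already imports as lh. This is an assumption about one suitable lenient parser, not necessarily the one the answer had in mind. The markup and the ids "outer" and "inner" are made up for the example.

```python
import lxml.html as lh

# Hypothetical broken markup: neither <div> is ever closed.
broken = '<div id="outer"><div id="inner"><p>only this text</p>'

# lxml.html repairs the tree instead of raising a parse error.
doc = lh.fromstring(broken)

# Look the inner container up by its id and read its text.
inner = doc.get_element_by_id("inner")
text = inner.text_content()
print(text)
```

A strict XML parser such as etree.fromstring would reject the same string, so this route only makes sense when the input is known to be HTML.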
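That said, if your question is just about how to extract "the textual part of the HTML", in other words how to convert HTML → plain text [as opposed to "extracting only the text contained in a specific HTML container"], the lxml.etree listing at the end of this page is a simple working example. The HTML under discussion: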
<div id="peoplemodule" class="module toggle-wrap">
<div class="mod-header">
<h3 class="toggle-title">People</h3>
</div>
<div class="mod-content">
<ul class="item-details" id="peopledetails">
<li class="people-details">
<dl>
<dt>Assignee:</dt>
<dd id="Assign-Val">
<a class="user-hover" rel="605794069" id="issue_summary_assignee_605794069" href="--------------"> AAAAAAAAAAAAA a>
</dd>
</dl>
<dl>
<dt>Reporter:</dt>
<dd id="Report-Val">
<a class="user-hover" rel="700843051" id="issue_summary_reporter_700843051" href="-------------------------">BBBBBBBBBBBBBB</a>
</dd>
</dl>
<dl><dt> </dt><dd> </dd></dl>
<dl>
<dt title="Multiple Assignees">Multiple Assignees:</dt>
<dd id="customfield_10020-val"> <div class="shorten" id="customfield_10020-field">
<span class="tinylink"> <a class="user-hover" rel="604810609" id="multiuser_cf_604810609" href------------------">FFFFFFFFFFFFFF</a></span>, <span class="tinylink"> <a class="user-hover" rel="600548483" id="multiuser_cf_600548483" href="------------------------------------">EEEEEEEEEEEEEEEEE</a></span> </div>
</dd>
</dl>
</li>
</ul>
<div id="watchers-val">
<a href="----------------------------------------" id="watching-toggle" rel="858270" title="Start watching this story"><span class="icon icon-watch-off"></span><span class="action-text">Watch</span></a>
(<span id="watcher-data">1</span>)
</div>
</div>
</div>
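HTH!
What "other details"? The snippet you shared contains only q's. Also, the answer depends heavily on the source of the particular site.
I forgot to mention that this snippet is a small part of the web page, and the mod-content class is used elsewhere too, so printing picks up those other values as well.
As I said, it depends on the site and on the content you are interested in. You need to make the selector specific enough. For example, if this is the only div you want, you can select it by its id, since that should be unique.
Hi Mac, thanks for your answer. I have edited my question. In this case the XPath expression can be refined further, right? To meet the required condition I need to extract text from it again. It gives an error; is that because of the structure of the page?
@VinodK - can you clarify your question? If you are trying to match only certain tags in the document, you can use something like print tree.find(".//h3").text [in the example from my answer this returns "Description"]... but as Avaris pointed out in the comments, you need to identify what is unique about the parts of the document you want to extract...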
from lxml import etree
content = '''<div id="descriptionmodule" class="module toggle-wrap">
<div class="mod-header">
<h3 class="toggle-title">Description</h3>
</div>
<div id="issue-description" class="mod-content">
<p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<ul class="alternate" type="square">
<li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul></div></div>'''
tree = etree.fromstring(content)
# Walk every text node in the document and print the non-blank ones
for bit in tree.xpath('//text()'):
    if bit.strip():  # you can insert any kind of test here
        print(bit.strip())
Output:
Description
qqqqqqqqqqqqq,
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
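The suggestion from the comments, selecting the wanted container by its unique id before reading any text, could look roughly like the sketch below. The page content is a shortened, hypothetical stand-in for the real page, and the fallback to the standard library's xml.etree.ElementTree is an assumption made so the snippet runs even without lxml. Only calls that both libraries share (fromstring, find, itertext) are used.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the standard library offers the same calls used below.
    import xml.etree.ElementTree as etree

# Shortened, hypothetical stand-in for the page: two containers share the
# class "mod-content", but only one has id="issue-description".
content = '''<div>
<div id="descriptionmodule" class="module toggle-wrap">
<div id="issue-description" class="mod-content"><p>qqq,<br/>qqq.</p><ul><li>qqq</li></ul></div>
</div>
<div id="peoplemodule" class="module toggle-wrap">
<div class="mod-content"><dl><dt>Assignee:</dt><dd>AAA</dd></dl></div>
</div>
</div>'''

tree = etree.fromstring(content)

# Select the one container of interest by its unique id ...
target = tree.find(".//div[@id='issue-description']")

# ... and keep only the non-blank text nodes beneath it.
wanted = [bit.strip() for bit in target.itertext() if bit.strip()]
print(wanted)  # ['qqq,', 'qqq.', 'qqq']
```

Because the lookup starts from the id rather than the shared class, the text of the "peoplemodule" block never enters the result.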