Python: Parsing HTML data with lxml

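This gives me the q's, but it also gives me other details from elements elsewhere on the page that have the class mod-content. How can I explicitly get only the q values?

I am using lxml.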

import lxml.html as lh

# resp is the response object fetched earlier (not shown)
doc = lh.fromstring(resp.read())
for div in doc.cssselect('div.mod-content'):
    print(div.text_content())
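First of all: if you are parsing HTML, chances are that a human has messed with it and it will not validate properly. That is the case with the example you posted, for instance (two tags are missing...). Consider handing it to a parser that is specifically designed to accommodate these errors instead.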
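
To illustrate the point about malformed markup, here is a minimal sketch that contrasts lxml's strict XML parser with its HTML parser. The broken snippet is hypothetical (it is not the markup from the question), and lxml.html is used here only as one example of an error-tolerant parser; it is not necessarily the one recommended above.

```python
from lxml import etree
import lxml.html as lh

# Hypothetical snippet: neither the <p> nor the two <div> elements are closed.
broken = '<div class="mod-content"><p>first<div class="mod-content">second'

# The strict XML parser refuses the unbalanced markup.
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# The HTML parser recovers from the errors and still exposes the text.
doc = lh.fromstring(broken)
recovered_text = doc.text_content()

print(strict_ok)
print(recovered_text)
```

The exact tree the HTML parser builds from broken input depends on its recovery rules, so it is worth inspecting the result before relying on a particular structure.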

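That said, if your question is only about how to extract the "textual part of the HTML", or in other words how to convert HTML → plain text [as opposed to "extracting only the text contained in a specific HTML container"], this is a simple working example: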

<div id="peoplemodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">People</h3>
    </div>
    <div class="mod-content">
        <ul class="item-details" id="peopledetails">
            <li class="people-details">
                                <dl>
                    <dt>Assignee:</dt>
                    <dd id="Assign-Val">
                                <a class="user-hover" rel="605794069" id="issue_summary_assignee_605794069" href="--------------"> AAAAAAAAAAAAA a>
                    </dd>
                </dl>
                                                <dl>
                    <dt>Reporter:</dt>
                    <dd id="Report-Val">
                                <a class="user-hover" rel="700843051" id="issue_summary_reporter_700843051" href="-------------------------">BBBBBBBBBBBBBB</a>
                    </dd>
                </dl>
                                <dl><dt>&nbsp;</dt><dd>&nbsp;</dd></dl>
                                <dl>
                    <dt title="Multiple Assignees">Multiple Assignees:</dt>
                    <dd id="customfield_10020-val">    <div class="shorten" id="customfield_10020-field">
                                    <span class="tinylink">        <a class="user-hover" rel="604810609" id="multiuser_cf_604810609" href------------------">FFFFFFFFFFFFFF</a></span>,                                                 <span class="tinylink">        <a class="user-hover" rel="600548483" id="multiuser_cf_600548483" href="------------------------------------">EEEEEEEEEEEEEEEEE</a></span>                        </div>
</dd>
                </dl>
                            </li>
        </ul>
                        <div id="watchers-val">
                                                <a href="----------------------------------------" id="watching-toggle" rel="858270" title="Start watching this story"><span class="icon icon-watch-off"></span><span class="action-text">Watch</span></a>


                            (<span id="watcher-data">1</span>)
                    </div>
            </div>
</div>

HTH!

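What "other details"? The snippet you shared contains only q's. Also, the answer depends heavily on the source of the specific site.

I forgot to mention that this snippet is a small part of the web page, and the mod-content class is also used elsewhere, so when printing, it prints those other values too.

As I said, it depends on the site and on the content you are interested in. You need to make the selector specific enough for that content. For example, if this is the only div you want, you can select it by its id, since that should be unique.

Hi Mac, thanks for your answer. I edited my question. In this case, the xpath identifier text can be modified further, right? To meet the necessary conditions, I need to extract the text from it again. It gives an error; is that because of the structure of the page?

@VinodK - Can you clarify your question? If you are trying to match only certain tags in the document, you can use something like print tree.find('.//h3').text. [In the example provided in my answer, this would return "Description"]... But as Avaris pointed out in the comments, you need to identify the unique characteristics of the document leaves you want to extract...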
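
The suggestions in these comments (select by id rather than by class, and use find for a single tag) can be sketched as follows. The page below is a hypothetical, cut-down stand-in for the real one, and XPath is used instead of cssselect only to keep the sketch free of extra dependencies; the CSS selector div#issue-description should target the same element.

```python
import lxml.html as lh

# Hypothetical page: two containers share the mod-content class,
# but only one of them carries the id issue-description.
page = '''
<html><body>
  <div id="descriptionmodule">
    <h3 class="toggle-title">Description</h3>
    <div id="issue-description" class="mod-content"><p>qqq</p></div>
  </div>
  <div id="peoplemodule">
    <h3 class="toggle-title">People</h3>
    <div class="mod-content"><dl><dt>Assignee:</dt><dd>AAA</dd></dl></div>
  </div>
</body></html>
'''

doc = lh.fromstring(page)

# Selecting by class matches both containers ...
by_class = doc.xpath('//div[@class="mod-content"]')

# ... while selecting by id matches only the description.
by_id = doc.xpath('//div[@id="issue-description"]')
description = by_id[0].text_content().strip()

# find() returns the first match in document order.
first_heading = doc.find('.//h3').text

print(len(by_class), len(by_id))
print(description)
print(first_heading)
```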
from lxml import etree

content = '''<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<ul class="alternate" type="square">
    <li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul></div></div>'''

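# Parse the snippet; etree.fromstring expects well-formed markup
tree = etree.fromstring(content)

# Walk over every text node in the tree, skipping whitespace-only ones
for bit in tree.xpath('//text()'):
    if bit.strip():  # you can insert any kind of test here
        print(bit)

This prints:

Description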
qqqqqqqqqqqqq,

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
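
If only the text of one container is wanted, the same text() approach can be scoped to that element. The sketch below uses a hypothetical, shortened document and a helper named clean that is not part of the original answer. One detail to watch: an XPath starting with // searches the whole document even when it is called on a sub-element, whereas .// stays inside that element.

```python
from lxml import etree

# Hypothetical document, trimmed down from the example above.
content = '''<div id="descriptionmodule">
    <h3>Description</h3>
    <div id="issue-description" class="mod-content">
        <p>first paragraph</p>
        <p>second paragraph</p>
    </div>
</div>'''

tree = etree.fromstring(content)
container = tree.xpath('//div[@id="issue-description"]')[0]


def clean(nodes):
    """Strip each text node and drop the whitespace-only ones."""
    return [n.strip() for n in nodes if n.strip()]


# Absolute path: searches the whole document, even when called on container.
whole_document = clean(container.xpath('//text()'))

# Relative path: searches only inside container.
container_only = clean(container.xpath('.//text()'))

print(whole_document)
print(container_only)
```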