Python: parsing HTML data with lxml


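I'm a beginner at coding, and a friend told me to use BeautifulSoup instead of HTMLParser. After running into some problems, I was advised to use lxml instead of BeautifulSoup because it is supposedly ten times better.

I hope someone can give me a hint on how to scrape the text I'm looking for.

What I want is to find a table containing the following rows and data: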

<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>              
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>              
    <td><a href="spam2.com">spam2</a></td>
</tr>

I use the XPath:

td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()

Result:

[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
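
The expression above can be tried out with a minimal sketch like the following. It assumes the rows from the question are wrapped in a <table> element (the question only shows the rows), and the variable names are illustrative only.

```python
import lxml.html

# Rows copied from the question; the surrounding <table> is an assumption.
html = """
<table>
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam2.com">spam2</a></td>
</tr>
</table>
"""

doc = lxml.html.fromstring(html)

# Left side: href of every link whose text does not contain "spam".
# Right side: text of every cell that holds no link at all.
expr = 'td/a[not(contains(., "spam"))]/@href | td[not(a)]/text()'

rows = [[str(value).strip() for value in tr.xpath(expr)]
        for tr in doc.xpath("//tr")]
print(rows)
```

The union operator returns nodes in document order, which is why the href comes before the two text cells in each row.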
The longer XPath, td[1]/a/@href | td[position()=2 or position()=3]/text(), has the following meaning:
td[1]                                   find the first <td>  
  /a                                    find the <a>
    /@href                              return its href attribute value
|                                       or
td[position()=2 or position()=3]        find the second or third <td>
  /text()                               return its text value
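
To see how the two halves of that expression behave on their own, here is a small sketch that evaluates each part separately against a single row. The one-row table is sample data based on the question, and the names link, cells and both are illustrative.

```python
import lxml.html

# One row from the question, wrapped in a <table> (an assumption).
table = lxml.html.fromstring(
    '<table><tr>'
    '<td><a href="website1.com">website1</a></td>'
    '<td>info1</td>'
    '<td>info2</td>'
    '<td><a href="spam1.com">spam1</a></td>'
    '</tr></table>'
)
row = table.xpath("//tr")[0]

# First half: href attribute of the <a> inside the first <td>.
link = row.xpath("td[1]/a/@href")

# Second half: text of the second and third <td>.
cells = row.xpath("td[position()=2 or position()=3]/text()")

# The union of both halves, returned in document order.
both = row.xpath("td[1]/a/@href | td[position()=2 or position()=3]/text()")

print(link, cells, both)
```

Because the fourth cell is never selected by either half, no extra filter for the unwanted link is needed with this variant.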

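All the rows in the table have the same structure. I'm using Python 2.7.2+. In each table row I only want the first three <td> cells, so the result should be ['url(website1)', 'info1', 'info2'], ['url(website2)', 'info1', 'info2']. Thanks for the reply.

I think it's safe to assume that the actual content won't contain "spam", although only @Trees can really tell us which aspects of the data are consistent.

@Acorn: changed to contains(.,"spam"). The string spam can be replaced by a pattern such as ad.website.com.

You made my day with just a few lines of code. Thanks for the explanation. Actually, all the answers are great. I'm learning XPath so that I can get it with Firebug, but with this approach it's easier to find the first table row and process the data in it. Thanks again, everyone, and merry Christmas :)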

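import lxml.html as lh

# your_html is the HTML string to parse
tree = lh.fromstring(your_html)

result = []
# "tr" is a relative path: it matches rows that are direct children of the parsed root
for row in tree.xpath("tr"):
    # keep only the first three cells of each row
    url, info1, info2 = row.xpath("td")[:3]
    result.append([url.xpath("a")[0].attrib['href'],
                   info1.text_content(),
                   info2.text_content()])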

[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
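
The row-by-row loop above could be made a little more defensive along these lines. This is only a sketch: it assumes the rows sit somewhere inside a full table (hence //tr), and the function name first_three_cells and the sample markup are made up for illustration.

```python
import lxml.html

def first_three_cells(html):
    """Return [href, info1, info2] for every row with at least three <td> cells."""
    tree = lxml.html.fromstring(html)
    result = []
    for row in tree.xpath("//tr"):
        cells = row.xpath("td")
        if len(cells) < 3:
            continue  # skip header rows and rows that are too short
        links = cells[0].xpath("a/@href")
        href = links[0] if links else None  # None when the first cell has no link
        result.append([href,
                       cells[1].text_content().strip(),
                       cells[2].text_content().strip()])
    return result

# Hypothetical sample: one normal row, one row without a link, one header row.
sample = """<table>
<tr><th>header only</th></tr>
<tr><td><a href="website1.com">website1</a></td><td>info1</td><td>info2</td><td><a href="spam1.com">spam1</a></td></tr>
<tr><td>no link</td><td>info1</td><td>info2</td></tr>
</table>"""

rows = first_three_cells(sample)
print(rows)
```

Slicing by position means the unwanted fourth cell is ignored without any text matching, at the cost of depending on the column order staying fixed.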

import lxml.html as LH

doc = LH.fromstring(content)
print([tr.xpath('td[1]/a/@href | td[position()=2 or position()=3]/text()')
       for tr in doc.xpath('//tr')])
