Python: Using BeautifulSoup to extract the text of specific TD table elements?


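I am trying to use the BeautifulSoup library to extract IP addresses from an auto-generated HTML table, but I am having a bit of trouble.

The HTML is structured as follows: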

<html>
<body>
    <table class="mainTable">
    <thead>
        <tr>
            <th>IP</th>
            <th>Country</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="hello.html">127.0.0.1<a></td>
            <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">192.168.0.1<a></td>
            <td><img src="uk.gif" /><a href="us.com">us</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">255.255.255.0<a></td>
            <td><img src="uk.gif" /><a href="br.com">br</a></td>
        </tr>
    </tbody>
</table>

My current attempt produces:

127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br

What I need is the text of the IP element (table.tbody.tr.td.a), not the country element (table.tbody.tr.td.img.a).

Does any experienced BeautifulSoup user have insight into how to do this selection and extraction?

Thanks.

First, search for every row in the tbody and take the text of its first td:
# html should contain the page content
import bs4
[row.find('td').getText() for row in bs4.BeautifulSoup(html, 'html.parser').find('tbody').find_all('tr')]

Or, more readably:
rows = bs4.BeautifulSoup(html, 'html.parser').find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
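
For a version that runs on its own, the sketch below embeds a shortened copy of the table from the question and takes the first td of each row. It assumes the bs4 package is installed and uses Python's built-in html.parser; the names SAMPLE_HTML and first_cell_texts are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question, unclosed <a> tags included.
SAMPLE_HTML = """
<table class="mainTable">
  <thead>
    <tr><th>IP</th><th>Country</th></tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="hello.html">127.0.0.1<a></td>
      <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
    </tr>
    <tr>
      <td><a href="hello.html">192.168.0.1<a></td>
      <td><img src="uk.gif" /><a href="us.com">us</a></td>
    </tr>
    <tr>
      <td><a href="hello.html">255.255.255.0<a></td>
      <td><img src="uk.gif" /><a href="br.com">br</a></td>
    </tr>
  </tbody>
</table>
"""


def first_cell_texts(html):
    """Return the text of the first <td> in every <tbody> row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="mainTable")
    rows = table.find("tbody").find_all("tr")
    # The IP always sits in the first cell, so the country cell is never visited.
    return [row.find("td").get_text(strip=True) for row in rows]


ips = first_cell_texts(SAMPLE_HTML)
print(ips)
```

Selecting by cell position rather than by link avoids having to reason about how the parser repairs the unclosed a tags.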

Filtering the a tags by whether their parent contains an img will give you the right list:
>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]

Finally, dropping the empty a tags left behind by the unclosed markup:
>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a')) if tag.text]
['127.0.0.1', '192.168.0.1', '255.255.255.0']

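You can use a small regular expression to extract the IP addresses. BeautifulSoup combined with regular expressions makes a nice scraping combination.
import re
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)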
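
Note that the pattern above accepts any four groups of one to three digits, including out-of-range values such as 999.1.1.1. If that matters, one possible alternative is to let the standard library's ipaddress module do the validation. This is a minimal sketch, assuming the link texts have already been pulled out of the table; the candidates list and the is_ipv4 helper are hypothetical names used only for illustration.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text is a syntactically valid IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
    except ValueError:
        # AddressValueError is a ValueError subclass, so this covers bad input.
        return False
    return True


# Hypothetical link texts, as they might come out of the table.
candidates = ["127.0.0.1", "uk", "192.168.0.1", "us", "255.255.255.0", "br", "999.1.1.1"]
ips = [text for text in candidates if is_ipv4(text)]
print(ips)
```

In the loop above, the same check could replace ip_pat.match(row.text).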

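First, note that the HTML is malformed: it does not close the a tags, so each IP cell opens two of them. Using lxml, the selection criterion can be expressed more concisely with XPath: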
import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)

The XPath used above has the following meaning:
//table                            select all <table> tags
    [@class="mainTable"]           that have a class="mainTable" attribute
//                                 from these tags select descendants
  td/a                             which are td tags with a child <a> tag
    [not(preceding-sibling::img)]  such that it does not have a preceding sibling <img> tag
    /text()                        return the text of the <a> tag 
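
To try the expression without a data.htm file on disk, the same query can be run against an in-memory string. This is only a sketch: it assumes lxml is installed, SAMPLE_HTML is a shortened, made-up copy of the table from the question, and LH.fromstring stands in for the LH.parse call used above.

```python
import lxml.html as LH

# Shortened copy of the table from the question, unclosed <a> tags included.
SAMPLE_HTML = """<html>
<body>
<table class="mainTable">
  <thead>
    <tr><th>IP</th><th>Country</th></tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="hello.html">127.0.0.1<a></td>
      <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
    </tr>
    <tr>
      <td><a href="hello.html">192.168.0.1<a></td>
      <td><img src="uk.gif" /><a href="us.com">us</a></td>
    </tr>
    <tr>
      <td><a href="hello.html">255.255.255.0<a></td>
      <td><img src="uk.gif" /><a href="br.com">br</a></td>
    </tr>
  </tbody>
</table>
</body>
</html>"""

# fromstring parses markup held in a string; parse() expects a file or URL.
doc = LH.fromstring(SAMPLE_HTML)

# Keep only <a> children of <td> that are not preceded by an <img> sibling.
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)
```

The stray a tag opened at the end of each IP cell carries no text, so it contributes nothing to the result.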

Once you have learned XPath, you may never want to use BeautifulSoup again.

This is a nice approach and a useful solution.
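To see the malformed markup, look at an IP cell, which opens two a tags and closes neither:
<a href="hello.html">127.0.0.1<a>

Depending on the parser, BeautifulSoup may repair this as:
<td>
<a href="hello.html">127.0.0.1</a><a></a>
</td>
...

Every row then contains three a tags, so taking every third one selects the IPs:
for row in table.findAll("a")[::3]:
    print(row.get_text())

This prints:
127.0.0.1
192.168.0.1
255.255.255.0

Alternatively, keep only the a tags that have no previous sibling:
for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())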
