How can I extract all the values of a column from an HTML table with Python regex/BeautifulSoup?

python html regex

Given this code:

<tr><td>PC1</td><td>zz:zz:zz:zz:zz:ce</td><td>10.0.0.244</td><td>23 hours, 55 minutes, 25 seconds</td></tr>
<tr><td>PC2</td><td>zz:zz:zz:zz:zz:cf</td><td>10.0.0.245</td><td>23 hours, 23 minutes, 27 seconds</td></tr>

PC1  zz:zz:zz:zz:zz:ce  10.0.0.244  23 hours, 55 minutes, 25 seconds
PC2  zz:zz:zz:zz:zz:cf  10.0.0.245  23 hours, 23 minutes, 27 seconds

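I would like to get one array of the MAC addresses and another array of the IP addresses.

I thought of a regex like this for the MAC:

(.*){17}

But it also matches the uptime.

Any suggestions?

Thanks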

From the html you provided, you can do the following:

from bs4 import BeautifulSoup

html = """<tr><td>PC1</td><td>zz:zz:zz:zz:zz:ce</td><td>10.0.0.244</td><td>23 hours, 55 minutes, 25 seconds</td></tr>
<tr><td>PC2</td><td>zz:zz:zz:zz:zz:cf</td><td>10.0.0.245</td><td>23 hours, 23 minutes, 27 seconds</td></tr>"""

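# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
mac_ips = []

for tr in soup.find_all('tr'):
    cols = [td.text for td in tr.find_all('td')]
    # Index 1 is the MAC address column, index 2 the IP address column
    mac_ips.append((cols[1], cols[2]))

for mac, ip in mac_ips:
    print('{}  {}'.format(mac, ip))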

i.e. mac_ips holds each row as a matching (MAC, IP) pair (shown here as Python 2 displays it; Python 3 omits the u prefixes):

[(u'zz:zz:zz:zz:zz:ce', u'10.0.0.244'), (u'zz:zz:zz:zz:zz:cf', u'10.0.0.245')]

If you want separate lists, you can do the following:

from bs4 import BeautifulSoup

html = """<tr><td>PC1</td><td>zz:zz:zz:zz:zz:ce</td><td>10.0.0.244</td><td>23 hours, 55 minutes, 25 seconds</td></tr>
<tr><td>PC2</td><td>zz:zz:zz:zz:zz:cf</td><td>10.0.0.245</td><td>23 hours, 23 minutes, 27 seconds</td></tr>"""

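# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
mac = []
ip = []

for tr in soup.find_all('tr'):
    cols = [td.text for td in tr.find_all('td')]
    mac.append(cols[1])  # second column: MAC address
    ip.append(cols[2])   # third column: IP address

print(mac)
print(ip)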
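If a regex is still wanted, the reason the original pattern also matches the uptime is that (.*) accepts any characters at all. A tighter pattern has to describe the shape of the value. The following is only a minimal sketch using the standard re module on the raw markup; the names MAC_RE and IP_RE are made up for illustration, and the patterns assume each value fills a whole <td> cell with no attributes or extra whitespace.

```python
import re

html = """<tr><td>PC1</td><td>zz:zz:zz:zz:zz:ce</td><td>10.0.0.244</td><td>23 hours, 55 minutes, 25 seconds</td></tr>
<tr><td>PC2</td><td>zz:zz:zz:zz:zz:cf</td><td>10.0.0.245</td><td>23 hours, 23 minutes, 27 seconds</td></tr>"""

# Assumption: a MAC cell is six colon-separated pairs of word characters.
# \w is used instead of hex digits because the sample is anonymised with "zz".
MAC_RE = re.compile(r"<td>(\w{2}(?::\w{2}){5})</td>")

# Assumption: an IP cell is four dot-separated groups of 1 to 3 digits.
# No range check (0-255) is attempted here.
IP_RE = re.compile(r"<td>(\d{1,3}(?:\.\d{1,3}){3})</td>")

macs = MAC_RE.findall(html)
ips = IP_RE.findall(html)

print(macs)
print(ips)
```

Because the patterns are anchored to the surrounding <td> tags, the host name and uptime cells no longer match. This breaks as soon as the markup changes (attributes on the cells, line breaks inside them), which is why the parser-based approach above is usually the safer choice.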

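Note: if you are parsing more html, you may also need to find the enclosing table first.

Since you already know the MAC address is in the second column, use lxml with an XPath query (faster than Beautiful Soup). You don't need regex.

Possible duplicate

Please add some comments about your solution explaining why and how it solves the problem.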
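The lxml suggestion from the comments could look roughly like the sketch below. It assumes the third-party lxml package is installed and that the rows sit inside a table element; the <table> wrapper and the variable name page are added here purely for illustration and are not part of the original snippet.

```python
from lxml import html as lxml_html

# Assumption: the rows are wrapped in a <table>; the wrapper is added for this example.
page = """<table>
<tr><td>PC1</td><td>zz:zz:zz:zz:zz:ce</td><td>10.0.0.244</td><td>23 hours, 55 minutes, 25 seconds</td></tr>
<tr><td>PC2</td><td>zz:zz:zz:zz:zz:cf</td><td>10.0.0.245</td><td>23 hours, 23 minutes, 27 seconds</td></tr>
</table>"""

tree = lxml_html.fromstring(page)

# XPath positions are 1-based: td[2] is the MAC column, td[3] the IP column.
macs = tree.xpath('//tr/td[2]/text()')
ips = tree.xpath('//tr/td[3]/text()')

print(macs)
print(ips)
```

Selecting by column position keeps the query independent of what the values look like, so the uptime column cannot be picked up by accident.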

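import sys

# Assumes soup is a BeautifulSoup object built as in the earlier snippets.
# soup.find() returns None when nothing matches instead of raising,
# so the result is checked explicitly.
table = soup.find('table')
if table is None:
    sys.exit('No tables found, exiting')

# Get rows; find_all() returns an empty list when there are none
rows = table.find_all('tr')
if not rows:
    sys.exit('No table rows found, exiting')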
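The table and row checks can be combined with the column extraction in one small function. This is only a sketch under stated assumptions: the function name column_values is hypothetical, the sample rows are wrapped in a <table> for the example, and rows with too few cells (header rows, for instance) are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup


def column_values(html, index):
    """Return the text of the <td> at position `index` (0-based) in every row."""
    soup = BeautifulSoup(html, 'html.parser')

    table = soup.find('table')
    if table is None:
        raise ValueError('no table found')

    values = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        # Skip rows that do not have enough cells, such as header rows
        if index < len(cells):
            values.append(cells[index].get_text(strip=True))
    return values


# Assumption: the <table> wrapper is added here for illustration.
sample = """<table>
<tr><td>PC1</td><td>zz:zz:zz:zz:zz:ce</td><td>10.0.0.244</td><td>23 hours, 55 minutes, 25 seconds</td></tr>
<tr><td>PC2</td><td>zz:zz:zz:zz:zz:cf</td><td>10.0.0.245</td><td>23 hours, 23 minutes, 27 seconds</td></tr>
</table>"""

macs = column_values(sample, 1)
ips = column_values(sample, 2)
print(macs)
print(ips)
```

Raising an exception instead of printing and exiting leaves the decision about how to handle a missing table to the caller.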