How to parse an HTML table in Python (tags: python, beautifulsoup)

I'm new to parsing tables and to regular expressions. Could you help me parse this in Python?

<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>

I need 3text and 6text.

You can select the rows with a CSS selector and then call select_one on each row to get 3text and 6text, like this:

from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'lxml')
soup1 = soup.select('tr')

for i in soup1:
    print(i.select_one('td:nth-child(2)').text)

Result:

3text 
6text

Use pandas.
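
No code accompanies the pandas suggestion, so here is a minimal sketch of what it might look like. It assumes pandas is installed along with a parser backend it can use (lxml, or bs4 with html5lib); the variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

html_doc = '''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''

# read_html returns a list of DataFrames, one per <table> it finds
tables = pd.read_html(StringIO(html_doc))
df = tables[0]

# The table has no header row, so the columns are labelled 0, 1, ...
# Take the second column and strip any surrounding whitespace
second_column = df[1].astype(str).str.strip().tolist()
print(second_column)
```

This is convenient when you want the whole table as a DataFrame anyway; for a single column, the BeautifulSoup answers in this thread are lighter.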

Try this:

from bs4 import BeautifulSoup

html="""
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>"""

soup = BeautifulSoup(html, 'html.parser')

for tr_soup in soup.find_all('tr'):
    td_soup = tr_soup.find_all('td')
    print(td_soup[1].text.strip())

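You can use Python's html.parser:

This is a custom parser class that tracks the current parse state. Because you need the second cell of each row, the cell counter is reset to -1 whenever a row starts and incremented for each cell, so the second cell has index 1.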

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_index = -1

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1
        if tag == 'td':
            self.in_cell = True
            self.cell_index += 1
        # print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        # print("Encountered an end tag :", tag)

    def handle_data(self, data):
        if self.in_cell and self.cell_index == 1:
            print(data.strip())

parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>''')
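
If you would rather collect the values than print them, the same state-tracking idea can be sketched as follows. This is a variant under my own assumptions, not part of the answer above: the class name SecondCellParser and the attributes current and values are made up for illustration.

```python
from html.parser import HTMLParser


class SecondCellParser(HTMLParser):
    """Collects the text of the second <td> in every <tr>."""

    def __init__(self):
        super().__init__()      # character references are converted by default
        self.in_cell = False
        self.cell_index = -1
        self.current = []       # text fragments of the cell being read
        self.values = []        # one entry per row that has a second cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1
        elif tag == 'td':
            self.in_cell = True
            self.cell_index += 1
            self.current = []

    def handle_endtag(self, tag):
        if tag == 'td':
            if self.cell_index == 1:
                # join the fragments and drop the trailing non-breaking space
                self.values.append(''.join(self.current).strip())
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and self.cell_index == 1:
            self.current.append(data)


parser = SecondCellParser()
parser.feed('''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>''')
parser.close()
print(parser.values)
```

Buffering the fragments until the closing tag keeps a cell together even if the parser delivers its text in several pieces.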

The best way is to use BeautifulSoup:

from bs4 import BeautifulSoup

html_doc='''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''


soup = BeautifulSoup(html_doc, "html.parser")

# finds all tr tags
for i in soup.find_all("tr"):
    # finds all td tags in tr tags
    for k in i.find_all("td"):
        # prints all td tags with a text format
        print(k.text)

However, you can pick out just the text you want by index. In this case, take the second cell of each row:

# finds all tr tags
for i in soup.find_all("tr"):
    # finds all td tags in tr tags
    print(i.find_all("td")[1].text)
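
Indexing with [1] raises an IndexError on any row that has fewer than two cells, and the printed text keeps the trailing non-breaking space. A slightly more defensive sketch, assuming bs4 is installed (the names second_cells, row and cells are illustrative):

```python
from bs4 import BeautifulSoup

html_doc = '''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

second_cells = []
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 1:  # skip rows that have no second cell
        # get_text(strip=True) also removes the trailing non-breaking space
        second_cells.append(cells[1].get_text(strip=True))

print(second_cells)
```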

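Since your question is tagged beautifulsoup, I assume you are happy to solve the problem with that module. My solution also uses the built-in unicodedata module to normalize any escaped characters present in the HTML, such as &nbsp;.

To parse the table so that you can access the second field of each row, as the question asks, see the code and comments below.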

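from bs4 import BeautifulSoup
import unicodedata

table = '''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>'''

soup = BeautifulSoup(table, 'html.parser')  # parse the HTML table
tableData = soup.find_all('td')  # get a list of all <td> tags in the table

# store the normalized content of every second <td> tag in a list
# (normalizing handles the unicode characters, here the non-breaking spaces)
output = [unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0]

print('\n'.join(output))  # print each value on its own line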

> python -u "html_parser_test.py"
3text
6text
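
To see what the normalization step does to &nbsp; on its own, here is a small standard-library sketch. The variable names are illustrative; html.unescape stands in for the entity decoding that the HTML parser performs.

```python
import html
import unicodedata

# Decoding the entity yields a non-breaking space (U+00A0), not a plain space
raw = html.unescape('3text&nbsp;')

# NFKC replaces U+00A0 with an ordinary space (U+0020)
normalized = unicodedata.normalize('NFKC', raw)

# strip() then removes it; strip() would also remove U+00A0 directly
cleaned = normalized.strip()

print(repr(raw))
print(repr(normalized))
print(repr(cleaned))
```

Normalizing matters mostly for non-breaking spaces inside a cell, such as between 1text and 2text, which strip() alone leaves untouched.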

Output of the nested loop in the BeautifulSoup answer above (it prints every cell):

1text 2text
3text 
4text 5text
6text