How to parse an HTML table in Python

I'm new to parsing tables and to regular expressions. Could you help me parse this in Python?

<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
I need 3text and 6text.

You can use the CSS selector methods select and select_one to get 3text and 6text, like this:

from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''

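# The 'lxml' parser needs the third-party lxml package; 'html.parser' also works here.
soup = BeautifulSoup(html_doc, 'lxml')
rows = soup.select('tr')  # every table row

for row in rows:
    # td:nth-child(2) matches the second cell of the row
    print(row.select_one('td:nth-child(2)').text)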
Result:

3text 
6text 
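
The same two values can also be collected with one selector instead of a loop over rows. This is a minimal sketch, assuming bs4 is installed; it uses the built-in html.parser so lxml is not required, and the name second_cells is only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = '''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# 'tr > td:nth-child(2)' matches the second cell of every row in one pass.
# get_text(strip=True) also trims the trailing non-breaking space (&nbsp;).
second_cells = [td.get_text(strip=True) for td in soup.select('tr > td:nth-child(2)')]
print(second_cells)
```

Collecting into a list rather than printing inside the loop makes the values easier to reuse later.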
Using pandas
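
The pandas code for this answer did not survive on this page, so what follows is only a guess at the idea, not the original answer's code. It is a rough sketch that assumes pandas and an HTML backend such as lxml are installed; the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

html_doc = '''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''

# read_html returns one DataFrame per <table> found in the input.
# This table has no header row, so the columns are labelled 0 and 1.
tables = pd.read_html(StringIO(html_doc))
df = tables[0]

# Column 1 holds the second cell of every row; strip any leftover whitespace.
second_column = df[1].astype(str).str.strip().tolist()
print(second_column)
```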

Try this:

from bs4 import BeautifulSoup

html="""
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>"""

soup = BeautifulSoup(html, 'html.parser')

for tr_soup in soup.find_all('tr'):
    td_soup = tr_soup.find_all('td')
    print(td_soup[1].text.strip())

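You can use Python's html.parser:

This is a custom parser class that tracks the current parse state. Since you need the second cell of each row, the cell counter is reset at the start of every row and incremented for each cell.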

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_index = -1

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1
        if tag == 'td':
            self.in_cell = True
            self.cell_index += 1
        # print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        # print("Encountered an end tag :", tag)

    def handle_data(self, data):
        if self.in_cell and self.cell_index == 1:
            print(data.strip())

parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>''')
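
If you would rather get the values back than print them from inside the parser, the same state machine can append to a list. This is a sketch using only the standard library; the class name SecondCellParser and the cells attribute are made up for illustration.

```python
from html.parser import HTMLParser


class SecondCellParser(HTMLParser):
    """Collects the text of the second <td> in every <tr>."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as a no-break space
        super().__init__()
        self.in_cell = False
        self.cell_index = -1
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1          # new row: restart the cell counter
        elif tag == 'td':
            self.in_cell = True
            self.cell_index += 1
            if self.cell_index == 1:
                self.cells.append('')     # start collecting the second cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and self.cell_index == 1:
            self.cells[-1] += data        # a cell's text may arrive in pieces


parser = SecondCellParser()
parser.feed('''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>''')
parser.close()

result = [cell.strip() for cell in parser.cells]
print(result)
```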

The best way is to use BeautifulSoup:

from bs4 import BeautifulSoup

html_doc='''
<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>
'''


soup = BeautifulSoup(html_doc, "html.parser")

# finds all tr tags
for i in soup.find_all("tr"):
    # finds all td tags in tr tags
    for k in i.find_all("td"):
        # prints all td tags with a text format
        print(k.text)
But you can get just the text you want by index. In this case, you can use:

# finds all tr tags
for i in soup.find_all("tr"):
    # finds all td tags in tr tags
    print(i.find_all("td")[1].text)
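
Indexing with [1] raises IndexError if any row has fewer than two cells. One way to guard against that is sketched below; the helper name column_text and the sample table with a short row are hypothetical and are not part of the question.

```python
from bs4 import BeautifulSoup


def column_text(html, index):
    """Return the stripped text of the cell at `index` in every row that has one."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if index < len(cells):  # skip rows that are too short
            values.append(cells[index].get_text(strip=True))
    return values


# Hypothetical sample: the middle row has only one cell.
sample = '''<table>
<tr><td>1text&nbsp;2text</td><td>3text&nbsp;</td></tr>
<tr><td>only one cell</td></tr>
<tr><td>4text&nbsp;5text</td><td>6text&nbsp;</td></tr>
</table>'''

print(column_text(sample, 1))
```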

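Since your question is tagged beautifulsoup, I assume you're happy to use that module to solve your problem. My solution also uses the built-in unicodedata module to normalize any escaped characters in the HTML, such as &nbsp;.

To parse the table so that you can access the second field of each row, as in your question, see the code and comments below.

from bs4 import BeautifulSoup
import unicodedata

table = '''<table callspacing="0" cellpadding="0">
    <tbody><tr>
    <td>1text&nbsp;2text</td>
    <td>3text&nbsp;</td>
    </tr>
    <tr>
    <td>4text&nbsp;5text</td>
    <td>6text&nbsp;</td>
    </tr>
</tbody></table>'''

soup = BeautifulSoup(table, 'html.parser')  # parse the HTML table
tableData = soup.find_all('td')  # get a list of all <td> tags in the table
# store the normalized content (this resolves Unicode characters, which here affects the spaces)
# of every second <td> tag in a list
output = [unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0]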
> python -u "html_parser_test.py"
3text
6text
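
To see what the normalization step does on its own: an HTML parser turns &nbsp; into the no-break space U+00A0, and NFKC normalization maps that to an ordinary space. A small standard-library sketch, with illustrative variable names:

```python
import unicodedata

raw = "1text\xa02text"  # what a parser yields for '1text&nbsp;2text'
normalized = unicodedata.normalize("NFKC", raw)

print(repr(raw))         # still contains the no-break space
print(repr(normalized))  # no-break space replaced by a plain space
```

Note that normalization only converts the character. It does not remove a trailing space, so call strip() as well if you need that.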
For reference, the nested loop in the find_all answer, which prints every td, produces:

1text 2text
3text 
4text 5text
6text 

The indexed version, find_all("td")[1], prints only 3text and 6text.