Parsing an HTML table with Python

First, here is all of my current code:

import urllib
from BeautifulSoup import BeautifulSoup
import sgmllib
import re

page = 'http://www.sec.gov/Archives/edgar/data/\
8177/000114036111018563/form10k.htm'

sock = urllib.urlopen(page)
raw = sock.read()
soup = BeautifulSoup(raw)

tablelist = soup.findAll('table')

class MyParser(sgmllib.SGMLParser):

    def parse(self, segment):
        self.feed(segment)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0

    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1

    def end_td(self):
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_descriptions(self):
        return self.descriptions

counter = 0
trlist = []
dtablelist = []

while counter < len(tablelist):
    trsegment = tablelist[counter].findAll('tr')
    trlist.append(trsegment)
    strsegment = str(trsegment)
    myparser = MyParser()
    myparser.parse(strsegment)
    sub = myparser.get_descriptions()
    dtablelist.append(sub)
    counter = counter + 1

ex = []

dtablelist = [s for s in dtablelist if s != ex]
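As you can see, the output holds the content of each cell as its own string, rather than a list of cell contents for each table row (<tr>). So essentially I want this output: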

[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']]
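Is this because I have to convert trlist to a string before I can parse it with MyParser? Does anyone know how to get around this, so that I can parse a list within a list (in other words, the original nested value)?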
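
To illustrate the suspicion in the question: a parser that only tracks <td> has no way of knowing where one row ends and the next begins, so feeding it all rows as one string gives a flat list. Feeding it one row at a time keeps the grouping. The following is a minimal Python 3 sketch of that idea, not the code from any answer here. It uses html.parser because sgmllib does not exist in Python 3, and CellCollector, cells_of and the two sample rows are hypothetical names and data made up for the illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> in the fragment it is fed (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Text inside nested tags (<font>, <div>, ...) still belongs to the open cell.
        if self.in_td:
            self.cells[-1] += data


def cells_of(fragment):
    """Return the cell texts found in one HTML fragment."""
    parser = CellCollector()
    parser.feed(fragment)
    parser.close()
    return parser.cells


# Made-up sample rows, loosely modelled on the table in the question.
rows = [
    "<tr><td>Title of each class</td><td>Name of exchange</td></tr>",
    "<tr><td><font>Common Stock, par value</font></td><td>NASDAQ Global Market</td></tr>",
]

flat = cells_of("".join(rows))             # all rows in one string: row boundaries are lost
nested = [cells_of(row) for row in rows]   # one parse per row: one list per <tr>

print(flat)
print(nested)
```

In terms of the code in the question, this would correspond to looping over the rows in trsegment and parsing str(row) for each one, instead of parsing str(trsegment) once.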

Use:

Hope this helps, cheers.

In case anyone is looking for a solution to the same problem but is using Python 3:

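Even with Python 3 you do not need an external library to parse HTML tables. There, the SGMLParser class has been replaced by HTMLParser in html.parser. I have written the code for a simple class derived from HTMLParser. It only remembers the current scope of the table-related tags. Compared with etree, its advantages are that it works correctly on HTML that is not valid XML and that it needs no external library.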
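
As a rough idea of what such a class could look like, here is a minimal sketch. It is not the answerer's actual class: SketchTableParser is a hypothetical name, the sample HTML is made up, and the sketch assumes that tables are not nested and that closing tags are present. Stripping whitespace from each cell is also a choice made here for readability.

```python
from html.parser import HTMLParser


class SketchTableParser(HTMLParser):
    """Collect tables as lists of rows, and rows as lists of cell strings.

    Hypothetical sketch: assumes no nested tables and explicit closing tags.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently open, if any
        self._row = None    # cells of the row currently open, if any
        self._cell = None   # text chunks of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


# Made-up sample input.
html = (
    "<table>"
    "<tr><th>Land</th><th>Code</th></tr>"
    "<tr><td><b>Kanada</b></td><td> 21212 </td></tr>"
    "</table>"
)

parser = SketchTableParser()
parser.feed(html)
parser.close()
print(parser.tables)
```

Because the parser tracks the table, row and cell scopes separately, each row ends up as its own list, which is the nesting the question asks for.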

You can use the class (named HTMLTableParser here) as follows:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)
Its output is a list of 2D lists, one per table. It might look like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]

For reference, this is what print trlist[1] shows for the code in the question:

print trlist[1]
[<tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">&#160;</font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div>
</td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">&#160;</font></td>
</tr>, <tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">&#160;</font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font>    </font></div>
</td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div>
</div>
</td>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">&#160;</font></td>
</tr>,...

Why use two different parsers instead of handling the whole thing with BeautifulSoup? (And why import BeautifulSoup twice?)

Importing BeautifulSoup twice was a mistake. As for sgmllib, I parse with it because when I do trsegment = tablelist[counter].findAll('tr'), the result is a list rather than a Tag or BeautifulSoup object.