Python 当<；tr>；有行距_Python_Html_Pandas_Beautifulsoup

Python 当<；tr>；有行距

python html pandas

Python 当<；tr>；有行距,python,html,pandas,beautifulsoup,Python,Html,Pandas,Beautifulsoup,如果行具有rowspan元素，则如何使该行与wikipedia页面中的表相对应 from bs4 import BeautifulSoup import urllib2 from lxml.html import fromstring import re import csv import pandas as pd wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records" header = {'Use

如果行具有rowspan元素，则如何使该行与wikipedia页面中的表相对应

from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring 
import re
import csv
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

try:
    table = soup.find_all('table')[6]
except AttributeError as e:
    print 'No tables found, exiting'

try:
    first = table.find_all('tr')[0]
except AttributeError as e:
    print 'No table row found, exiting'

try:
    allRows = table.find_all('tr')[1:-1]
except AttributeError as e:
    print 'No table row found, exiting'


headers = [header.get_text() for header in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]


df = pd.DataFrame(data=results, columns=headers)
df

我得到这个表作为输出。。但对于行包含rowspan的表，我得到如下表-

如您所知，由于以下情况导致的问题

html内容：

<tr>
     <td rowspan="2">2=</td>
     <td>West Indies</td>
     <td>4</td>
     <td>Lord's</td>
     <td>2009</td>
</tr>
<tr>
     <td style="text-align:left;">India</td>
     <td>4</td>
     <td>Mumbai</td>
      <td>2012</td>
</tr>

输出：

  Rank       Opponent No. wins Most recent venue Season
0    1   South Africa        6            Lord's   1951
1   2=    West Indies        4            Lord's   2009
2   2=          India        4            Mumbai   2012
3    4      Australia        3            Sydney   1932
4    5       Pakistan        2      Trent Bridge   1967
5    6      Sri Lanka        1      Old Trafford   2002

one  two         three
one  February    February

表10的工作也是如此

  Rank Hundreds            Player Matches Innings Average
0    1       25     Alastair Cook     107     191   45.61
1    2       23   Kevin Pietersen     104     181   47.28
2    3       22     Colin Cowdrey     114     188   44.07
3    3       22     Wally Hammond      85     140   58.46
4    3       22  Geoffrey Boycott     108     193   47.72
5    6       21    Andrew Strauss     100     178   40.91
6    6       21          Ian Bell     103     178   45.30
7   8=       20    Ken Barrington      82     131   58.67
8   8=       20      Graham Gooch     118     215   42.58
9   10       19        Len Hutton      79     138   56.67

输入：

python代码：

# !/bin/python3
# coding: utf-8
from bs4 import BeautifulSoup


class Element(object):
    def __init__(self, row, col, text, rowspan=1, colspan=1):
        self.row = row
        self.col = col
        self.text = text
        self.rowspan = rowspan
        self.colspan = colspan

    def __repr__(self):
        return f'''{{"row": {self.row}, "col":  {self.col}, "text": {self.text}, "rowspan": {self.rowspan}, "colspan": {self.colspan}}}'''

    def isRowspan(self):
        return self.rowspan > 1

    def isColspan(self):
        return self.colspan > 1


def parse(h) -> [[]]:
    doc = BeautifulSoup(h, 'html.parser')

    trs = doc.select('tr')

    m = []

    for row, tr in enumerate(trs):  # collect Node, rowspan node, colspan node
        it = []
        ts = tr.find_all(['th', 'td'])
        for col, tx in enumerate(ts):
            element = Element(row, col, tx.text.strip())
            if tx.has_attr('rowspan'):
                element.rowspan = int(tx['rowspan'])
            if tx.has_attr('colspan'):
                element.colspan = int(tx['colspan'])
            it.append(element)
        m.append(it)

    def solveColspan(ele):
        row, col, text, rowspan, colspan = ele.row, ele.col, ele.text, ele.rowspan, ele.colspan
        m[row].insert(col + 1, Element(row, col, text, rowspan, colspan - 1))
        for column in range(col + 1, len(m[row])):
            m[row][column].col += 1

    def solveRowspan(ele):
        row, col, text, rowspan, colspan = ele.row, ele.col, ele.text, ele.rowspan, ele.colspan
        offset = row + 1
        m[offset].insert(col, Element(offset, col, text, rowspan - 1, 1))
        for column in range(col + 1, len(m[offset])):
            m[offset][column].col += 1

    for row in m:
        for ele in row:
            if ele.isColspan():
                solveColspan(ele)
            if ele.isRowspan():
                solveRowspan(ele)
    return m


def prettyPrint(m):
    for i in m:
        it = [f'{len(i)}']
        for index, j in enumerate(i):
            if j.text != '':
                it.append(f'{index:2} {j.text[:4]:4}')
        print(' --- '.join(it))


with open('./index.html', 'rb') as f:
    index = f.read()
html = index.decode('utf-8')
matrix = parse(html)
prettyPrint(matrix)

在stackoverflow或web上找到的解析器都不适合我——它们都错误地解析了来自Wikipedia的我的表。这是一个真正有效且简单的解析器。干杯

定义解析器函数：

def pre_process_table(table):
    """
    INPUT:
        1. table - a bs4 element that contains the desired table: ie <table> ... </table>
    OUTPUT:
        a tuple of: 
            1. rows - a list of table rows ie: list of <tr>...</tr> elements
            2. num_rows - number of rows in the table
            3. num_cols - number of columns in the table
    Options:
        include_td_head_count - whether to use only th or th and td to count number of columns (default: False)
    """
    rows = [x for x in table.find_all('tr')]

    num_rows = len(rows)

    # get an initial column count. Most often, this will be accurate
    num_cols = max([len(x.find_all(['th','td'])) for x in rows])

    # sometimes, the tables also contain multi-colspan headers. This accounts for that:
    header_rows_set = [x.find_all(['th', 'td']) for x in rows if len(x.find_all(['th', 'td']))>num_cols/2]

    num_cols_set = []

    for header_rows in header_rows_set:
        num_cols = 0
        for cell in header_rows:
            row_span, col_span = get_spans(cell)
            num_cols+=len([cell.getText()]*col_span)

        num_cols_set.append(num_cols)

    num_cols = max(num_cols_set)

    return (rows, num_rows, num_cols)


def get_spans(cell):
        """
        INPUT:
            1. cell - a <td>...</td> or <th>...</th> element that contains a table cell entry
        OUTPUT:
            1. a tuple with the cell's row and col spans
        """
        if cell.has_attr('rowspan'):
            rep_row = int(cell.attrs['rowspan'])
        else: # ~cell.has_attr('rowspan'):
            rep_row = 1
        if cell.has_attr('colspan'):
            rep_col = int(cell.attrs['colspan'])
        else: # ~cell.has_attr('colspan'):
            rep_col = 1 

        return (rep_row, rep_col)

def process_rows(rows, num_rows, num_cols):
    """
    INPUT:
        1. rows - a list of table rows ie <tr>...</tr> elements
    OUTPUT:
        1. data - a Pandas dataframe with the html data in it
    """
    data = pd.DataFrame(np.ones((num_rows, num_cols))*np.nan)
    for i, row in enumerate(rows):
        try:
            col_stat = data.iloc[i,:][data.iloc[i,:].isnull()].index[0]
        except IndexError:
            print(i, row)

        for j, cell in enumerate(row.find_all(['td', 'th'])):
            rep_row, rep_col = get_spans(cell)

            #print("cols {0} to {1} with rep_col={2}".format(col_stat, col_stat+rep_col, rep_col))
            #print("\trows {0} to {1} with rep_row={2}".format(i, i+rep_row, rep_row))

            #find first non-na col and fill that one
            while any(data.iloc[i,col_stat:col_stat+rep_col].notnull()):
                col_stat+=1

            data.iloc[i:i+rep_row,col_stat:col_stat+rep_col] = cell.getText()
            if col_stat<data.shape[1]-1:
                col_stat+=rep_col

    return data

def main(table):
    rows, num_rows, num_cols = pre_process_table(table)
    df = process_rows(rows, num_rows, num_cols)
    return(df)

我上面的解析器将准确地解析表，例如，表，而所有其他的解析器都无法在多个点重新创建表

在简单情况下-更简单的解决方案如果是一个具有

rowspan

属性的格式非常好的表，那么上述问题可能有一个更简单的解决方案

Pandas

有一个相当强大的

read_html

函数，可以解析提供的

html

表，并且似乎可以很好地处理

rowspan

（无法解析威斯康辛州的东西）

fillna（method='ffill'）

然后可以填充未填充的行。请注意，这不一定适用于跨列空间。还要注意的是，清理后将是必要的

考虑html代码：

    s = """<table width="100%" border="1">
    <tr>
        <td rowspan="1">one</td>
        <td rowspan="2">two</td>
        <td rowspan="3">three</td>
    </tr>
    <tr><td>"4"</td></tr>
    <tr>
        <td>"55"</td>
        <td>"99"</td>
    </tr>
    </table>
    """

然后填充NAs

In [30]: df.fillna(method='ffill')
Out[30]:
      0     1      2
0   one   two  three
1   "4"   two  three
2  "55"  "99"  three

pandas>=0.24.0理解

colspan

和

rowspan

属性，如中所述。要提取之前给您带来问题的wikipage表，以下工作可以完成

import pandas as pd


# Extract all tables from the wikipage
dfs = pd.read_html("http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records")
# The table referenced above is the 7th on the wikipage
df = dfs[6]
# The last row is just the date of the last update
df = df.iloc[:-1]

输出：

Rank获胜对手最近的比赛日期
01 6南非勋爵，英国伦敦，1951年6月21日
1=2 4印度万凯德体育场，印度孟买，2012年11月23日
2=2 4西印度群岛勋爵酒店，英国伦敦，2009年5月6日
澳大利亚悉尼板球场，澳大利亚悉尼1932年12月2日
4 5 2巴基斯坦特伦特桥，英格兰诺丁汉，1967年8月10日
5 6 1斯里兰卡老特拉福德板球场，英格兰曼彻斯特，2002年6月13日

您可能应该解释代码的功能，以及它如何帮助回答问题（例如，它与OP中的代码有何区别）。pandas=>0.24.0正确解释

rowspan

和

colspan

属性。我需要一个完整的答案。

    s = """<table width="100%" border="1">
    <tr>
        <td rowspan="1">one</td>
        <td rowspan="2">two</td>
        <td rowspan="3">three</td>
    </tr>
    <tr><td>"4"</td></tr>
    <tr>
        <td>"55"</td>
        <td>"99"</td>
    </tr>
    </table>
    """

In [16]: df = pd.read_html(s)[0]

In [29]: df
Out[29]:
      0     1      2
0   one   two  three
1   "4"   NaN    NaN
2  "55"  "99"    NaN

In [30]: df.fillna(method='ffill')
Out[30]:
      0     1      2
0   one   two  three
1   "4"   two  three
2  "55"  "99"  three

import pandas as pd


# Extract all tables from the wikipage
dfs = pd.read_html("http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records")
# The table referenced above is the 7th on the wikipage
df = dfs[6]
# The last row is just the date of the last update
df = df.iloc[:-1]