Python 3.x: scraping an HTML table with BeautifulSoup into pandas
I'm trying to scrape an HTML table with BeautifulSoup and import it into pandas -- the "Team Batting" table. Finding the table is no problem:
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
Finding the data rows is also no problem:
for i in table.findAll('tr')[2]: # increase to 3 to get next row in table...
    print(i.get_text())
I can even find the header names:
table_head = table.find('thead')
for i in table_head.findAll('th'):
    print(i.get_text())
Now I'm having trouble putting everything together into a data frame. Here is what I have so far:
from collections import OrderedDict

header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

row = []
for tr in table.findAll('tr')[2]:
    value = tr.get_text()
    row.append(value)

od = OrderedDict(zip(header, row))
df = pd.DataFrame(od, index=[0])
This only works one row at a time, though. My question is: how do I do this for every row in the table at once?

I tested the following and it works for your purposes. Basically, you need to create a list, loop over the players to fill it, and use that list to populate a DataFrame. Building the DataFrame row by row is not recommended, as it can slow things down considerably:
import collections as co
import pandas as pd
from bs4 import BeautifulSoup
with open('team_batting.html','r') as fin:
    soup = BeautifulSoup(fin.read(),'lxml')
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
table_head = table.find('thead')
header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

# loop over table to find number of rows with '' in first column
endrows = 0
for tr in table.findAll('tr'):
    if tr.findAll('th')[0].get_text() == '':
        endrows += 1

rows = len(table.findAll('tr'))
rows -= endrows + 1 # there is a pernicious final row that begins with 'Rk'

list_of_dicts = []
for row in range(rows):
    the_row = []
    try:
        table_row = table.findAll('tr')[row]
        for tr in table_row:
            value = tr.get_text()
            the_row.append(value)
        od = co.OrderedDict(zip(header, the_row))
        list_of_dicts.append(od)
    except AttributeError:
        continue

df = pd.DataFrame(list_of_dicts)
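The core pattern above -- collect one dict per row, then build the DataFrame once at the end -- can be sketched on a small self-contained table. The HTML below is invented for illustration, and html.parser is used so no extra parser library is needed:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Minimal stand-in for the scraped page (hypothetical data).
html = """
<table>
  <thead><tr><th>Rk</th><th>Name</th><th>HR</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Player A</td><td>12</td></tr>
    <tr><th>2</th><td>Player B</td><td>7</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
header = [th.get_text() for th in soup.find('thead').findAll('th')]

list_of_dicts = []
for tr in soup.find('tbody').findAll('tr'):
    # Each row's cells (th for the rank column, td for the rest).
    cells = [cell.get_text() for cell in tr.findAll(['th', 'td'])]
    list_of_dicts.append(dict(zip(header, cells)))

# One DataFrame constructor call instead of one append/concat per row.
df = pd.DataFrame(list_of_dicts)
```

Since Python 3.7, plain dicts preserve insertion order, so collections.OrderedDict is no longer required to keep the columns in table order.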
This solution uses only pandas, but it cheats a little by knowing in advance that the Team Batting table is the tenth table on the page. With that knowledge, the following uses pandas' read_html function and grabs the tenth DataFrame from the returned list of DataFrame objects. What remains after that is just some data cleaning:
import pandas as pd
url = 'http://www.baseball-reference.com/teams/NYM/2017.shtml'
# Take 10th dataframe
team_batting = pd.read_html(url)[9]
# Take columns whose names don't contain "Unnamed"
team_batting.drop([x for x in team_batting.columns if 'Unnamed' in x], axis=1, inplace=True)
# Remove the rows that are just a copy of the headers/columns
team_batting = team_batting.loc[team_batting.apply(lambda x: x != team_batting.columns, axis=1).all(axis=1), :]
# Take out the Totals rows
team_batting = team_batting.loc[~team_batting.Rk.isnull(), :]
# Get a glimpse of the data
print(team_batting.head(5))
# Rk Pos Name Age G PA AB R H 2B ... OBP SLG OPS OPS+ TB GDP HBP SH SF IBB
# 0 1 C Travis d'Arnaud 28 12 42 37 6 10 2 ... .357 .541 .898 144 20 1 1 0 0 1
# 1 2 1B Lucas Duda* 31 13 50 42 4 10 2 ... .360 .571 .931 153 24 1 0 0 0 2
# 2 3 2B Neil Walker# 31 14 62 54 5 12 3 ... .306 .278 .584 64 15 2 0 0 1 0
# 3 4 SS Asdrubal Cabrera# 31 15 67 63 10 17 2 ... .313 .397 .710 96 25 0 0 0 0 0
# 4 5 3B Jose Reyes# 34 15 59 53 3 5 2 ... .186 .132 .319 -9 7 0 0 0 0 0
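The same cleaning steps can be exercised without the network call. The frame below is a hypothetical stand-in for what read_html returns, including a repeated-header row and a totals row with a missing Rk (note that .loc is used here, since .ix was removed in pandas 1.0):

```python
import numpy as np
import pandas as pd

# Hypothetical raw frame: includes a header-copy row and a totals row,
# mimicking what read_html returns for this kind of stats page.
raw = pd.DataFrame({
    'Rk': ['1', '2', 'Rk', np.nan],
    'Name': ["Travis d'Arnaud", 'Lucas Duda', 'Name', 'Team Totals'],
    'HR': ['3', '4', 'HR', '7'],
})

# Remove the rows that are just a copy of the headers/columns.
clean = raw.loc[raw.apply(lambda x: x != raw.columns, axis=1).all(axis=1), :]

# Take out the Totals rows (their Rk is null).
clean = clean.loc[~clean.Rk.isnull(), :]
```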
I hope this helps.

Thanks for your answer. Does the "[row]" in this line have a name: table_row = table.findAll('tr')[row] -- I've never seen it used together with range like that before.

You're very welcome. It's just an index in this case -- the equivalent of
table_row = table.findAll('tr')[0]
or table_row = table.findAll('tr')[1]
and so on.

So if [row] were left out of that line and you tried to iterate over table_row, you wouldn't be able to?

You would be able to -- but you would get all the tr elements rather than just the one you want.
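The indexing discussed in the comments above can be seen with a tiny example: findAll returns a list-like ResultSet, so [row] is plain list indexing that picks out a single tr tag, while iterating without the index walks over every tr:

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
soup = BeautifulSoup(html, 'html.parser')

rows = soup.findAll('tr')               # list-like ResultSet of every <tr>
first = rows[0]                         # integer indexing: one <tr> Tag
texts = [tr.get_text() for tr in rows]  # iterating visits all <tr> tags
```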