Python: Finding a table inside a div with BeautifulSoup
I'm trying to write something that extracts the point spread for NFL games. Pro Football Reference lists all of the games, and I'm trying to go one step further from there. I have an example box score page that I want to scrape. My code so far is:
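from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_spread(row):
    # Follow the last link in the row to the game's box score page
    a = row.findAll('a', href=True)
    box_link = 'https://www.pro-football-reference.com/' + a[-1]['href']
    temp_soup = BeautifulSoup(urlopen(box_link), 'html.parser')
    # The "Game Info" section is the div with id="all_game_info"
    table = temp_soup.find('div', {'id': 'all_game_info'})
    return table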
Here, row is taken from the rows returned by soup.findAll('tbody', limit=1)[0].findAll('tr').
Ignoring that and just trying to scrape the example page: if I use table = temp_soup.find('div', {'id': 'all_game_info'}), the table I get is:
<div class="table_wrapper setup_commented commented" id="all_game_info">
<div class="section_heading">
<span class="section_anchor" data-label="Game Info" id="game_info_link"></span><h2>Game Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_game_info">
<table class="suppress_all sortable stats_table" id="game_info" data-cols-to-freeze="0"><caption>Game Info Table</caption><tr class="thead onecell" ><td class="right center" data-stat="onecell" colspan="2" >Game Info</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Won Toss</th><td class="center " data-stat="stat" >Chiefs (deferred)</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Roof</th><td class="center " data-stat="stat" >outdoors</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Surface</th><td class="center " data-stat="stat" >fieldturf </td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Duration</th><td class="center " data-stat="stat" >3:37</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Attendance</th><td class="center " data-stat="stat" ><a href="/years/2017/attendance.htm">65,878</a></td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Weather</th><td class="center " data-stat="stat" >63 degrees, wind 8 mph</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Vegas Line</th><td class="center " data-stat="stat" >New England Patriots -8.0</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Over/Under</th><td class="center " data-stat="stat" >47.5 <b>(over)</b></td></tr>
</table>
</div>
</div>
-->
</div>
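I want the last two rows ('Vegas Line' and 'Over/Under'), but if I run table.findAll('tr') I get nothing back, and the same happens when I look for 'td', 'table', or 'th'. So I'm curious how to extract these values from the table variable.

The <table> is inside an HTML comment (<!-- ... -->), so BeautifulSoup sees it as a single comment string rather than as tags. An extra step is needed to extract it: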
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://www.pro-football-reference.com/boxscores/201709070nwe.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.select_one('h2:contains("Game Info")').find_next(text=lambda t: isinstance(t, Comment))
# load <table> from HTML comments <!-- ... -->
soup = BeautifulSoup(str(table), 'html.parser')
vegas_line = soup.select_one('th:contains("Vegas Line")').find_next('td').text
over_under = soup.select_one('th:contains("Over/Under")').find_next('td').text
print(vegas_line)
print(over_under)
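The same idea can be tried offline. The following is a minimal sketch, not the answerer's code: it assumes the page layout matches the dump in the question, uses a trimmed-down inline copy of that HTML instead of a network request, and introduces the hypothetical names SAMPLE_HTML and parse_commented_table. It finds the Comment node inside the div, re-parses its text as HTML, and collects every row into a dict.

```python
from bs4 import BeautifulSoup, Comment

# Trimmed-down stand-in for the box score page
# (assumption: same layout as the dump in the question).
SAMPLE_HTML = """
<div class="table_wrapper setup_commented commented" id="all_game_info">
<div class="placeholder"></div>
<!--
<table id="game_info">
<tr><th data-stat="info">Roof</th><td data-stat="stat">outdoors</td></tr>
<tr><th data-stat="info">Vegas Line</th><td data-stat="stat">New England Patriots -8.0</td></tr>
<tr><th data-stat="info">Over/Under</th><td data-stat="stat">47.5 <b>(over)</b></td></tr>
</table>
-->
</div>
"""


def parse_commented_table(html, div_id):
    """Return {row label: value} for a table hidden in a comment inside the given div."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", {"id": div_id})
    if wrapper is None:
        return {}
    # Comment nodes are strings, not tags, so they have to be searched for explicitly.
    comment = wrapper.find(string=lambda s: isinstance(s, Comment))
    if comment is None:
        return {}
    # Re-parse the comment body so the <table> markup becomes real tags.
    inner = BeautifulSoup(str(comment), "html.parser")
    info = {}
    for tr in inner.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is not None and td is not None:
            info[th.get_text(strip=True)] = td.get_text(" ", strip=True)
    return info


game_info = parse_commented_table(SAMPLE_HTML, "all_game_info")
print(game_info["Vegas Line"])
print(game_info["Over/Under"])
```

Returning a dict keeps the other rows (roof, surface, weather and so on) available without another pass over the page.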
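Oh, I didn't know that about HTML. I guess that's why it didn't show up in soup.findAll('table')?
@yankefan11 Yes, that's exactly the reason.
This gives me a "NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type." on the table = line?
@yankefan11 You're using an ancient version of beautifulsoup. I'm using version beautifulsoup4==4.9.1. Try updating the module.
Thanks. Google Colab wasn't up to date.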
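If upgrading is not possible, one possible workaround is to avoid the CSS :contains pseudo-class altogether and match on the header cell's text with find(). This is only a sketch under assumptions: it is not part of the original answer, TABLE_HTML and stat_for are hypothetical names, and it relies on the string= argument, which the Beautiful Soup documentation lists as new in 4.4.0 (earlier versions call it text=).

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the table recovered from the HTML comment.
TABLE_HTML = """
<table id="game_info">
<tr><th>Vegas Line</th><td>New England Patriots -8.0</td></tr>
<tr><th>Over/Under</th><td>47.5 <b>(over)</b></td></tr>
</table>
"""


def stat_for(soup, label):
    """Return the text of the <td> after the <th> whose text equals label, or None."""
    # Matching on string= avoids select()/select_one() and their pseudo-class support.
    th = soup.find("th", string=label)
    if th is None:
        return None
    td = th.find_next("td")
    return td.get_text() if td is not None else None


# Printing the installed version helps confirm whether the environment is out of date.
print("bs4 version:", bs4.__version__)

table_soup = BeautifulSoup(TABLE_HTML, "html.parser")
print(stat_for(table_soup, "Vegas Line"))
print(stat_for(table_soup, "Over/Under"))
```

Returning None for a missing label lets the caller decide how to handle pages that lack a betting line.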
Output printed by the answer's code:
New England Patriots -8.0
47.5 (over)