How to use a loop to extract data from a table and get all the td data with Python
So I'm trying to get some data from a website, and I'm having trouble getting it. I can get the players' names, but that's about it for now. I've been trying different things, but the results haven't been what I want. Note that there are two tables (one per team), and the class of each player row alternates between "odd" and "even". Below is a sample HTML file and my Python script; I've labeled the parts I want. I'm using Python 2.7.
`<table id="nbaGITeamStats" cellpadding="0" cellspacing="0">
<thead class="nbaGIClippers">
<tr>
<th colspan="17">Los Angeles Clippers (1-0)</th> <!-- I want team name -->
</tr>
</thead>
<tbody><tr colspan="17">
<td colspan="17" class="nbaGIBoxCat"><span>field goals</span><span>rebounds</span></td>
</tr>
<tr>
<td class="nbaGITeamHdrStatsNoBord" colspan="1"> </td>
<td class="nbaGITeamHdrStats">pos</td>
<td class="nbaGITeamHdrStats">min</td>
<td class="nbaGITeamHdrStats">fgm-a</td>
<td class="nbaGITeamHdrStats">3pm-a</td>
<td class="nbaGITeamHdrStats">ftm-a</td>
<td class="nbaGITeamHdrStats">+/-</td>
<td class="nbaGITeamHdrStats">off</td>
<td class="nbaGITeamHdrStats">def</td>
<td class="nbaGITeamHdrStats">tot</td>
<td class="nbaGITeamHdrStats">ast</td>
<td class="nbaGITeamHdrStats">pf</td>
<td class="nbaGITeamHdrStats">st</td>
<td class="nbaGITeamHdrStats">to</td>
<td class="nbaGITeamHdrStats">bs</td>
<td class="nbaGITeamHdrStats">ba</td>
<td class="nbaGITeamHdrStats">pts</td>
</tr>
<tr class="odd">
<td id="nbaGIBoxNme" class="b"><a href="/playerfile/paul_pierce/index.html">P. Pierce</a></td> <!-- I want player name -->
<td class="nbaGIPosition">F</td> <!-- I want position name -->
<td>14:16</td> <!-- I want this -->
<td>1-4</td> <!-- I want this -->
<td>1-2</td> <!-- I want this -->
<td>2-2</td> <!-- I want this -->
<td>+12</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>3</td> <!-- I want this -->
<td>2</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>5</td> <!-- I want this -->
</tr>
<tr class="even">
<td id="nbaGIBoxNme" class="b"><a href="/playerfile/blake_griffin/index.html">B. Griffin</a></td> <!-- I want this -->
<td class="nbaGIPosition">F</td> <!-- I want this -->
<td>26:19</td> <!-- I want this -->
<td>5-14</td> <!-- I want this -->
<td>0-1</td> <!-- I want this -->
<td>1-1</td> <!-- I want this -->
<td>+14</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>5</td> <!-- I want this -->
<td>5</td> <!-- I want this -->
<td>2</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>11</td> <!-- I want this -->
</tr>
<tr class="odd">
<td id="nbaGIBoxNme" class="b"><a href="/playerfile/deandre_jordan/index.html">D. Jordan</a></td> <!-- I want this -->
<td class="nbaGIPosition">C</td> <!-- I want this -->
<td>26:27</td> <!-- I want this -->
<td>6-7</td> <!-- I want this -->
<td>0-0</td> <!-- I want this -->
<td>3-5</td> <!-- I want this -->
<td>+19</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>11</td> <!-- I want this -->
<td>12</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>1</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>2</td> <!-- I want this -->
<td>3</td> <!-- I want this -->
<td>0</td> <!-- I want this -->
<td>15</td> <!-- I want this -->
</tr>
<!-- And so on it will keep changing class from odd to even, even to odd -->
<!-- Also note there are two tables, one for each team -->
<!-- this is the table id >>> <table id="nbaGITeamStats" cellpadding="0" cellspacing="0"> -->`
Writing it this way is correct:
for tr in soup.find_all('table', id='nbaGITeamStats'):
This works fine for me (Python 3.4).
To access the content of a td tag, use .text, like this:
for td in tds:
    print(td.text)
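To make the loop above concrete, here is a minimal sketch that runs against a trimmed copy of the question's markup rather than the live page. The `HTML` string and the helper name `rows_as_text` are assumptions made for illustration; only `find_all` and `get_text` come from bs4. It collects the text of every td per row, which also shows that a cell such as the minutes value can be picked out by its position.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for one team table from the question.
HTML = """
<table id="nbaGITeamStats">
<thead><tr><th colspan="4">Los Angeles Clippers (1-0)</th></tr></thead>
<tbody>
<tr><td class="nbaGITeamHdrStats">pos</td><td class="nbaGITeamHdrStats">min</td></tr>
<tr class="odd"><td id="nbaGIBoxNme" class="b"><a href="#">P. Pierce</a></td>
<td class="nbaGIPosition">F</td><td>14:16</td><td>1-4</td></tr>
<tr class="even"><td id="nbaGIBoxNme" class="b"><a href="#">B. Griffin</a></td>
<td class="nbaGIPosition">F</td><td>26:19</td><td>5-14</td></tr>
</tbody>
</table>
"""


def rows_as_text(html):
    """Return one list of stripped cell strings per <tr> that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table", id="nbaGITeamStats"):
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # the <thead> row only has <th>, so it is skipped
                result.append(cells)
    return result


rows = rows_as_text(HTML)
for cells in rows:
    print(cells)

# Cells are addressable by position: in this fragment index 2 is the minutes column.
minutes = rows[1][2]
```

On the real page the column positions would need to be counted against the full header row, since this fragment keeps only a few columns.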
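Here's my solution. Note that I have a slightly different version of BeautifulSoup (it's not from bs4), so the logic may differ slightly. Still on Python 2.7 (on Windows, in my case). You may need to fix a few details for player rows that differ from the ones shown above, but I think you'll be able to handle that part :-)
For bs4 (BeautifulSoup4, as I've since learned), some modifications are needed. You'll still have to handle a few things yourself, but this extracts most of the data you need: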
import urllib
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
    url = "http://www.nba.com/"+game
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")
    # fetch the tables you are interested in
    tables = soup.findAll(id="nbaGITeamStats")
    for table in tables:
        team_name = table.thead.tr.th.text
        # odd/even class rows (tr)
        rows = table.find_all(attrs={'class':'odd'})
        rows.extend(table.find_all(attrs={'class':'even'}))
        for player in rows:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
            # search the row cols based on 'class'
            player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_position, player_numbers
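The `player_numbers` list printed above is purely positional, so it can help to attach the column labels to it. The sketch below is one possible way to do that; the `STAT_COLUMNS` list is copied from the header row of the sample HTML in the question, and the helper name `label_stats` is an assumption made for illustration. It uses only the standard library and runs on Python 2.7 and 3.

```python
# Column labels copied from the header row of the sample HTML (everything after "pos").
STAT_COLUMNS = ["min", "fgm-a", "3pm-a", "ftm-a", "+/-", "off", "def", "tot",
                "ast", "pf", "st", "to", "bs", "ba", "pts"]


def label_stats(player_numbers):
    """Pair each positional value with its column label and return a dict."""
    if len(player_numbers) != len(STAT_COLUMNS):
        # A row with a different cell count would silently misalign, so fail loudly.
        raise ValueError("expected %d values, got %d"
                         % (len(STAT_COLUMNS), len(player_numbers)))
    return dict(zip(STAT_COLUMNS, player_numbers))


# Values taken from the P. Pierce row in the sample HTML.
pierce = label_stats(["14:16", "1-4", "1-2", "2-2", "+12", "1", "0", "1",
                      "1", "3", "2", "0", "0", "0", "5"])
print("%s min, %s pts" % (pierce["min"], pierce["pts"]))
```

The length check is a design choice: rows that do not carry all fifteen cells would otherwise be labeled incorrectly without any warning.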
This is everything I ended up with. Of course, I still have to clean the code up from here. It was done with a lot of help from sal.
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
    url = "http://www.nba.com/"+game
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")
    # fetch the tables you are interested in
    tables = soup.findAll(id="nbaGITeamStats")
    for table in tables:
        team_name = table.thead.tr.th.text
        # odd/even class rows (tr)
        rowsodd = table.find_all(attrs={'class':'odd'})
        rowseven = table.find_all(attrs={'class':'even'})
        for player in rowsodd:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
            # search the row cols based on 'class'
            #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_numbers
        for player in rowseven:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
            # search the row cols based on 'class'
            #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_numbers
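The two loops above could probably be merged into one pass that keeps the rows in page order and tolerates a missing position cell, which, per the comment in the code, only exists for players who have entered the game. This is a sketch under assumptions: the `HTML` fragment, the player "A. Bench", and the helper names `player_rows` and `parse_player` are hypothetical; `find`, `find_all`, `get`, `has_attr` and `get_text` are the bs4 calls it relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second player has not entered the game,
# so his row has no nbaGIPosition cell.
HTML = """
<table id="nbaGITeamStats"><tbody>
<tr class="odd"><td id="nbaGIBoxNme" class="b">P. Pierce</td>
<td class="nbaGIPosition">F</td><td>14:16</td><td>1-4</td></tr>
<tr class="even"><td id="nbaGIBoxNme" class="b">A. Bench</td>
<td>00:00</td><td>0-0</td></tr>
</tbody></table>
"""


def player_rows(table):
    """Yield the rows whose class is 'odd' or 'even', in document order."""
    for tr in table.find_all("tr"):
        # bs4 returns the class attribute as a list of strings
        if set(tr.get("class", [])) & {"odd", "even"}:
            yield tr


def parse_player(tr):
    """Return (name, position or None, list of unclassed cell texts)."""
    name = tr.find("td", id="nbaGIBoxNme").get_text(strip=True)
    pos_cell = tr.find("td", class_="nbaGIPosition")
    position = pos_cell.get_text(strip=True) if pos_cell else None
    numbers = [td.get_text(strip=True)
               for td in tr.find_all("td") if not td.has_attr("class")]
    return name, position, numbers


soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", id="nbaGITeamStats")
players = [parse_player(tr) for tr in player_rows(table)]
for player in players:
    print(player)
```

Checking `pos_cell` before reading its text avoids the `AttributeError` that `.find(...).text` raises when the cell is absent.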
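Now everything shows up. I still need to tidy it a bit, but the data is much cleaner. As you can tell from this question, I'm not very experienced with BeautifulSoup. It takes two loops; maybe someone knows a better way, but this was the easiest way for me to get the data I was looking for, and I'm always open to improvements. I hope others can learn from it.
Thanks, this works for the tds. I'm trying to figure out how to get the td that holds 14:16. Is there a way to target a td by its position?
Yes, you can get 14:16 by calling .text on the td you want. Just count which one you need, or add a condition to pick it out.
This looks like it pins things down the way I want, but I don't think it works with my BeautifulSoup version. I'll try tweaking it a bit; thanks for the reply.
If it helps, I installed BeautifulSoup via pip install BeautifulSoup. I'm on Windows 10 with Python 2.7.
For some reason this doesn't print anything. I used the second part you provided, but nothing is printed. I can print the tables, and they show the data. I can also print the team name. But when I move on to the next line, it shows an empty list. If I put the team name at the bottom, the player names and everything else don't print either, which is strange.
I copied the code verbatim and it ran fine until it crashed (I mentioned that you'd need to fix a few things), but it did print this much.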
# BeautifulSoup 3 variant (the non-bs4 version mentioned in the answer above)
import urllib
import urllib2
# from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
    url = "http://www.nba.com/"+game
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    # fetch the tables you are interested in
    tables = soup.findAll(id="nbaGITeamStats")
    for table in tables:
        team_name = table.thead.tr.th.text
        # odd/even class rows (tr)
        rows = [ x for x in table.findAll('tr') if x.get('class',None) in ['odd','even'] ]
        for player in rows:
            # search the row cols based on 'id'
            player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
            # search the row cols based on 'class'
            player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
            # search for all td where the class is not defined
            player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
            print player_name, player_position, player_numbers
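The scripts above print every stat as a string such as 14:16, 1-4 or +12. As part of the cleanup mentioned in the thread, those strings could be converted to numbers. The helpers below are a minimal sketch of that idea; the names `minutes_to_seconds` and `made_attempted` are assumptions made for illustration, and the formats are inferred from the sample HTML only.

```python
def minutes_to_seconds(clock):
    """Convert an "MM:SS" string such as "14:16" to total seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def made_attempted(cell):
    """Split a "made-attempted" string such as "1-4" into a pair of ints."""
    made, attempted = cell.split("-")
    return int(made), int(attempted)


# Values taken from the P. Pierce row in the sample HTML.
print(minutes_to_seconds("14:16"))  # 856
print(made_attempted("1-4"))        # (1, 4)
print(int("+12"))                   # 12, int() accepts a leading sign
```

Cells that are empty or hold other text on the real page would raise `ValueError` here, so a production version would likely need to handle those cases.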