Error while scraping in Python, need to bypass

When I run this code to get game-log batting data, I keep getting an error:
import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2

outfile = open("./battingall.csv", "wb")
writer = csv.writer(outfile)
base_url = 'http://www.baseball-reference.com'
player_url = 'http://www.baseball-reference.com/players/'
alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
players = 'shtml'
gamel = '&t=b&year='
game_logs = 'http://www.baseball-reference.com/players/gl.cgi?id='
years = ['2015','2014','2013','2012','2011','2010','2009','2008']

drounders = []
for dround in alphabet:
    drounders.append(player_url + dround)

urlz = []
for ab in drounders:
    data = requests.get(ab)
    soup = BeautifulSoup(data.content)
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            urlz.append(base_url + link['href'])

yent = []
for ant in urlz:
    for d in drounders:
        for y in years:
            if players in ant:
                if len(ant) < 60:
                    if d in ant:
                        yent.append(game_logs + ant[44:-6] + gamel + y)

for j in yent:
    try:
        data = requests.get(j)
        soup = BeautifulSoup(data.content)
        table = soup.find('table', attrs={'id': 'batting_gamelogs'})
        tablea = j[52:59]
        tableb = soup.find("b", text='Throws:').next_sibling.strip()
        tablec = soup.find("b", text='Height:').next_sibling.strip()
        tabled = soup.find("b", text='Weight:').next_sibling.strip()
        list_of_rows = []
        for row in table.findAll('tr'):
            list_of_cells = []
            list_of_cells.append(tablea)
            list_of_cells.append(j[len(j)-4:])
            list_of_cells.append(tableb)
            list_of_cells.append(tablec)
            list_of_cells.append(tabled)
            for cell in row.findAll('td'):
                text = cell.text.replace(' ', '').encode("utf-8")
                list_of_cells.append(text)
            list_of_rows.append(list_of_cells)
        print list_of_rows
        writer.writerows(list_of_rows)
    except (AttributeError, NameError):
        pass
Traceback (most recent call last):
  File "battinggamelogs.py", line 44, in <module>
    data = requests.get(j)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 461, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))
I need a way to bypass this error so the script can continue. I think the error occurs because there is no table to grab data from.

You can wrap your requests.get() call in a try/except block. You need to catch the requests.exceptions.ConnectionError that is being generated:
Traceback (most recent call last):
  File "battinggamelogs.py", line 44, in <module>
    data = requests.get(j)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 461, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))
This is happening because the connection itself has a problem, not because there is no data in the table; you aren't getting that far. You could wrap the first request loop like this:

for ab in drounders:
    try:
        data = requests.get(ab)
        soup = BeautifulSoup(data.content)
        for link in soup.find_all('a'):
            if link.has_attr('href'):
                urlz.append(base_url + link['href'])
    except requests.exceptions.ConnectionError:
        pass

Note: just using pass (as you also do in a later code block) swallows the exception completely. It would be better to replace the pass with something like:

    except requests.exceptions.ConnectionError:
        print("Failed to open {}".format(ab))

This will print a message on the console telling you which URL failed. It also looks like the request simply timed out; try navigating to the exact URL in a browser and see what happens.
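Since an aborted connection is often transient, it can also be worth retrying a URL a couple of times before skipping it. Below is a minimal sketch; the get_with_retries helper and its attempts/backoff/session parameters are illustrative names, not part of the original code:

```python
import time

import requests


def get_with_retries(url, attempts=3, backoff=2.0, session=requests):
    """Retry a GET a few times on connection failures before giving up.

    `session` may be a requests.Session(); it defaults to the requests
    module itself, which exposes the same .get() interface.
    """
    for attempt in range(1, attempts + 1):
        try:
            return session.get(url, timeout=30)
        except requests.exceptions.ConnectionError:
            if attempt == attempts:
                raise  # out of retries: let the caller log and skip this URL
            time.sleep(backoff * attempt)  # wait a little longer each time
```

The main loop would then call data = get_with_retries(j) inside the existing try/except, so a URL is only skipped after several aborted connections rather than on the first one.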