Python: How to generate URL strings in BeautifulSoup
I've gone through some tutorials and read a book on the basics of BeautifulSoup, and I wrote this scraper, but I can't get it to loop through the URLs a-z or page through the results. For this project I'm scraping a website, and I'd like it to scrape a-z rather than just one page of results. The code below worked until I tried to make it generate the final letter of the URL string. Below that is my non-working code, where I try to build the URL string. Ideally I'd like to pull the letters from a file or a predefined list, but baby steps. My error is shown below.
That particular site has no "x" page, so you get a 404. Wrap the request in a try block so that 404 pages are skipped, and it should work.
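from string import ascii_lowercase
from urllib.error import HTTPError

# make_soup() is the helper defined in the question.
playerdatasaved = ""
for letter in ascii_lowercase:
    try:
        soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
    except HTTPError:
        # No index page for this letter (for example "x"), so skip it.
        continue
    for record in soup.find_all("tr"):
        playerdata = ""
        for data in record.find_all("td"):
            playerdata = playerdata + "," + data.text
        # Skip rows without any <td> cells.
        if len(playerdata) != 0:
            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]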
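When skipping pages this way, it can help to catch only `urllib.error.HTTPError` and keep a record of what was skipped, so that other failures are not silently hidden. The following is a minimal sketch under that assumption, not a definitive implementation. The names `read_url`, `fetch_letters`, and the `fetch` parameter are hypothetical and do not appear in the original post; `fetch` exists only so the loop can be tried without touching the network.

```python
import urllib.error
import urllib.request
from string import ascii_lowercase

BASE_URL = "http://www.basketball-reference.com/players/"


def read_url(url):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_letters(letters=ascii_lowercase, fetch=read_url):
    """Fetch one index page per letter.

    Returns (pages, skipped): pages maps letter -> body, and skipped is a
    list of (letter, HTTP status code) for requests that failed.
    """
    pages = {}
    skipped = []
    for letter in letters:
        url = BASE_URL + letter + "/"
        try:
            pages[letter] = fetch(url)
        except urllib.error.HTTPError as err:
            # Only HTTP errors (such as 404) are skipped; anything else still raises.
            skipped.append((letter, err.code))
    return pages, skipped
```

Calling `fetch_letters()` with no arguments would request all 26 pages from the site; printing `skipped` afterwards shows which letters had no page and which status code each one returned. Each body in `pages` can then be passed to `BeautifulSoup(body, "html.parser")`.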
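Comments:
The error says that no response was found for that URL, so you are generating a URL that is not available on the server.
I thought so too, but can you tell me why? Any URL with a-z should be valid. What is wrong with my setup? I posted the first version, which crawls one URL; I need multiple.
I can't tell you why the server decided to work the way its programmers intended. Also, in your case, judging by the answer, it looks like "x" is not available (404).
I don't think you understand my goal. The title and my question are "how to generate URL strings in BeautifulSoup". I have the first part of the URL working, and the attempt I posted seems very close to working, but clearly something is wrong with how I generate the last part of the URL.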
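For what it's worth, the URL construction itself can be checked without any network access. The sketch below is only an illustration: `letter_urls` and `letters_from_file` are hypothetical helper names, and the one-letter-per-line file format is an assumption, since the post only says the letters should eventually come from a file or a predefined list.

```python
from string import ascii_lowercase
from urllib.parse import urljoin

BASE_URL = "http://www.basketball-reference.com/players/"


def letter_urls(letters=ascii_lowercase, base=BASE_URL):
    """Return one index URL per letter, e.g. .../players/a/."""
    # urljoin appends the relative part because base ends with a slash.
    return [urljoin(base, letter + "/") for letter in letters]


def letters_from_file(path):
    """Read one letter per line from a text file, ignoring blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# A predefined list works the same way as the full alphabet.
for url in letter_urls(["a", "b", "c"]):
    print(url)
```

Printing the list like this makes it easy to confirm that the generated strings match the one URL that already works, which separates "the URL is built wrongly" from "the server has no page at that URL".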
Traceback (most recent call last):
  File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 15, in <module>
    soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
  File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 8, in make_soup
    thepage = urllib.request.urlopen(url)
  File "C:\Python36\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python36\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36\lib\urllib\request.py", line 564, in error
    result = self._call_chain(*args)
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 756, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python36\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
import os
import urllib.request

from bs4 import BeautifulSoup


def make_soup(url):
    # Download the page and parse it with Python's built-in HTML parser.
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


playerdatasaved = ""
soup = make_soup("http://www.basketball-reference.com/players/a/")
for record in soup.find_all("tr"):
    playerdata = ""
    # Join the text of every <td> cell in the row with commas.
    for data in record.find_all("td"):
        playerdata = playerdata + "," + data.text
    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College" + "\n"
# Open in binary mode; the with block closes the file when writing is done.
with open(os.path.expanduser("Basketball.csv"), "wb") as file:
    file.write(bytes(header, encoding="ascii", errors="ignore"))
    file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))
print(playerdatasaved)
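One thing to watch with joining cells by hand: any cell whose text contains a comma (a birth date written as "Month D, YYYY", for instance) will split into extra columns. A possible alternative, sketched here with the standard `csv` module, is to collect each row as a list of cell strings and let the writer handle quoting. `write_rows` is a hypothetical helper, and the sample row is made-up placeholder data, not real output from the site.

```python
import csv
import io

HEADER = ["Player", "From", "To", "Pos", "Ht", "Wt", "Birth Date", "College"]


def write_rows(rows, out):
    """Write the header and every non-empty row to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        if row:  # skip rows that had no <td> cells
            writer.writerow(row)


# Placeholder data: one made-up row and one empty row.
sample_rows = [
    ["Doe, John", "1990", "1995", "G", "6-3", "190", "January 1, 1968", "Example U"],
    [],
]
buffer = io.StringIO()
write_rows(sample_rows, buffer)
print(buffer.getvalue())
```

In the scraper, each row would be built as `[data.text for data in record.find_all("td")]`, and `out` would be a file opened with `open("Basketball.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.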