Python 2.7: "HTTP Error 403: Forbidden" output when using BeautifulSoup
I am trying to scrape articles from a website with BeautifulSoup. I keep getting "HTTP Error 403: Forbidden" as output. I was wondering if someone could explain to me how to overcome this? Below is my code:
url: http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03

timestamp = datetime.date.today()

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Check if article is from Magharebia.com
# remaining issues: error 403: forbidden. Possible robots.txt?
# Can't scrape anything atm
if "magharebia.com" in url:

    # Create a new file to write content to
    #txt = open('%s.txt' % timestamp, "wb")

    # Parse HTML of article, aka making soup
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    # Write the article title to the file
    try:
        title = soup.find("h2")
        txt.write('\n' + "Title: " + str(title) + '\n' + '\n')
    except:
        print "Could not find the title!"

    # Author/Location/Date
    try:
        artinfo = soup.find("h4").text
        txt.write("Author/Location/Date: " + str(artinfo) + '\n' + '\n')
    except:
        print "Could not find the article info!"

    # Retrieve all of the paragraphs
    tags = soup.find("div", {'class': 'body en_GB'}).find_all('p')
    for tag in tags:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')

    # Close txt file with new content added
    txt.close()
Please enter a valid URL: http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03
Traceback (most recent call last):
File "idle_test.py", line 18, in <module>
soup = BeautifulSoup(urllib2.urlopen(url).read())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
I was able to reproduce the 403 Forbidden error using urllib2. I haven't dug into why, but the following worked for me:
import requests
from bs4 import BeautifulSoup
url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"
soup = BeautifulSoup(requests.get(url).text)
print soup # prints the HTML you are expecting
Comment: Can you paste the entire traceback? Is the traceback different from the output on the console?
Reply: No, the traceback is the error shown on the console.
Comment: Can you explain why this happens with the urllib2 urlopen I was using before?
Reply: The 403 comes from the website. It probably doesn't like urllib2's default user agent, but it doesn't ban requests' default agent.
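Following up on that last comment: if you want to stay with urllib2 rather than switch to requests, a common workaround for this kind of 403 is to send a browser-like User-Agent header yourself. This is a minimal sketch, not a tested fix for this particular site; the "Mozilla/5.0" value is just a placeholder, and the try/except import is only there so the same snippet also runs under Python 3 (where urllib2 became urllib.request):

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # same Request/urlopen API in Python 3

url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"

# Build a request that carries a browser-like User-Agent instead of
# urllib2's default "Python-urllib/x.y", which some servers reject
# with 403 Forbidden.
req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# urllib2.urlopen(req) would then fetch the page with that header;
# the response's .read() can be fed to BeautifulSoup as before.
```

Whether this works depends on what the server is actually filtering on; if it is blocking by IP or via robots.txt enforcement, changing the User-Agent won't help.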