Python 2.7: "HTTP Error 403: Forbidden" output when using BeautifulSoup
I am trying to scrape articles from a website with BeautifulSoup. I keep getting "HTTP Error 403: Forbidden" as output. I was wondering if someone could explain to me how to overcome this? Below is my code:
url: http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03

timestamp = datetime.date.today()

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Check if article is from Magharebia.com
# remaining issues: error 403: forbidden. Possible robots.txt?
# Can't scrape anything atm
if "magharebia.com" in url:

    # Create a new file to write content to
    #txt = open('%s.txt' % timestamp, "wb")

    # Parse HTML of article, aka making soup
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    # Write the article title to the file
    try:
        title = soup.find("h2")
        txt.write('\n' + "Title: " + str(title) + '\n' + '\n')
    except:
        print "Could not find the title!"

    # Author/Location/Date
    try:
        artinfo = soup.find("h4").text
        txt.write("Author/Location/Date: " + str(artinfo) + '\n' + '\n')
    except:
        print "Could not find the article info!"

    # Retrieve all of the paragraphs
    tags = soup.find("div", {'class': 'body en_GB'}).find_all('p')
    for tag in tags:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')

    # Close txt file with new content added
    txt.close()
Please enter a valid URL: http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03
Traceback (most recent call last):
File "idle_test.py", line 18, in <module>
soup = BeautifulSoup(urllib2.urlopen(url).read())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
I was able to reproduce the 403 Forbidden error using urllib2. I haven't dug into why, but the following worked for me:
import requests
from bs4 import BeautifulSoup
url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"
soup = BeautifulSoup(requests.get(url).text)
print soup # prints the HTML you are expecting
Comment: Can you paste the entire traceback? Is the traceback different from the output on the console?
Reply: No, the traceback is the error shown on the console.
Comment: Can you explain why this happens with the urllib2 urlopen I was using before?
Reply: The 403 comes from the website. It probably doesn't like urllib2's default user agent, but it doesn't ban requests' default agent.
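Following up on that last comment: if you want to stay with urllib2 rather than switch to requests, a common workaround for this kind of 403 is to send a browser-like User-Agent header yourself. This is a minimal sketch, not a tested fix for this particular site; the "Mozilla/5.0" value is just a placeholder, and the try/except import is only there so the same snippet also runs under Python 3 (where urllib2 became urllib.request):

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # same Request/urlopen API in Python 3

url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"

# Build a request that carries a browser-like User-Agent instead of
# urllib2's default "Python-urllib/x.y", which some servers reject
# with 403 Forbidden.
req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# urllib2.urlopen(req) would then fetch the page with that header;
# the response's .read() can be fed to BeautifulSoup as before.
```

Whether this works depends on what the server is actually filtering on; if it is blocking by IP or via robots.txt enforcement, changing the User-Agent won't help.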