Python 2.7: HTTP Error 403: Forbidden when scraping with BeautifulSoup

I am trying to scrape articles from a certain website with BeautifulSoup, and I keep getting "HTTP Error 403: Forbidden" as output. I was wondering if someone could explain how to get around this? My code is below:

import datetime
import urllib2
from bs4 import BeautifulSoup

url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"

timestamp = datetime.date.today()

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Check if article is from Magharebia.com
# remaining issues: error 403: forbidden. Possible robots.txt?
# Can't scrape anything atm
if "magharebia.com" in url:

    # Create a new file to write content to
    txt = open('%s.txt' % timestamp, "wb")

    # Parse HTML of article, aka making soup
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    # Write the article title to the file
    try:
        title = soup.find("h2")
        txt.write('\n' + "Title: " + str(title) + '\n' + '\n')
    except:
        print "Could not find the title!"

    # Author/Location/Date
    try:
        artinfo = soup.find("h4").text
        txt.write("Author/Location/Date: " + str(artinfo) + '\n' + '\n')
    except:
        print "Could not find the article info!"

    # Retrieve all of the paragraphs
    tags = soup.find("div", {'class': 'body en_GB'}).find_all('p')
    for tag in tags:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')

    # Close txt file with new content added
    txt.close()



Please enter a valid URL: http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03
Traceback (most recent call last):
  File "idle_test.py", line 18, in <module>
    soup = BeautifulSoup(urllib2.urlopen(url).read())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

I was able to reproduce the 403 Forbidden error using urllib2. I haven't dug into why, but the following works for me:

import requests
from bs4 import BeautifulSoup

url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"

soup = BeautifulSoup(requests.get(url).text)

print soup # prints the HTML you are expecting
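If you want to keep the rest of your original script, here is a minimal sketch of the same flow on top of requests. This is untested against the live site; the h2/h4/'body en_GB' selectors are taken from the question as-is:

import datetime
import requests
from bs4 import BeautifulSoup

url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"
timestamp = datetime.date.today()

# Fetch with requests instead of urllib2, then make soup as before
soup = BeautifulSoup(requests.get(url).text)

with open('%s.txt' % timestamp, "wb") as txt:
    # Article title
    title = soup.find("h2")
    if title is not None:
        txt.write("Title: " + title.text.encode('utf-8') + '\n\n')

    # Author/Location/Date
    artinfo = soup.find("h4")
    if artinfo is not None:
        txt.write("Author/Location/Date: " + artinfo.text.encode('utf-8') + '\n\n')

    # All paragraphs of the article body
    body = soup.find("div", {'class': 'body en_GB'})
    if body is not None:
        for tag in body.find_all('p'):
            txt.write(tag.text.encode('utf-8') + '\n\n')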


Can you paste the whole traceback? Is the traceback different from the output on the console? - No, the traceback is the error shown on the console.
Possible duplicate of W...
Can you explain why this happens with the urllib2 urlopen I was using before? - The 403 comes from the website. It probably doesn't like urllib2's default user agent, but it doesn't ban requests' default agent.
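Building on that last comment: if the site is rejecting urllib2's default user agent, the usual way to test that theory with urllib2 itself is to send an explicit User-Agent header. A minimal sketch (the 'Mozilla/5.0' string is just an example value, not a requirement):

import urllib2

url = "http://magharebia.com/en_GB/articles/awi/features/2014/04/14/feature-03"

# urllib2 identifies itself as "Python-urllib/2.7" by default;
# override that with a browser-style User-Agent header.
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()

print html[:200]  # start of the HTML, if the server accepts the request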