Python: the same urllib2/BeautifulSoup4 code works on my home computer but not on an AWS EC2 server
So I have a very simple Python BeautifulSoup program that prints the first 1000 characters of a webpage:
from bs4 import BeautifulSoup
import urllib2

def soup_maker(url):
    class RedirectHandler(urllib2.HTTPRedirectHandler):
        def http_error_302(self, req, fp, code, msg, headers):
            result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
            result.status = code
            return result

    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    req = urllib2.Request(url, headers=hdr)
    opener = urllib2.build_opener(RedirectHandler())
    webpage = opener.open(req)
    soup = BeautifulSoup(webpage, "html5lib")
    return soup

if __name__ == "__main__":
    url = 'https://offerupnow.com/'
    print str(soup_maker(url))[0:1000]
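For reference, the same logic can be sketched in Python 3, where urllib2 was split into urllib.request and urllib.error (the BeautifulSoup call is unchanged, so it is omitted here); this is a minimal port, not a fix for the EC2 problem:

```python
# Python 3 sketch of the question's fetch logic.
from urllib.request import Request, build_opener, HTTPRedirectHandler
from urllib.error import HTTPError

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
                  '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

class RedirectHandler(HTTPRedirectHandler):
    """Stop following 302s and surface the redirect response itself."""
    def http_error_302(self, req, fp, code, msg, headers):
        result = HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result

def fetch(url):
    # Build the request with browser-like headers and open it
    # through an opener that does not follow 302 redirects.
    req = Request(url, headers=HEADERS)
    opener = build_opener(RedirectHandler())
    return opener.open(req).read()

# Usage (needs network access):
# print(fetch('https://offerupnow.com/')[:1000])
```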
On my home computer, it outputs:
<!DOCTYPE html>
<html lang="en-US"><head>
<title>OfferUp - Buy. Sell. Simple.</title>
<meta charset="utf-8"/>
<meta content="OfferUp, Offer Up, social shopping, online deals, classifieds, Buy local stuff, Local stuff for sale, Shop local, Local shopping, Local marketplace, Local yard sales, Local garage sales, Gently used baby stuff, Sell locally, Buy locally, Sell stuff online" name="keywords"/>
<meta content="OfferUp is revolutionizing how we sell by making it a snap! Instantly connect with buyers and sellers near you." name="description"/>
<meta content="hVckgnfxPSIIYHASW6k-BapqZdaFc19eRe0nI8CneNM" name="google-site-verification"/>
<meta content="1d7e114ee3af2b13ced8508628f804b9" name="p:domain_verify"/>
<meta content="NOODP" name="robots"/>
<meta content="index,follow" name="robots"/>
<meta content="summary" name="twitter:card"/>
<meta content="@offerup" name="twitter:site"/>
<meta content="OfferUp" name="twi
But on the AWS EC2 server, the same code prints this instead:

<html><head>
<meta content="noindex,nofollow" name="robots"/>
<script>
(function() { function getSessionCookies() { cookieArray = new Array(); var cName = /^\s?incap_ses_/; var c = document.cookie.split(";"); for (var i = 0; i < c.length; i++) { key = c[i].substr(0, c[i].indexOf("=")); value = c[i].substr(c[i].indexOf("=") + 1, c[i].length); if (cName.test(key)) { cookieArray[cookieArray.length] = value } } return cookieArray } function setIncapCookie(vArray) { try { cookies = getSessionCookies(); digests = new Array(cookies.length); for (var i = 0; i < cookies.length; i++) { digests[i] = simpleDigest((vArray) + cookies[i]) } res = vArray + ",digest=" + (digests.join()) } catch (e) { res = vArray + ",digest=" + (encodeURIComponent(e.toString())) } createCookie("___utmvc", res, 20) } function simpleDigest(mystr) { var res = 0; for (var i = 0; i < mystr.length; i++) { res += mystr.charCodeAt(i) } return res } fun
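The truncated script above is a bot-mitigation challenge page: it collects the values of cookies whose names start with `incap_ses_`, sums the character codes of each value (`simpleDigest`), and writes the result back in a `___utmvc` cookie. The digest step is trivial to mirror in Python; a sketch of just that function, named `simple_digest` here for illustration:

```python
def simple_digest(s):
    # Mirrors the page's simpleDigest(): sum of the character
    # codes of the string (charCodeAt in the original JS).
    return sum(ord(ch) for ch in s)

# ord('a') + ord('b') == 97 + 98
print(simple_digest('ab'))  # → 195
```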
Comment: The site you are scraping may be blocking requests from AWS IP addresses, since such requests are very likely to come from scraping bots.

Reply: @snakecharmerb I think you may be right. Is there a way around this?
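One way to confirm that the EC2 failure is bot mitigation rather than a code difference is to check the response for the markers visible in the EC2 output above (the `incap_ses_` cookie prefix, the `___utmvc` cookie name, and the `noindex,nofollow` stub page). A heuristic sketch using those markers, with a hypothetical helper name:

```python
def looks_like_challenge(html):
    # Heuristic: the blocked response above is a tiny script-only
    # page that sets session cookies instead of serving content.
    markers = ('incap_ses_', '___utmvc', 'noindex,nofollow')
    return any(m in html for m in markers)

# The home-computer response contains none of the markers:
print(looks_like_challenge(
    '<title>OfferUp - Buy. Sell. Simple.</title>'))  # → False
```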