HTTPS preventing website scraping in Python 3

python https web-scraping

I am trying to scrape a website with Python, but the site is secured with "https", and when I run the code below it returns an error:

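# -*- coding: utf-8 -*-
# import libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup
# specify the url
quote_page = 'https://www.bloomberg.com/quote/SPX:IND'
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
# parse the html using Beautiful Soup and store it in the variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# take out the element holding the name and get its value
name_box = soup.find('h1', attrs={'class': 'companyName'})
name = name_box.text.strip()  # strip() is used to remove starting and trailing whitespace
print(name)
# get the index price
price_box = soup.find('div', attrs={'class': 'price_c3a38e1d'})
price = price_box.text
print(price)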

Can you try adding this to your code? It should bypass SSL certificate verification:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
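
The two lines above replace the default HTTPS context for the whole process. A narrower variant, sketched here, builds an unverified context explicitly and passes it to a single `urlopen` call. The helper names `make_unverified_context` and `fetch` are made up for illustration; `ssl.create_default_context` and the `context=` argument of `urllib.request.urlopen` are standard library features.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname has to be disabled before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url, context):
    """Fetch `url` with the given SSL context and return the raw bytes."""
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()


# Example call (performs a network request, so it is left commented out):
# html = fetch('https://www.bloomberg.com/quote/SPX:IND', make_unverified_context())
```

Only the requests made through `fetch` with this context skip verification; everything else in the program keeps the default behaviour.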

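The problem here is that the URL is behind anti-scraping protection, which blocks programmatic HTML extraction.

Try fetching the full response to see what comes back: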

import requests 
from bs4 import BeautifulSoup

#specify the url
quote_page = 'https://www.bloomberg.com/quote/SPX:IND'
result = requests.get(quote_page)
print (result.headers)
#parse the html using beautiful soup and store in variable `soup`
c = result.content
soup = BeautifulSoup(c,"lxml")

print (soup)

Output

{'Cache-Control': 'private, no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html, text/html; charset=utf-8', 'ETag': 'W/"5bae6ca0-97f"', 'Last-Modified': 'Fri, 28 Sep 2018 18:02:08 GMT', 'Server': 'nginx', 'Accept-Ranges': 'bytes, bytes', 'Age': '0, 0', 'Content-Length': '1174', 'Date': 'Sat, 29 Sep 2018 17:03:02 GMT', 'Via': '1.1 varnish', 'Connection': 'keep-alive', 'X-Served-By': 'cache-fra19128-FRA', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1538240583.834133,VS0,VE107', 'Vary': ', Accept-Encoding'}
<html>
<head>
<title>Terms of Service Violation</title>
<style rel="stylesheet" type="text/css">
        .container {
            font-family: Helvetica, Arial, sans-serif;
        }
    </style>
<script>
        window._pxAppId = "PX8FCGYgk4";
        window._pxJsClientSrc = "/8FCGYgk4/init.js";
        window._pxFirstPartyEnabled = true;
        window._pxHostUrl = "/8FCGYgk4/xhr";
        window._pxreCaptchaTheme = "light";

        function qs(name) {
            var search = window.location.search;
            var rx = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)");
            var match = rx.exec(search);
            return match ? decodeURIComponent(match[2].replace(/\+/g, " ")) : null;
        }
    </script>
</head>
<body>
<div class="container">
<img src="https://www.bloomberg.com/graphics/assets/img/BB-Logo-2line.svg" style="margin-bottom: 40px;" width="310"/>
<h1 class="text-center" style="margin: 0 auto;">Terms of Service Violation</h1>
<p>Your usage has been flagged as a violation of our <a href="http://www.bloomberg.com/tos" rel="noopener noreferrer" target="_blank">terms of service</a>.
    </p>
<p>
        For inquiries related to this message please <a href="http://www.bloomberg.com/feedback">contact support</a>.
        For sales
        inquiries, please visit <a href="http://www.bloomberg.com/professional/request-demo">http://www.bloomberg.com/professional/request-demo</a>
</p>
<h3 style="margin: 0 auto;">
        If you believe this to be in error, please confirm below that you are not a robot by clicking "I'm not a robot"
        below.</h3>
<br/>
<div id="px-captcha" style="width: 310px"></div>
<br/>
<h3 style="margin: 0 auto;">Please make sure your browser supports JavaScript and cookies and
        that you are not blocking them from loading. For more information you can review the Terms of Service and Cookie
        Policy.</h3>
<br/>
<h3 id="block_uuid" style="margin: 0 auto; color: #C00;">Block reference ID: </h3>
<script src="/8FCGYgk4/captcha/captcha.js?a=c&amp;m=0"></script>
<script type="text/javascript">document.getElementById("block_uuid").innerText = "Block reference ID: " + qs("uuid");</script>
</div>
</body>
</html>
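
One way to notice this situation in code is to check the response for the markers visible in the page above before trying to parse a quote out of it. The sketch below uses only the standard library `html.parser` module; the names `BlockPageDetector` and `looks_blocked` are hypothetical, and the two markers it checks (the page title and the `px-captcha` element) are simply taken from this particular output, so they may not hold for other sites or later versions of the page.

```python
from html.parser import HTMLParser


class BlockPageDetector(HTMLParser):
    """Collect the <title> text and note whether a px-captcha element exists."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.has_captcha = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        # attrs is a list of (name, value) pairs
        if dict(attrs).get("id") == "px-captcha":
            self.has_captcha = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_blocked(html):
    """Return True if the HTML resembles the block page shown above."""
    detector = BlockPageDetector()
    detector.feed(html)
    return detector.has_captcha or "Terms of Service Violation" in detector.title


sample = ('<html><head><title>Terms of Service Violation</title></head>'
          '<body><div id="px-captcha" style="width: 310px"></div></body></html>')
print(looks_blocked(sample))  # True
```

With `requests`, the same check could be applied to `result.text` before handing the content to BeautifulSoup.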

By the way, if you are a student, you can register for an account with limited download access.

Why are you importing from urllib.request instead of directly using import urllib2?

I saw online that this is what to use for Python 3.

Thanks, I added this and got a new error:

Traceback (most recent call last):
  File "tester.py", line 22, in <module>
    name = name_box.text.strip()  # strip() is used to remove starting and trailing
AttributeError: 'NoneType' object has no attri
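
This error is what you would expect when `soup.find` matches nothing: it returns `None`, and `None` has no `text` attribute. That fits the block page shown in the answer above, which contains no `h1` with class `companyName`. A minimal sketch of guarding against it follows; `text_or_default` is a hypothetical helper, and `SimpleNamespace` only stands in for a BeautifulSoup tag so the example runs without any third-party package.

```python
from types import SimpleNamespace


def text_or_default(element, default=None):
    """Return the stripped text of a parsed element, or `default` if it is None."""
    if element is None:
        return default
    return element.text.strip()


# Stand-in for what soup.find(...) returns when the element exists.
found = SimpleNamespace(text="  S&P 500 Index  ")
print(text_or_default(found))

# Stand-in for what soup.find(...) returns when nothing matches.
print(text_or_default(None, default="<not found>"))
```

In the question's code this would be used as `name = text_or_default(soup.find('h1', attrs={'class': 'companyName'}))`, followed by a check for `None` before printing.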