Can't scrape a specific website using Python requests

python web-scraping

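I am trying to scrape the URL below, but the response does not contain the content I see when I visit the page in a browser (the content of a public customer case/story). I also tried simulating a real browser by sending headers, but nothing has worked so far. Any tips?
URL: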

You need to handle certificates correctly. This requires some additional packages:

    pip install certifi
    pip install urllib3

We need to use a different Python library, namely urllib3.

    python
    Python 3.7.7 (default, Mar 10 2020, 15:43:33)
    [Clang 11.0.0 (clang-1100.0.33.17)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import requests
    >>> import certifi
    >>> import urllib3
    >>>
    >>> http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    >>> main_url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"
    >>>
    >>> r = http.request('GET', main_url)
    >>> r.status
    200
    >>> open("stories.html", "wb").write(r.data)

Output:

    >>> r.data
    b'\r\n<!doctype html>\r\n<html lang="en" xml:lang="en" dir="ltr">\r\n<head prefix="og: http://ogp.me/ns#">\r\n <meta charset="utf-8" />\r\n <meta name="viewport" content="width=device-width, initial-scale=1.0" />\r\n <meta name="description" content="Microsoft customer stories. See how Microsoft tools help companies run their business.">\r\n <meta name="keywords" content="Microsoft, customers, stories, business, software, tools, services, use case, global, collaboration, vendor, story sear .....
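A quick way to tell whether the downloaded bytes really contain the story is to search them for a phrase that is visible in the browser. The following is only a minimal sketch: `contains_phrase` is a hypothetical helper, the sample bytes are a shortened stand-in for `r.data`, and UTF-8 is assumed because the page's meta tag declares it.

```python
def contains_phrase(page_bytes, phrase, encoding="utf-8"):
    """Return True if `phrase` occurs in the downloaded page, ignoring case."""
    # Undecodable bytes are replaced rather than raising, so the check never
    # fails on a stray byte in the response.
    text = page_bytes.decode(encoding, errors="replace")
    return phrase.casefold() in text.casefold()


# Hypothetical, shortened stand-in for r.data from the session above.
sample = (
    b'\r\n<!doctype html>\r\n<html lang="en">\r\n<head>\r\n'
    b' <meta charset="utf-8" />\r\n</head>\r\n</html>'
)

print(contains_phrase(sample, "charset"))
print(contains_phrase(sample, "Azure Active Directory"))
```

With the real response, the call would be `contains_phrase(r.data, "a sentence copied from the story")`. If that returns False while the sentence is visible in the browser, the text is most likely loaded separately after the initial HTML.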

It uses an external API to fetch the data. You only need to make the following call:

    GET https://customers.microsoft.com/en-us/api/search?key=STORY_KEY

STORY_KEY is 767633-asos-retailer-azure-active-directory-m365, i.e. the text after the last slash in the URL. You can use a script like this:

    import requests

    url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"

    # The story key is the text after the last slash of the story URL.
    r = requests.get(
        "https://customers.microsoft.com/en-us/api/search",
        params={"key": url.rsplit('/', 1)[1]},
    )
    r.raise_for_status()  # stop early on an HTTP error status

    # The story fields live under "search_document" in the JSON response.
    document = r.json()["search_document"]

    summary = document["story_exec_summary"]
    body = document["story_body_text_2"]
    quote1 = document["story_quote_carousel"]
    quote2 = document["story_quote_carousel_2"]

    print(summary)
    print(body)
    print(quote1)
    print(quote2)

Note that you need to search the document for the data you are looking for (video, body3, etc.).

Comments:

Can you tell me what error you get? Because in my case I can see some content being returned.

@DeepBhatt, it doesn't bring back the actual story. This is a public Microsoft customer stories site. I get plenty of content, but not the story or the information on the left (the metadata about the story).

Thanks for the reply. I'm still not getting the story content. I think it only brings back the first paragraph.

The downloaded content matches the page source (right-click -> View Page Source), and the story evidently does not appear in the page source. So my thinking is that once you have the page source, you need to parse the page with a library such as "beautifulsoup" and identify the link from which the actual story content can be downloaded.

Thanks for the tip, Hussain, I was able to verify that too. My problem is how to get the actual story content. I will try the module you mentioned.

Wow!! I think it works great. May I ask how you found this API? Thanks a lot.
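The two manual steps in the API answer, taking the key from the story URL and searching the returned document for the fields you want, can be sketched as small helpers. This is only an illustration: `story_key`, `api_url`, and `find_fields` are hypothetical names, and `fake_document` is made-up stand-in data that reuses the field names from the answer, not a real API response.

```python
from urllib.parse import urlencode, urlsplit

# Endpoint quoted in the answer; everything else here is illustrative.
API_ENDPOINT = "https://customers.microsoft.com/en-us/api/search"


def story_key(story_url):
    """Return the text after the last slash of the story URL's path."""
    path = urlsplit(story_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def api_url(story_url):
    """Build the API URL for a story page URL."""
    return API_ENDPOINT + "?" + urlencode({"key": story_key(story_url)})


def find_fields(document, fragment):
    """Return the non-empty fields whose name contains `fragment`."""
    return {
        name: value
        for name, value in document.items()
        if fragment in name and value
    }


# Made-up stand-in for r.json()["search_document"].
fake_document = {
    "story_exec_summary": "summary text",
    "story_body_text_2": "body text",
    "story_quote_carousel": "",
    "story_quote_carousel_2": "quote text",
}

story_url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"
print(api_url(story_url))
print(sorted(find_fields(fake_document, "quote")))
```

Filtering by a name fragment such as "body" or "quote" avoids hard-coding every field, which may help when a story spreads its text over several numbered fields.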