Python: unable to get the full HTML content I want using requests.get and BeautifulSoup

python web-scraping

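I have just learned how to use the requests.get method to scrape data.

I want to get the complete HTML code, as shown in Chrome's developer tools.

However, for some reason, I can't.

I am using Python 3.x.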

import requests
from bs4 import BeautifulSoup

url_test = "http://zozo.jp/shop/ryuryu/goods/36213553/?did=62016020"
r = requests.get(url_test)  # no custom headers, so requests sends its default User-Agent
print(r.status_code)
html = r.content  # raw response body as bytes
soup = BeautifulSoup(html, 'html.parser')
print(soup)

The result I get is shown below:

r.status_code
200

print(soup)
    <!DOCTYPE html>

<html lang="ja">
<head>
<meta charset="utf-8"/>
<title>お知らせ - ZOZOTOWN</title>
<meta content="" name="description"/>
<meta content="ZOZO,ZOZOTOWN,ゾゾ,ゾゾタウン,ぞぞ,ぞぞたうん,ファッション通販,通販,通信販売,ec" name="keywords"/>
<meta content="noindex,nofollow,noydir,noodp" name="robots"/>
<meta content="width=device-width,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no" name="viewport"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="10;URL=http://zozo.jp/" http-equiv="refresh"/>
<link href="assets/favicon.ico" rel="shortcut icon"/>
<link href="assets/base.css" rel="stylesheet"/>
<style>
            .container { margin-bottom:40px; text-align:center; }
            .header-brand { margin-top:25px; margin-bottom:40px; }
            .header-brand-img { width:183px; }
            .text-body { margin-top:25px; margin-bottom:0; line-height:1.846153846; }
            .bow { margin-top:25px; margin-bottom:0; }
            .bow-img { margin-left:10px; width:108px; }
            .info { margin-top:15px; margin-bottom:0; font-size:10px; line-height:1.7; }
            .info-link { color:#27a301; text-decoration:underline; }

            @media (min-width:768px) {
                    .header-brand { margin-top:40px; margin-bottom:55px; }
                    .header-brand-img { width:206px; }
                    .text-body { font-size:16px; line-height:1.9375; }
                    .bow-img { width:147px; }
                    .info { font-size:14px; line-height:1.785714286; }
            }
    </style>
</head>
<body>
<div id="container">
<div class="container">
<h1 class="header-brand">
<img alt="ZOZOTOWN" class="header-brand-img" src="assets/header-brand-logo.png"/>
</h1>
<p class="text-body">
                    平素よりZOZOTOWNを<br/>
                    ご利用いただきありがとうございます。
            </p>
<p class="text-body">
                    現在、サイトが混み合っております。
            </p>
<p class="text-body">
</p>
<p class="text-body">
                    お客様にはご迷惑おかけいたしますが<br/>
                    しばらく時間を置いて再度アクセスして<br/>
                    いただきますようお願いいたします。
            </p>
<p class="text-body">
                    株式会社ZOZO
            </p>
<p class="bow">
<img alt="" class="bow-img" src="assets/bow-img.png"/>
</p>
<p class="info">
<a class="info-link" href="https://line.me/S/sticker/1675710" target="_blank">
                            ZOZOTOWN公式キャラクター 「箱猫マックス」<br/>
                            LINE スタンプ 販売中
                    </a>
</p>
</div>
</div>
</body>
</html>

The result above is different from what I expected.

Please visit the page to see the full HTML.

Please help.

Thanks, everyone.

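The User-Agent request header contains a characteristic string that allows network protocol peers to identify the application type, operating system, software vendor, or software version of the requesting software user agent. Validating the User-Agent header on the server side is a common operation, so be sure to use a valid browser User-Agent string to avoid being blocked.

(Source: )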
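To see why the server might treat the script differently from Chrome, it can help to look at the User-Agent that requests sends when none is set. The snippet below is a minimal sketch: it only prepares a request to the URL from the question and never sends it, so it needs no network access. The variable names are illustrative.

```python
import requests

# URL taken from the question; nothing is sent over the network here.
url = "http://zozo.jp/shop/ryuryu/goods/36213553/?did=62016020"

session = requests.Session()

# Prepare (but do not send) a GET request so its final headers can be inspected.
prepared = session.prepare_request(requests.Request("GET", url))

default_ua = prepared.headers["User-Agent"]
print(default_ua)  # something like "python-requests/<version>"
```

A value beginning with `python-requests/` is easy for a server to recognize as a script rather than a browser.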

You just need to set a legitimate User-Agent. To do that, add headers that emulate a browser:

# This is a standard user-agent of Chrome browser running on Windows 10 
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }

For example:

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent instead of the requests default.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('http://zozo.jp/shop/ryuryu/goods/36213553/?did=62016020', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
print(soup)
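A 200 status code alone does not show that the real page came back, since the notice page in the question was also served with status 200. One possible check, sketched below, is to look for the `<meta http-equiv="refresh">` tag that the notice page contains. The function name `is_notice_page` and the `normal_html` sample are hypothetical, and this is only a heuristic, because a genuine page could also use a meta refresh.

```python
from bs4 import BeautifulSoup


def is_notice_page(html):
    """Heuristic: True if the HTML contains a <meta http-equiv="refresh"> tag."""
    soup = BeautifulSoup(html, "html.parser")
    for meta in soup.find_all("meta"):
        if meta.get("http-equiv", "").lower() == "refresh":
            return True
    return False


# Trimmed-down copy of the notice page shown in the question.
notice_html = """
<html lang="ja"><head>
<title>お知らせ - ZOZOTOWN</title>
<meta content="10;URL=http://zozo.jp/" http-equiv="refresh"/>
</head><body></body></html>
"""

# Hypothetical stand-in for a normal page; not taken from the real site.
normal_html = "<html><head><title>Example item</title></head><body></body></html>"

print(is_notice_page(notice_html))  # True
print(is_notice_page(normal_html))  # False
```

In a real script the same function could be called on `resp` right after the request, before any further parsing.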

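In addition, you can add another set of headers so the request looks more like it comes from a legitimate browser. Add more headers, for example: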

headers = { 
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
    'Accept-Language' : 'en-US,en;q=0.5', 
    'Accept-Encoding' : 'gzip', 
    'DNT' : '1', # Do Not Track Request Header 
    'Connection' : 'close' 
}
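If several pages are fetched, the headers can be set once on a `requests.Session` instead of being passed to every call. The following is a minimal sketch under that assumption. It reuses the header values from this answer and only prepares a request to print the headers that would be sent, so it runs without network access. The commented-out `session.get` call and its `timeout` value are illustrative.

```python
import requests

# Same header set as in the answer above.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close',
}

session = requests.Session()
# Override the session defaults so every request made through it uses these headers.
session.headers.update(headers)

url = 'http://zozo.jp/shop/ryuryu/goods/36213553/?did=62016020'

# Prepare (but do not send) the request to inspect the headers that would go out.
prepared = session.prepare_request(requests.Request('GET', url))
for name, value in prepared.headers.items():
    print(f"{name}: {value}")

# To actually fetch the page (requires network access):
# resp = session.get(url, timeout=10)
```

A session also keeps cookies between requests, which some sites expect from a browser.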