Python: downloading a file served with Content-Type "text/html" and Content-Encoding "gzip"

I am trying to download a zipped file from a URL. I fetch it like this:

r = urllib.request.urlopen(url)
The content looks like this:

>>> r.content
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n<head>\n\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n\t<link href="https://fonts.googleapis.com/css?family=Barlow:400,400i,500,500i,600,600i,700,700i" rel="stylesheet">\n\t<title>Login - TestRail</title>\n\n\n\t\t\t<link type="text/css" rel="stylesheet" href="https://static.testrail.io/6.2.1.1003/css/auth-modern-combined.css" media="all" />\n\t\n\n<link rel="shortcut icon" href="https://static.testrail.io/6.2.1.1003/images/favicon.ico"/>\n\n\n<script type="text/javascript" src="https://static.testrail.io/6.2.1.1003/js/jquery.js"></script>\n</head>\n<body>\n\n    <script type="text/javascript">\n\t\t\t\t$(document).ready(function(){\n\t\t\t\t\t$(\'#name\').focus();\n\t\t\t\t});\n\t\t\t</script>\n<div id="form" class="loginpage-form">\n    <div class="logo loginpage-logo" >\n        <a href="http://www.gurock.com/testrail/" target="_blank" class="logo-loginpage"></a>\n    </div>\n    <div id="form-inner">\n        <h1 class="loginpage-installationname">TestRail QA</h1><style>\n    input:-webkit-autofill {\n        -webkit-box-shadow: 0 0 0px 1000px white inset;\n    }\n</style>\n<div id="content">\n    <h1 class="loginpage-login-text">Log into Your Account</h1>\n    <br/>\n                                                            <noscript>\n        <div class="loginpage-message-title-hint">\n            <div class="hint-alert"><img src="https://static.testrail.io/6.2.1.1003/images/theme-modern/layout/warning-icon.svg" align="left" height="18" width="16"/>\n                <span class="hint-on-top">Warning!</span></div>\n            <div class="error-text"> Javascript is disabled in your web browser. 
Please enable Javascript, as Javascript is required to use TestRail.</div>\n        </div>\n    </noscript>\n        \n    <form action="index.php?/auth/login/L3JlcG9ydHMvZ2V0X2h0bWwvMjc0LWQ3N2FlMzQyOGYzOTY2YTNkNWU0MTMxNTkxNmRlMTE3MjFlYTI3OGZmZmUwMDBhNzY1MTBjNzk0NmZjYWQ0NDU:" method="post" >\n    \n    \n                        <div style="min-height:24px;"></div>\n            \n    <div class="form-group"  style=\'padding-bottom:10px\';>\n        <div class=\'login-inputx\'>\n            <input id=\'name\' class="login-input " type=\'text\'\n                   name="name" id="name">\n\n                            <label for=\'name\' class="login-label">Email</label>\n                    </div>\n    </div>\n\n    \n    <div class="form-group" style=\'padding-bottom:10px; margin-top: -9px;\'\' >\n        <div class=\'login-inputx\'>\n            <input id=\'password\' class="login-input "\n                   type=\'password\' name="password" id="password" autocomplete=off>\n            <label for=\'password\' class="login-label">Password</label>\n        </div>\n    </div>\n    <div class=\'display-flex\' style=" margin-bottom:40px;">\n        <div style="float:left;">\n                    </div>\n                    <a href="index.php?/auth/forgot_password"\n               class="loginpage-forgotpassword" style="margin-bottom:10px;">\n                Forgot your password?            
</a>\n            </div>\n\n            <label class="loginpage-container">\n            Keep me logged in            <input type="checkbox" checked="checked" id="rememberme" name="rememberme"\n                   value="1" checked="checked"/>\n            <span class="loginpage-checkmark"></span>\n        </label>\n    \n        <button id=\'button_primary\' class="loginpage-button-sso-disable loginpage-button-sso-disable-hover  loginpage-button-sso-disable-active">\n        <span class="single-sign-on"> Log In</span>\n    </button>\n\n    </form>\n    </div>\n\t</div>\n<br/>\n<span class="loginpage-version">v6.2.1.1003</span>\n</div>\n\n\n\t\t\t<script type="text/javascript" src="https://static.testrail.io/6.2.1.1003/js/extensions-combined.js"></script>\n\t\t<script type="text/javascript" src="https://static.testrail.io/6.2.1.1003/js/application-combined.js"></script>\n\t\n<script type="text/javascript">\n$(document).ready(function()\n{\n\t\tApp.Translations.add(\n\t\t"timespans_hour_short",\n\t\t"h"\t);\n\t\tApp.Translations.add(\n\t\t"timespans_minute_short",\n\t\t"m"\t);\n\t\tApp.Translations.add(\n\t\t"timespans_second_short",\n\t\t"s"\t);\n\t});\n</script>\n\n\n</body>\n</html>\n    <script type="text/javascript">\n        var browser = function() {\n            // Return cached result if avalible, else get result then cache it.\n            if (browser.prototype._cachedResult)\n                return browser.prototype._cachedResult;\n        \n            // Opera 8.0+\n            var isOpera = (!!window.opr && !!opr.addons) || !!window.opera || navigator.userAgent.indexOf(\' OPR/\') >= 0;\n        \n            // Firefox 1.0+\n            var isFirefox = typeof InstallTrigger !== \'undefined\';\n        \n            // Safari 3.0+ "[object HTMLElementConstructor]"\n            var isSafari = /constructor/i.test(window.HTMLElement) || (function (p) { return p.toString() === "[object SafariRemoteNotification]"; })(!window[\'safari\'] || 
safari.pushNotification);\n        \n            // Internet Explorer 6-11\n            var isIE = /*@cc_on!@*/false || !!document.documentMode;\n        \n            // Edge 20+\n            var isEdge = !isIE && !!window.StyleMedia;\n        \n            // Chrome 1+\n            var isChrome = !!window.chrome && !!window.chrome.webstore;\n        \n            // Blink engine detection\n            var isBlink = (isChrome || isOpera) && !!window.CSS;\n        \n            return browser.prototype._cachedResult =\n                isOpera ? \'Opera\' :\n                isFirefox ? \'Firefox\' :\n                isSafari ? \'Safari\' :\n                isChrome ? \'Chrome\' :\n                isIE ? \'IE\' :\n                isEdge ? \'Edge\' :\n                isBlink ? \'Blink\' :\n                "Don\'t know";\n        };\n        \n        $(\'input[type=password]\').val(\'\');\n        if(browser() == \'Edge\'){\n            $("#password").removeAttr("autocomplete");\n        }\n        if(browser() != \'IE\' && browser() != \'Edge\'){\n            $("#password").attr("autocomplete","new-password");\n        }\n\n        $(\'.login-input\').on(\'blur change\',function () {\n\n            var $label, $this, $value;\n            $this = $(this);\n            $label = $this.siblings("login-label");\n            $value = $this.val();\n            $label.removeClass("label-active");\n            if ($value !== "") {\n                return $this.addClass("input-notempty");\n            } else {\n                return $this.removeClass("input-notempty");\n            }\n        });\n    </script>\n'
I also tried:

import wget 
wget.download(url)
Both return the same value as r.content above.

When I paste the link above into a browser, it downloads a zip file.

Because this page also requires authentication, I also tried the requests_html module, hoping to reuse an existing session, with the code below:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://taxiforsure.testrail.net/index.php?/reports/get_html/274")

But the content is still the same as r.content above.

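The data sits behind some kind of authentication scheme. Your browser is already authenticated (I assume), so the download works there. urllib and wget are not authenticated, so what they get back is the page asking you to log in.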
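One quick way to confirm this diagnosis is to inspect the downloaded bytes before saving them. The sketch below is only an illustration: the helper names `is_zip_payload` and `looks_like_login_page` are hypothetical, and the login check is a rough heuristic based on the `<title>Login - TestRail</title>` tag visible in the output above.

```python
import io
import zipfile


def is_zip_payload(data: bytes) -> bool:
    """Return True if the bytes form a readable zip archive."""
    return zipfile.is_zipfile(io.BytesIO(data))


def looks_like_login_page(data: bytes) -> bool:
    """Heuristic: an HTML document whose <title> starts with 'Login'."""
    head = data[:2048].lower()
    return b"<html" in head and b"<title>login" in head
```

If `looks_like_login_page(body)` is true for the bytes you downloaded, the request was not authenticated and the server answered with its login form instead of the file.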

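You need to read the TestRail documentation and find out whether there is an official way to access the instance programmatically (for example, an official API with API keys, or something similar). If there is, use that.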
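As a rough sketch of what authenticated access could look like with the standard library, the code below attaches an HTTP Basic `Authorization` header to the request. It assumes the server accepts Basic authentication with an email address and an API key for the URL in question, which you should verify against the TestRail documentation. The function names and the credentials in the usage note are placeholders, not part of any official client.

```python
import base64
import urllib.request


def build_authenticated_request(url: str, user: str, api_key: str) -> urllib.request.Request:
    """Build a GET request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{api_key}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


def download(url: str, user: str, api_key: str, destination: str) -> None:
    """Fetch the URL with credentials and write the raw bytes to a file."""
    request = build_authenticated_request(url, user, api_key)
    with urllib.request.urlopen(request) as response, open(destination, "wb") as out:
        out.write(response.read())
```

A call might look like `download(url, "you@example.com", "your-api-key", "report.zip")`. If the saved file is still the login page, the endpoint probably expects a logged-in session cookie rather than Basic authentication.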


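If not, you may need to emulate a browser from Python, either by hand (via urllib, cookie jars, and so on) or with a web-scraping framework such as Scrapy. I assume that accessing authentication-protected resources is a common problem in that field.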
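The manual route could be sketched roughly as follows, using only `urllib` and `http.cookiejar`. The form field names `name`, `password`, and `rememberme` are taken from the login form in the question's HTML. The login URL is something you would have to read from the form's `action` attribute, and the server may require extra hidden fields or tokens that this sketch does not handle. All function names here are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener():
    """Create an opener that stores and resends cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url: str, email: str, password: str) -> urllib.request.Request:
    """Build the POST request that submits the login form fields."""
    form = urllib.parse.urlencode(
        {"name": email, "password": password, "rememberme": "1"}
    ).encode("ascii")
    return urllib.request.Request(login_url, data=form, method="POST")


def login_and_fetch(login_url: str, file_url: str, email: str, password: str) -> bytes:
    """Log in first, then fetch the protected URL with the same cookies."""
    opener, _jar = make_cookie_opener()
    with opener.open(build_login_request(login_url, email, password)) as response:
        response.read()  # the session cookie is now stored in the jar
    with opener.open(file_url) as response:
        return response.read()
```

The key design point is that both requests go through the same opener, so the session cookie set by the login response is sent along with the download request.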

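I think you need to log in before downloading the file at that URL. In the browser, I assume your session is already logged in, but in Python you have to establish an authenticated session before downloading.

If no account is found, the API returns the URL to use for the download. I am using the URL returned by the API.

You can reproduce what your Python code sees by downloading the file in an incognito browser window. I assume you need to authenticate before the file can be downloaded.

This blog helped solve the problem.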