Trying to log in to a website using Python 3

python web-scraping

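I'm new to Python, so I'm still getting used to some of the libraries it offers. I'm currently trying to use urllib to fetch a website's HTML so that I can eventually scrape data from a table inside the account I want to log in to.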

import urllib.request

link = "https://websiteurl.com"  # placeholder; urlopen needs the scheme (http:// or https://)
login = "email@address.com"
password = "password"

# Access the website at the given address and return its HTML as raw bytes
def access_website(address):
    return urllib.request.urlopen(address).read()

html = access_website(link)
print(html)

This function returns:

b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    
<meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Festival Manager</title>\n   
 <link href="bundle.css" rel="stylesheet">\n    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n   
 <!-- WARNING: Respond.js doesn\'t work if you view the page via file:// -->\n    <!--[if lt IE 9]>\n      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>\n     
 <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>\n    <![endif]-->\n  </head>\n  <body>\n    
<script src="vendor.js"></script>\n    <script src="login.js"></script>\n  </body>\n</html>\n'

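The problem is that I don't understand why it gives me the part about the HTML5 shim and Respond.js. When I open the actual website and inspect the JavaScript, it doesn't look like this, so it seems urllib isn't returning the HTML I see when I visit the site in a browser.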

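Also, when I submitted my login details I tried to check what kind of request was sent, but no POST request showed up in the Network tab of Inspect Element. So I'm not even sure how to send the login information through a POST login request from Python.

Once you are logged in, you can scrape with BeautifulSoup or any other tool. The login itself can also be done without third-party libraries or modules.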

As per the StackOverflow guidelines, the whole script is reproduced below:

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak couple things maybe regarding "token" logic
    #   (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers={"Content-Type":"application/x-www-form-urlencoded",
    "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
    "Host":base_url,
    "Origin":https_base_url,
    "Referer":https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    #       (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for token eg. on my page it was something like this:
    #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
    #       this can probably be done better with regex and similar
    #       but I'm newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    #   that includes token we scraped earlier, as that's usually in hidden fields
    #   make sure left side is from "name" attributes of the form,
    #       and right side is what you want to post as "value"
    #   and for hidden fields make sure you replicate the expected answer,
    #       eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token':token,
    #    'name':'value',    # make sure this is the format of all additional fields !
        'login':username,
        'password':password
    }

    # now we prepare all we need for login
    #   data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')
    # and put the URL + encoded data + correct headers into our POST request
    #   btw, despite what I thought it is automatically treated as POST
    #   I guess because of byte encoded data field you don't need to say it like this:
    #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element in the page that's secure behind the login
    #   we use a particular string we know only occurs after login,
    #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
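
The script above notes that cutting the token out with `str.find` could probably be done better. One possible alternative, still using only the standard library, is `html.parser`. The sketch below is not part of the original answer: the class name `HiddenInputParser`, the helper `extract_hidden_fields`, and the sample form are hypothetical, and the field name `_token` is simply the one used in the script's comments. It collects every hidden input, so the token is picked up whatever its name turns out to be.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # Attribute values can be None (e.g. <input hidden>), so guard with "or".
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the given HTML string."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.hidden_fields


# Made-up login form, modelled on the token example in the script's comments.
sample_html = '''
<form action="/login" method="post">
  <input type="hidden" name="_token" value="random1234567890qwertzstring">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
'''

fields = extract_hidden_fields(sample_html)
print(fields)
```

The resulting dict could be merged into `payload` before adding the username and password. Unlike the `find`-based approach, a page without a token yields an empty dict instead of a silently wrong slice.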

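A short additional note about your original code: that is usually enough if there is no login page. But with modern logins you will usually have cookies, referer page checks, user-agent checks, and tokens, if not more (like CAPTCHAs). Websites don't like being scraped, and they fight it. It's also called good security.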

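So, in addition to making the request the way you originally did, you also have to:
- get the page's cookies and send them back when logging in
- know the page's referer; usually you can push the login page to t…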
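
The requirements above (cookies, referer, user agent) can be checked offline before any request is sent. The sketch below is a minimal illustration rather than the answer's own code: it reuses the placeholder host and credentials from the script, builds the login request, and inspects it without touching the network. It also shows that `urllib.request.Request` switches to POST by itself as soon as a `data` argument is supplied.

```python
import http.cookiejar
import urllib.parse
import urllib.request

base_url = "www.example.com"            # placeholder host, as in the script
https_base_url = "https://" + base_url

# The cookie jar stores cookies from responses and sends them back on later
# requests, as long as every request goes through this opener.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 Chrome/81.0.4044.92",
    "Referer": https_base_url,
}

# Placeholder credentials; the form field names are assumptions.
payload = {"login": "yourusername", "password": "SoMePassw0rd!"}
binary_data = urllib.parse.urlencode(payload).encode("utf-8")

request = urllib.request.Request(https_base_url + "/login", binary_data, headers)

print(request.get_method())           # POST, because data was supplied
print(request.get_header("Referer"))  # the referer that would be sent
print(binary_data)                    # the URL-encoded form body

# The actual login would then be sent with:
# response = opener.open(request)
```

Calling `opener.open(...)` for both the initial page fetch and the login POST keeps the session cookie in `cookie_jar` between the two requests, which is the "send them back" step from the list.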