Trying to log in to a website using Python 3


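I'm new to Python, so I'm still getting used to some of the different libraries it offers. I'm currently trying to use urllib to fetch a website's HTML, so that I can eventually scrape data from a table in the account I want to log in to.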

import urllib.request

link = "websiteurl.com"
login = "email@address.com"
password = "password"

#Access the website of the given address, returns back an HTML file
def access_website(address):
    return urllib.request.urlopen(address).read()

html = access_website(link)
print(html)
This function returns:

b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    
<meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Festival Manager</title>\n   
 <link href="bundle.css" rel="stylesheet">\n    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n   
 <!-- WARNING: Respond.js doesn\'t work if you view the page via file:// -->\n    <!--[if lt IE 9]>\n      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>\n     
 <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>\n    <![endif]-->\n  </head>\n  <body>\n    
<script src="vendor.js"></script>\n    <script src="login.js"></script>\n  </body>\n</html>\n'
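The problem is that I really don't know why it gives me the part about the HTML5 shim and Respond.js, because when I go to the actual website and inspect it, the markup doesn't look like this. So it doesn't seem to return the HTML I see when I actually visit the site.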

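Also, when I submit the login information I tried to check what kind of request is sent, but no POST request shows up in the Network tab of Inspect Element. So I'm not even sure how to send the login information through a POST login request from Python.

After logging in you can use BeautifulSoup or any other kind of scraping, and the login itself needs no third-party libraries or modules.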


As per the StackOverflow guidelines, the whole script is copied below:

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak couple things maybe regarding "token" logic
    #   (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers={"Content-Type":"application/x-www-form-urlencoded",
    "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
    "Host":base_url,
    "Origin":https_base_url,
    "Referer":https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    #       (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for token eg. on my page it was something like this:
    #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
    #       this can probably be done better with regex and similar
    #       but I'm newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    #   that includes token we scraped earlier, as that's usually in hidden fields
    #   make sure left side is from "name" attributes of the form,
    #       and right side is what you want to post as "value"
    #   and for hidden fields make sure you replicate the expected answer,
    #       eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token':token,
    #    'name':'value',    # make sure this is the format of all additional fields !
        'login':username,
        'password':password
    }

    # now we prepare all we need for login
    #   data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')
    # and put the URL + encoded data + correct headers into our POST request
    #   btw, despite what I thought it is automatically treated as POST
    #   I guess because of byte encoded data field you don't need to say it like this:
    #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element in the page that's secure behind the login
    #   we use a particular string we know only occurs after login,
    #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
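
The script's own comments note that the token lookup could probably be done better. One option that stays inside the standard library is `html.parser`. The sketch below assumes the token sits in a hidden `<input>` named `_token`, as in the example above; `HiddenInputParser` and `extract_hidden_fields` are hypothetical names introduced here and are not part of the original script.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.hidden_fields


# sample markup modelled on the hidden field shown in the script's comments
sample = (
    '<form action="/login">'
    '<input type="hidden" name="_token" value="random1234567890qwertzstring">'
    '<input type="text" name="login">'
    '</form>'
)
fields = extract_hidden_fields(sample)
print(fields.get("_token"))
```

Because this returns every hidden field, the resulting dict could be merged into `payload` directly. Unlike the `find`-based approach, a missing token shows up as an absent key rather than a wrong substring.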
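A short additional note about your original code: if there is no login page, that is usually enough. But with modern logins you usually have cookies, referer page checks, user-agent checks and tokens, if not more (like captchas). Websites don't like being scraped, and they fight it. It's also called good security.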
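
As a rough illustration of those extra pieces, the sketch below prepares an opener that keeps cookies between requests and sends a browser-like `User-agent` and a `Referer`, then builds a form POST. Nothing is sent over the network here. `build_session_opener` and `build_login_request` are hypothetical helper names, and the URL and field names are placeholders taken from the script above; a real site may expect different ones.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener(user_agent, referer):
    """Create an opener that stores cookies and sends fixed default headers."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # overrides the default "Python-urllib/x.y" User-agent on every request
    opener.addheaders = [("User-agent", user_agent), ("Referer", referer)]
    return opener, cookie_jar


def build_login_request(url, fields):
    """URL-encode the form fields; passing data makes the request a POST."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=data)


login_url = "https://www.example.com/login"  # placeholder, as in the script
opener, cookie_jar = build_session_opener("Mozilla/5.0", login_url)
request = build_login_request(
    login_url, {"login": "yourusername", "password": "SoMePassw0rd!"}
)
print(request.get_method())
```

Fetching the login page with `opener.open(login_url)` and then calling `opener.open(request)` would reuse the same cookie jar for both steps, which is what the full script achieves with `install_opener`.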

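So, in addition to making the request the way you originally did, you also have to:
- get the page's cookies and send them back when logging in
- know the page's referer; usually you can just push the login page to t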