Scraping a site with Python 3.6: I can't get past the login page


The site's HTML form code:

                <form class="m-t" role="form" method="POST" action="">

                <div class="form-group text-left">
                    <label for="username">Username:</label>
                    <input type="text" class="form-control" id="username" name="username" placeholder="" autocomplete="off" required />
                </div>
                <div class="form-group text-left">
                    <label for="password">Password:</label>
                    <input type="password" class="form-control" id="pass" name="pass" placeholder="" autocomplete="off" required />
                </div>

                <input type="hidden" name="token" value="/bGbw4NKFT+Yk11t1bgXYg48G68oUeXcb9N4rQ6cEzE=">
                <button type="submit" name="submit" class="btn btn-primary block full-width m-b">Login</button>

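Try as I might, I cannot get anything back other than the HTML of the login page. I don't know where to look next, or why this is happening.

url = the variable holding the login page URL, url3 = the page I want to scrape

Any help would be greatly appreciated.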
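One detail worth noting in the form above: it carries a hidden `token` field, which a browser submits together with `username` and `pass`. The following is a minimal sketch, assuming the server checks that token and that it changes on each page load (neither is confirmed here). It fetches the login page, copies every hidden input into the POST data, and reuses one session so cookies persist. The function names are hypothetical, and `requests` is a third-party package.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form inputs found in the HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden


def build_login_payload(html, username, password):
    """Combine the hidden fields with the credentials, using the form's field names."""
    payload = extract_hidden_fields(html)
    # "username" and "pass" are the name attributes of the inputs in the form.
    # "submit" is the button's name; a browser sends it with an empty value.
    payload.update({"username": username, "pass": password, "submit": ""})
    return payload


def login_and_fetch(login_url, target_url, username, password):
    """Log in with a fresh token, then fetch the target page in the same session."""
    import requests  # third-party; imported here so the helpers above need only the stdlib

    with requests.Session() as session:
        login_page = session.get(login_url)  # sets cookies, returns a fresh token
        payload = build_login_payload(login_page.text, username, password)
        # action="" in the form means the POST goes back to the login URL itself.
        session.post(login_url, data=payload)
        return session.get(target_url)
```

If the response from `login_and_fetch(url, url3, username, password)` is still the login page, comparing the payload with what the browser actually posts is a reasonable next step.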

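Have you tried

First try it in the browser, observe which headers are required, and send those headers in your request. Headers are an important part of identifying the user or client.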
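As a rough illustration of that advice, the sketch below sets browser-like headers on a session so that every request carries them. All header values are placeholders to be replaced with what your own browser sends, and the helper names are hypothetical. Note that `Content-Type: text/html; charset=utf-8` describes the server's response; for a form POST, `requests` sets the request `Content-Type` itself when `data=` is a dict.

```python
def browser_like_headers(referer):
    """Headers modelled on a browser session; every value is a placeholder."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) "
                      "Gecko/20100101 Firefox/57.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": referer,
    }


def apply_headers(session, referer):
    """Make every request sent through `session` carry the browser-like headers.

    `session` is expected to be a requests.Session (anything with a
    dict-like `headers` attribute works).
    """
    session.headers.update(browser_like_headers(referer))
    return session
```

With `requests` installed, usage would look like `s = apply_headers(requests.Session(), url)` followed by `s.get(url)` and `s.post(url, data=...)`.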

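Try from a different IP; someone may be watching the IP the requests come from.

Try this example. Here I use Selenium with the Chrome driver. First I get the cookies from Selenium and save them to a file for future use, then I use the saved cookies to request the page that requires login.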

from selenium import webdriver
import os
import demjson
import requests

# Placeholders (not given in the original answer): fill these in for your site.
url1 = "..."  # login page URL
url3 = "..."  # page to scrape once logged in
username = "..."
password = "..."

# Download chromedriver, put it somewhere accessible, and set the path.
# URL to download chromedriver - https://chromedriver.storage.googleapis.com/index.html?path=2.27/
chrompathforselenium = "/path/chromedriver"

os.environ["webdriver.chrome.driver"] = chrompathforselenium
driver = webdriver.Chrome(executable_path=chrompathforselenium)
driver.set_window_size(1120, 550)

driver.get(url1)

driver.find_element_by_name("username").send_keys(username)
driver.find_element_by_name("pass").send_keys(password)

# You need to work out how to locate the button on your page (for example by
# its class or name attribute); here it is located by ID.
driver.find_element_by_id("btnid").click()

# Set a writable cookie path here.
cookiepath = ""

# Save the browser cookies to a file for future use.
cookies = driver.get_cookies()
with open(cookiepath, "w+") as getCookies:
    getCookies.write(demjson.encode(cookies))

# Read the saved cookies back.
with open(cookiepath, "r") as readCookie:
    cookieString = readCookie.read()
# Selenium returns a list of cookie dicts; requests expects a name -> value mapping.
cookie = {c["name"]: c["value"] for c in demjson.decode(cookieString)}

headers = {}
# Write all the headers here.
headers.update({"key": "value"})

response = requests.get(url3, headers=headers, cookies=cookie)
# Check your response.
# check your response

This is the code that finally worked:

import os

import requests
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.FIREFOX.copy()
os.chdir('C:\\...')  # chdir to the dir with geckodriver.exe in it
driver = webdriver.Firefox(capabilities=capabilities, firefox_binary='C:\\Program Files\\Mozilla Firefox\\firefox.exe')
username = '...'
password = '...'
url = 'https://.../login.php'  # login url
url2 = '...'  # 1st page you want to scrape

# Log in through the real browser.
driver.get(url)
driver.find_element_by_name("usr").send_keys(username)
driver.find_element_by_name("pwd").send_keys(password)

driver.find_element_by_name("btn_id").click()

# Copy the browser's cookies into a requests session.
s = requests.session()
for cookie in driver.get_cookies():
    c = {cookie['name']: cookie['value']}
    s.cookies.update(c)

# Fetch the protected page with the logged-in session.
response = s.get(url2)
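The loop above copies only each cookie's name and value. If the site sets cookies for several domains or paths, it may help to keep those attributes as well. A small sketch of that variation, assuming `session` is a `requests.Session` and the input is the list returned by `driver.get_cookies()`; `copy_cookies` is a hypothetical helper name.

```python
def copy_cookies(selenium_cookies, session):
    """Copy Selenium cookie dicts into a requests session, keeping domain and path."""
    for cookie in selenium_cookies:
        # RequestsCookieJar.set() accepts cookie attributes as keyword arguments.
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session
```

It would replace the loop as `s = copy_cookies(driver.get_cookies(), requests.session())`.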

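You may want to use Fiddler to capture all traffic while logging in and work out what is happening behind the scenes, then simulate that process as in the first example. Use 127.0.0.1:8888 as the proxy for debugging, and compare your request with the real login request until you get the correct response from the server.

Thanks for the response, Shane. I've never come across Fiddler before; can you provide a link? Is it a Python module or a separate program? If I'm not mistaken, would it be this? I don't have admin rights on this work machine, so I'll need to come back to this when I get home.

Yes, that's the one!

Thanks for the reply, Bonnie. I now have Live HTTP Headers. How do I know which parts to include in the headers? From what I see here: I would add Content-Type: text/html; charset=utf-8, is that correct? I doubt it's an IP problem, since I'm working on a company website from an internal machine.

It depends on who you are dealing with and what kind of request you are making. A site such as a bank will monitor everything: User-Agent, Content-Type, Referer. If it is some kind of API, an authorization parameter may be required. So it depends on what they want. Try sending every header you get from the browser.

It's probably worth mentioning at this stage that I have access to the person who built the site. They have been learning PHP and built it themselves. So if there are questions I can ask rather than testing by trial and error, that's an option. I've tried discussing what I'm trying to achieve with them, but they couldn't help because they have no experience with web scraping.

Can you post here the headers you get from the browser and the login page URL? Make sure your request is correct. After the first URL (i.e. url = the variable holding the login page URL), is any other URL required? If so, you must call that URL as well, because many sites generate tokens and nonces on the front end after login and send them back for verification.

I went to the page and logged in as I normally would from Firefox, and this is what I got: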