Python web scraping with BeautifulSoup: am I posting to the correct URL?


I'm a beginner. So far I have the following code:

import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent":"Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"lxml")

    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'username',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }

    r = s.post(posturl, data=values)
    print(r.content)

logurl = the URL where the login takes place
posturl = the form action URL that the login data is posted to

However, when I try this, the content returned is the "incorrect password" page, even though the credentials I enter are correct.

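When I log in manually to view the page that contains the data I need, I notice that the URL is actually the Location URL listed below (taken from the Network tab of the Chrome developer tools, see the image below). It includes the flow_id and instance values from my code: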

Location: https://login.flash.co.za/apex/f?p=1500:1:9004571425464

Request URL: https://login.flash.co.za/apex/wwv_flow.accept

Referer: https://login.flash.co.za/apex/f?p=pwfone:login

Should I be trying to POST to this Location URL instead of the Request URL?

Edit 1:

import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {
            "Host": "login.flash.co.za",
            "Connection": "keep-alive",
            "Origin": "https://login.flash.co.za",
            "Upgrade-Insecure-Requests": "1",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT x.y; rv:10.0) Gecko/20100101 Firefox/10.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Referer": "https://login.flash.co.za/apex/f?p=pwfone:login",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9",
    }
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"html.parser")

    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'solar',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }

    r = s.post(posturl, data=values)
    print(r.content)

Fiddler

The URL you are posting to is correct. Just set the following headers and try logging in again:

headers = {
            "Host": "login.flash.co.za",
            "Connection": "keep-alive",
            "Origin": "https://login.flash.co.za",
            "Upgrade-Insecure-Requests": "1",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT x.y; rv:10.0) Gecko/20100101 Firefox/10.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Referer": "https://login.flash.co.za/apex/f?p=pwfone:login",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9",
}
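
To illustrate why the Location URL in the browser differs from the Request URL, here is a minimal offline sketch. It assumes the login follows the common pattern of answering the POST with a redirect; the toy server, its /accept and /home paths, and the page text are made up for the demo and are not the real site. The client still posts to the form action URL and is then sent on to the URL named in the Location header.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a login form (hypothetical paths /accept and /home)."""

    def do_POST(self):
        # Consume the posted form body, then redirect like a login page would.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(302)
        self.send_header("Location", "/home")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # The page the client lands on after following the redirect.
        body = b"logged in"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence the default request logging to keep the demo output clean.
        pass


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), LoginHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

# POST to the form action URL; urllib follows the 302 with a GET.
with urllib.request.urlopen(base + "/accept", data=b"p_request=LOGIN") as resp:
    final_url = resp.geturl()
    page = resp.read()

server.shutdown()
server.server_close()

print(final_url)  # ends with /home, not /accept
```

requests also follows redirects by default, so in the code from the question r.url and r.history can be inspected after s.post(...) to see where the server sent the session.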

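The value of 'p_arg_names' is the same twice. It should be two different values. Try passing it as a list like this (completely untested code, because I don't have a username or password):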


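import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent":"Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"lxml")

    # Collect every p_arg_names value on the page, not just the first one
    arg_names = []
    for name in soup.select("[name='p_arg_names']"):
        arg_names.append(name['value'])

    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_t01': 'username',
        'p_arg_names': arg_names,  # a list is sent as one form field per item
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    # Send the login page as the Referer of the POST
    s.headers.update({'Referer': logurl})
    r = s.post(posturl, data=values)
    print(r.content)

Comments:

Have you tried setting the Referer before posting? s.headers.update({'referer': logurl})

Could you add it at the right place in my code so I can try it? Or would Edit 1 do the same thing?

Edit 1 would be fine. Putting it in with all the headers won't do any harm, but you probably only need it before the POST, if that is what is breaking it.

See Edit 1. Is that what you wanted me to change? I still get the "incorrect password" message…

Thanks @Dan Dev. Do you agree with the comment on my next part that it would be better to use Selenium? https://stackoverflow.com/questions/50912466/python-web-scraping-with-requests-after-login

It depends. If you can scrape the URLs you want without Selenium, that is better. Selenium is a bit clunky and overkill in many cases. If you can extract the links you need without it, that will be easier and better.

Without a login for this site, I'm afraid you won't get much help.

If I give you a login, could you help me? There is nothing that can be done with it other than viewing a few rows of data after logging in (I hope).

There is another way. You could post what print(r.content) shows and work from that.

The first thing is to extract the URL of the "Statement" link in the first row, then make a request to that URL, then make a POST request to simulate clicking the "Run Statement" button.

When you say the URL of the "Statement" in the first row, do you mean the URL of that page (after login), or the href of the "Statement" button?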
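
The point about 'p_arg_names' appearing twice can be checked without touching the site. The sketch below uses only the standard library; FIRST_ID and SECOND_ID are placeholder values invented for the demo, not the real hidden-field values. It shows that a dict literal silently keeps only the last value for a repeated key, and that a list value or a list of pairs is encoded as repeated form fields.

```python
from urllib.parse import urlencode

# A dict literal keeps only the LAST value for a repeated key, so the
# second 'p_arg_names' silently replaces the first one.
collapsed = {
    'p_arg_names': 'FIRST_ID',   # placeholder value (assumption)
    'p_t01': 'username',
    'p_arg_names': 'SECOND_ID',  # placeholder value (assumption)
    'p_t02': 'password',
}
print(len(collapsed))  # 3, not 4

# A list value is expanded into one form field per item.
as_list = {
    'p_arg_names': ['FIRST_ID', 'SECOND_ID'],
    'p_t01': 'username',
    'p_t02': 'password',
}
encoded_list = urlencode(as_list, doseq=True)
print(encoded_list)

# A list of (name, value) pairs also keeps the fields in form order,
# in case the server pairs each p_arg_names with the field after it.
as_pairs = [
    ('p_arg_names', 'FIRST_ID'),
    ('p_t01', 'username'),
    ('p_arg_names', 'SECOND_ID'),
    ('p_t02', 'password'),
]
encoded_pairs = urlencode(as_pairs)
print(encoded_pairs)
```

requests expands a list value in data= into repeated fields in the same way, and it also accepts a list of tuples for data=. Whether the order of the fields matters to this particular server is not known from the thread, so the list-of-pairs form is only worth trying if the list form does not work.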