Python web scraping with BS: am I using the correct URL?
I'm a beginner at this. So far, I have the following code:
import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent":"Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"lxml")
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'username',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    r = s.post(posturl, data=values)
    print(r.content)
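logurl = the URL of the login page
posturl = the form action URL that the login data is posted to
However, when I try this, the content that comes back is the "incorrect password" page, even though the credentials I enter are correct.
When I log in manually to reach the page that holds the data I need, I noticed that the URL is actually the Location URL listed below (taken from the Network tab of the Chrome tools, see the image below), which includes the flow_id and instance values from the code:
Location: https://login.flash.co.za/apex/f?p=1500:1:9004571425464
Request URL: https://login.flash.co.za/apex/wwv_flow.accept
Referer: https://login.flash.co.za/apex/f?p=pwfone:login
Should I be trying to POST to this URL instead of the Request URL?
Edit 1: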
import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {
        "Host": "login.flash.co.za",
        "Connection": "keep-alive",
        "Origin": "https://login.flash.co.za",
        "Upgrade-Insecure-Requests": "1",
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Windows NT x.y; rv:10.0) Gecko/20100101 Firefox/10.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer": "https://login.flash.co.za/apex/f?p=pwfone:login",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9",
    }
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"html.parser")
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'solar',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    r = s.post(posturl, data=values)
    print(r.content)
Judging by the traffic in Fiddler, the URL you are posting to is correct. Just set the following headers and try logging in again:
headers = {
    "Host": "login.flash.co.za",
    "Connection": "keep-alive",
    "Origin": "https://login.flash.co.za",
    "Upgrade-Insecure-Requests": "1",
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Windows NT x.y; rv:10.0) Gecko/20100101 Firefox/10.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "https://login.flash.co.za/apex/f?p=pwfone:login",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
}
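On the question of which URL to post to: the Location value quoted in the question looks like the address the server sends the browser to after a successful login, not a form target. The sketch below splits that URL apart, assuming the site follows the usual Oracle APEX layout f?p=App:Page:Session. The helper name parse_apex_url is made up for this illustration. If that assumption holds, the app part should line up with p_flow_id and the session part with p_instance, which matches what the asker observed.

```python
from urllib.parse import urlsplit, parse_qs


def parse_apex_url(url):
    """Split an APEX-style 'f?p=App:Page:Session' URL into its first three parts.

    Hypothetical helper for illustration; assumes the colon-separated
    layout described above.
    """
    query = parse_qs(urlsplit(url).query)
    # The 'p' parameter carries colon-separated fields.
    parts = query["p"][0].split(":")
    # Pad short values (such as 'pwfone:login') so three fields always exist.
    parts += [""] * (3 - len(parts))
    app, page, session = parts[:3]
    return {"app": app, "page": page, "session": session}


# The Location URL quoted in the question.
location = "https://login.flash.co.za/apex/f?p=1500:1:9004571425464"
info = parse_apex_url(location)
print(info)
```

Since the session part only exists once the server has accepted the login, posting to that URL directly is unlikely to help; the POST still goes to posturl.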
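The value of "p_arg_names" is the same twice, when it should be two different values. select_one returns only the first match, and because a dict can hold a key only once, the second entry overwrites the first, so only one p_arg_names field is posted. Try passing it as a list, as in the code further down (completely untested code, because I don't have a username or password):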
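Have you tried setting the Referer before posting? s.headers.update({'referer': logurl})
If you could add that at the right place in my code, I can try it. Or would Edit 1 do the same thing?
Edit 1 would be fine. Putting it in all the headers won't do any harm, but you probably only need it just before the POST, if that is what is breaking it.
Please see Edit 1. Is that what you wanted changed? I still get the "incorrect password" message...
Thanks @Dan Dev. Do you agree with the comment on my follow-up question that using Selenium would be better? https://stackoverflow.com/questions/50912466/python-web-scraping-with-requests-after-login
That depends. If you can scrape the URLs you want without Selenium, that is better. Selenium is a bit clunky and overkill in a lot of cases. If you can extract the links you need without it, that will be easier and better. Without a login for this site, I'm afraid you won't get much help.
If I give you a login, could you help me? There is nothing that can be done with it other than viewing a few rows of data after logging in (I hope).
There is another way. You could post what print(r.content) shows and take it from there. The first thing is to extract the URL for "Statement" in the first row, then make a request to that URL, and then make a POST request to simulate clicking the "Run Statement" button.
When you say the URL for "Statement" in the first row, do you mean the URL of that page (after login), or the href of the "Statement" button?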
import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    # Fetch the login page to pick up the hidden form fields.
    res = s.get(logurl)
    soup = BeautifulSoup(res.text, "lxml")
    # Collect every p_arg_names value on the page, not just the first.
    arg_names = []
    for name in soup.select("[name='p_arg_names']"):
        arg_names.append(name['value'])
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_t01': 'username',
        'p_arg_names': arg_names,  # a list is sent as repeated form fields
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    # Set the Referer to the login page before posting.
    s.headers.update({'Referer': logurl})
    r = s.post(posturl, data=values)
    print(r.content)
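To see why the list matters, here is a small offline sketch. It uses urlencode from the standard library with doseq=True as a stand-in for the form encoding, so it runs without network access; requests treats a list value passed through data= in a comparable way, sending the field once per item. The values "111" and "222" are placeholders for this illustration, not the real hidden-field values from the site.

```python
from urllib.parse import urlencode

# Placeholder values standing in for the two hidden p_arg_names inputs.
first, second = "111", "222"

# A dict literal keeps only the last value for a repeated key.
collapsed = {"p_arg_names": first, "p_t01": "username", "p_arg_names": second}

# A list value keeps both entries under a single key.
as_list = {"p_arg_names": [first, second], "p_t01": "username"}

# doseq=True expands each list item into its own name=value pair.
body_collapsed = urlencode(collapsed, doseq=True)
body_list = urlencode(as_list, doseq=True)

print(body_collapsed)
print(body_list)
```

Note that with a list the two p_arg_names fields end up next to each other in the body. If the server turns out to care about field order, a list of (name, value) tuples is another form that requests accepts for data=, and it keeps the fields in exactly the order given.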