Python: How do I parse the HTML of a webpage that requires login?
I am trying to parse the HTML of a webpage that requires being logged in. I can get the HTML of a webpage using the following script:
from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage = urlopen('https://www.example.com')
soup = BeautifulSoup(webpage, 'html.parser')
print(soup)  # prints the source of example.com
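However, getting the source of a webpage that I am logged into is proving much harder.
I tried replacing ('https://www.example.com') with ('https://user:pass@example.com'), but I got an Invalid URL error.
Does anyone know how I could do this?
Thanks in advance.

You could try sending a POST request (with your login credentials) to the login form, then save the cookie you receive and send it along when you download the page that requires login.

I suggest using Mechanize. In mechanize you can set up a browser object that handles cookies and so on for you. You can iterate over the forms and links, e.g.: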
# browser is a mechanize.Browser() that has already opened the login page
for form in browser.forms():
    print(form)
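You can then select the form you need and fill it in as required.

Selenium WebDriver may suit your needs. You can log in to the page and then print the HTML contents. Here is an example: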
from selenium import webdriver
from selenium.webdriver.common.by import By

# initiate
driver = webdriver.Firefox()  # initiate a driver, in this case Firefox
driver.get("http://example.com")  # go to the url

# locate the login form
username_field = driver.find_element(By.NAME, ...)  # get the username field
password_field = driver.find_element(By.NAME, ...)  # get the password field

# log in
username_field.send_keys("username")  # enter your username
password_field.send_keys("password")  # enter your password
password_field.submit()  # submit it

# print HTML
html = driver.page_source
print(html)
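
The POST-and-cookie suggestion made earlier in this thread can be sketched with only the standard library. This is a minimal sketch under stated assumptions, not a recipe for any particular site: the helper names (make_cookie_opener, build_login_request, fetch_protected_page) are made up for illustration, and the form field names "username" and "password" are hypothetical placeholders for whatever the real login form uses. Sites that add hidden tokens to the form or log in through JavaScript will likely need more than this.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener():
    """Return an opener that remembers cookies, plus the jar it stores them in."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password,
                        user_field="username", pass_field="password"):
    """Build the POST request that submits the login form.

    user_field and pass_field are hypothetical defaults; use the name
    attributes of the <input> elements in the site's real login form.
    """
    form_data = urllib.parse.urlencode(
        {user_field: username, pass_field: password}
    ).encode("utf-8")
    # Supplying data= makes urllib send a POST instead of a GET.
    return urllib.request.Request(login_url, data=form_data)


def fetch_protected_page(login_url, page_url, username, password):
    """Log in, then download page_url with the cookies set during login."""
    opener, _jar = make_cookie_opener()
    # The cookie processor captures any Set-Cookie headers from this response.
    opener.open(build_login_request(login_url, username, password)).close()
    # The same opener sends those cookies back on the next request.
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

The string returned by fetch_protected_page can then be handed to BeautifulSoup in the same way as the output of urlopen in the question.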

We can do this with the selenium module, as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By

# initiate
my_browser = webdriver.Firefox()
my_browser.get("fill with url of the login page")
try:
    my_browser.implicitly_wait(35)
    # the lookup value is the name attribute of the field in the page source
    username_field = my_browser.find_element(By.NAME, 'enter the value of the name attribute')
    password_field = my_browser.find_element(By.NAME, 'enter the value of the name attribute')
    username_field.send_keys("fill with User_name")
    password_field.send_keys("fill with password")
    password_field.submit()  # submit it
finally:
    print('Look Into the Browser')
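
Both Selenium examples need the value of each field's name attribute from the login page's source. One possible way to list those values, sketched here with the standard library's html.parser, is to collect every named input element. The class InputNameCollector, the function list_input_names, and the sample form are all hypothetical and only illustrate the idea; a real login page may nest its fields differently or build them with JavaScript.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect (name, type) pairs for every <input> that has a name attribute."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name"):
            # Inputs without an explicit type default to "text" in HTML.
            self.inputs.append((attributes["name"], attributes.get("type", "text")))


def list_input_names(html):
    """Return the named <input> fields found in an HTML string."""
    collector = InputNameCollector()
    collector.feed(html)
    return collector.inputs


# A made-up login form, used only to demonstrate the function.
sample_html = """
<form action="/login" method="post">
  <input type="text" name="user_login">
  <input type="password" name="user_pass">
  <input type="submit" value="Log in">
</form>
"""
print(list_input_names(sample_html))
```

For the sample form this prints the two named fields, user_login and user_pass; the submit button has no name attribute and is skipped.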
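
Unless you tell us how the site in question requires you to authenticate, it is hard to help you. If it uses HTTP Basic authentication, you only need to add an HTTP header to your request, but if it involves filling in a form and a CAPTCHA, that is an entirely different matter.
Try mechanize, but you need to know how the login works.
This is one way to do it, but it really depends on how the site requires you to authenticate.
@AndréCaron: This generally works for any website with a user interface, though, except in the special case where a CAPTCHA is required. In that case you do not have many options (and the site owner probably does not want you scraping the site, so there may be other obstacles).
This gives me an error @David542 => ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host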
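
For the HTTP Basic authentication case raised in the comments, the header can be built by hand instead of embedding user:pass in the URL as the question attempted. The sketch below assumes the site really does use Basic authentication, which many sites with a login form do not; the function names build_basic_auth_request and fetch_with_basic_auth are made up for illustration.

```python
import base64
import urllib.request


def build_basic_auth_request(url, username, password):
    """Return a Request carrying an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    # Basic auth sends "username:password" base64-encoded, not encrypted.
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_with_basic_auth(url, username, password):
    """Download url, authenticating with HTTP Basic authentication."""
    request = build_basic_auth_request(url, username, password)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Because the credentials are only encoded, this should be used over HTTPS. If the server answers with a login page instead of the protected content, the site is probably using form-based login and this approach does not apply.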