Python: How do I parse the HTML of a web page that requires login?

I'm trying to parse the HTML of a web page that requires login. I can get the HTML of a web page with the following script:

from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage = urlopen('https://www.example.com')
soup = BeautifulSoup(webpage, 'html.parser')
print(soup)
# This prints the source of example.com
However, getting the source of a web page that I am logged in to is much harder. I tried replacing ("https://www.example.com") with ("https://user:pass@com"), but I got an invalid URL error.

Does anyone know how I can do this?
Thanks in advance.

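You could send a POST request (with your login credentials) to the login form, save the cookies you receive, and then supply them when you download the page that requires login.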
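A minimal sketch of that approach, using only the Python 3 standard library. The function name `fetch_after_login`, the URLs and the form field names in the usage comment are assumptions for illustration; the real field names come from the site's login form, and sites that use CSRF tokens or CAPTCHAs will need more than this.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def fetch_after_login(login_url, page_url, form_fields):
    """POST form_fields to login_url, then GET page_url with the same cookies.

    Hypothetical helper: assumes the site logs you in with a plain form POST
    and tracks the session with cookies.
    """
    # The cookie jar stores cookies from responses; the opener built around it
    # sends them back automatically on later requests.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # Passing data= makes urllib send a POST with a form-encoded body.
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    with opener.open(login_url, data=body, timeout=30) as response:
        response.read()

    # Same opener, so the session cookie set during login is included here.
    with opener.open(page_url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (placeholder URLs and field names, adjust to the real site):
# html = fetch_after_login(
#     "https://www.example.com/login",
#     "https://www.example.com/members",
#     {"username": "user", "password": "pass"},
# )
```

The returned string can then be passed to BeautifulSoup as in the question.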

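I suggest using mechanize.

In mechanize you can set up a browser object that handles cookies and so on for you.

You can iterate over the forms and links, e.g.:

for form in browser.forms():
    print(form)
You can then select the form you want and fill it in as needed.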
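To fill in a form you first need to know which fields it has. As a rough illustration of what "iterating over the forms" gives you, here is a dependency-free sketch that lists each form's action, method and field names. It uses Python's standard `html.parser` rather than mechanize, and `FormLister` and `list_forms` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the action, method and named fields of every <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea"):
            # Only named fields inside a form are submitted with it.
            if self._current is not None and attrs.get("name"):
                self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list of dicts describing the forms found in html."""
    parser = FormLister()
    parser.feed(html)
    parser.close()
    return parser.forms
```

Running `list_forms` on the login page's HTML shows which field names the login request has to carry.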

Selenium WebDriver may suit your needs. You can log in to the page and then print its HTML. Here is an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

# initiate
driver = webdriver.Firefox() # initiate a driver, in this case Firefox
driver.get("http://example.com") # go to the url

# locate the login form
username_field = driver.find_element(By.NAME, ...) # get the username field
password_field = driver.find_element(By.NAME, ...) # get the password field

# log in
username_field.send_keys("username") # enter in your username
password_field.send_keys("password") # enter in your password
password_field.submit() # submit it

# print HTML
html = driver.page_source
print(html)

We can do this with the selenium module, as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By

# initiate
my_browser = webdriver.Firefox()
my_browser.get("fill with url of the login page")
try:
    my_browser.implicitly_wait(35)  # wait up to 35 seconds when locating elements
    # pass the value of each field's name attribute, as found in the page source
    username_field = my_browser.find_element(By.NAME, 'enter the value of the name attribute')
    password_field = my_browser.find_element(By.NAME, 'enter the value of the name attribute')
    username_field.send_keys("fill with User_name")
    password_field.send_keys("fill_with password")
    password_field.submit()  # submit it
finally:
    print('Look Into the Browser')

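Unless you tell us how the site in question requires you to authenticate, it is hard to help. If it uses HTTP basic authentication, just add an HTTP header to your request, but if you have to fill in a form and solve a CAPTCHA, that is a completely different matter.

Try mechanize, but you need to know how the login works.

That is one way to do it, but it really depends on how the site requires you to authenticate.

@AndréCaron: However, this generally works for any website with a user interface, except in the special case where a CAPTCHA is required. In that case you don't have many options (and the site owner probably doesn't want you scraping the site, so there may be other obstacles).

This gives me an error @David542 => ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host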
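For the HTTP basic authentication case mentioned in the comments, the header can be built by hand. This is a minimal sketch with the Python 3 standard library; `basic_auth_request` is a hypothetical helper name, and the URL and credentials are placeholders. It only helps if the site really uses basic authentication rather than a login form.

```python
import base64
import urllib.request


def basic_auth_request(url, username, password):
    """Build a Request that carries an HTTP Basic Authorization header."""
    # Basic auth is "username:password", base64-encoded, after the word "Basic".
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


# Example (placeholder URL and credentials):
# request = basic_auth_request("https://www.example.com", "user", "pass")
# html = urllib.request.urlopen(request, timeout=30).read()
```

This keeps the credentials out of the URL itself and sends them in a request header instead.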