How do I scrape a website that requires login using Python and BeautifulSoup?


If I want to scrape a website that requires logging in with a password first, how can I start scraping it with Python using the beautifulsoup4 library? Below is what I do for websites that do not require a login.

from bs4 import BeautifulSoup
import urllib2  # Python 2; urllib.request in Python 3

url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
How do I change the code to accommodate logging in? Assume the website I want to scrape is a forum that requires login.

You can use mechanize, for example:

import mechanize
from bs4 import BeautifulSoup
import urllib2
import cookielib  # http.cookiejar in Python 3

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

# select the first form on the page and fill in the credentials
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()

print br.response().read()  # Python 2 print statement
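
To go from the final print above to actual parsing, hand the response body to BeautifulSoup; a minimal continuation of the mechanize example (the parser choice here is ours):

# read the logged-in page once and parse it instead of printing the raw HTML
html = br.response().read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)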

Or with urllib and a cookie jar from the standard library.
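
A rough sketch of that route, using the Python 3 module names (urllib2/cookielib in Python 2); the URLs and form field names here are placeholders for whatever the target site actually uses, and a fuller standard-library walkthrough appears in a later answer:

import urllib.parse
import urllib.request
import http.cookiejar

from bs4 import BeautifulSoup

# a cookie-aware opener keeps the session cookie that the login sets
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# POST the credentials; 'username' and 'password' are assumed field names
payload = urllib.parse.urlencode({'username': 'username', 'password': 'password'})
opener.open("https://example.com/login", payload.encode('utf-8'))

# further requests through the same opener carry the session cookie
html = opener.open("https://example.com/forum").read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)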

You can use selenium to log in and retrieve the page source, then pass it to BeautifulSoup to extract the data you want.

If you go with selenium, you can do something like this:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# If you want to open Chrome
driver = webdriver.Chrome()
# If you want to open Firefox, use this instead:
# driver = webdriver.Firefox()

# open the login page first (URL is a placeholder)
driver.get("http://example.com/login")

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element_by_id("submit_btn").click()

# hand the logged-in page source to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, 'html.parser')
However, if you insist on using only BeautifulSoup, you can do it with a library like requests or urllib. Basically, all you have to do is POST your credentials as a payload to the login URL:

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # POST through the session so the login cookie is kept for later requests
    response = s.post(login_url, data=data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)
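
Many login forms also require a hidden CSRF token that must first be scraped from the login page and included in the POST payload; a sketch of that pattern with requests and BeautifulSoup (the field name '_token' is an assumption and varies from site to site):

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'

with requests.Session() as s:
    # fetch the login page first and pull the hidden token out of the form
    login_page = s.get(login_url)
    login_soup = BeautifulSoup(login_page.text, 'html.parser')
    token_input = login_soup.find('input', {'name': '_token'})  # assumed field name

    payload = {
        'username': 'your_username',
        'password': 'your_password',
    }
    if token_input is not None:
        payload['_token'] = token_input.get('value', '')

    response = s.post(login_url, data=payload)
    print(response.status_code)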

From my point of view, there is a simpler way that lets you skip selenium, mechanize, and other third-party tools, although it is semi-automated.

Basically, when you log in to a site the normal way, you identify yourself in a unique way using your credentials, and that same identity is then used for every interaction thereafter; it is stored in cookies and headers for a short period of time.

What you need to do is send those same cookies and headers with your HTTP requests, and you're in.

To replicate this, follow these steps:

  • In your browser, open the developer tools
  • Go to the site and log in
  • Once logged in, go to the Network tab and refresh the page
    At this point you should see a list of requests, with the actual site at the top. That one is our focus, because it contains the data carrying the identity we can use in Python with BeautifulSoup to scrape the site
  • Right-click the site request (the top one), hover over Copy, then choose Copy as cURL
  • Then go to a site that converts cURL commands into Python requests code
  • Take the generated Python code and continue scraping with the generated cookies and headers, as sketched after this list
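
The converter's output is ordinary requests code that replays the copied cookies and headers; a sketch of roughly what it looks like (every value below is a placeholder for what your browser actually sent):

import requests
from bs4 import BeautifulSoup

# placeholders; paste the cookies and headers generated from your copied cURL
cookies = {
    'session_id': 'your-session-cookie-value',
}
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://example.com/',
}

response = requests.get('https://example.com/', cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)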

Since no Python version was specified, here is an approach that uses only the Python 3 standard library. After logging in, scrape with BeautifulSoup, or any other kind of scraping, as usual.

Per Stack Overflow guidelines, the entire script is copied below:

    # Login to website using just Python 3 Standard Library
    import urllib.parse
    import urllib.request
    import http.cookiejar
    
    def scraper_login():
        ####### change variables here, like URL, action URL, user, pass
        # your base URL here, will be used for headers and such, with and without https://
        base_url = 'www.example.com'
        https_base_url = 'https://' + base_url
    
        # here goes URL that's found inside form action='.....'
        #   adjust as needed, can be all kinds of weird stuff
        authentication_url = https_base_url + '/login'
    
        # username and password for login
        username = 'yourusername'
        password = 'SoMePassw0rd!'
    
        # we will use this string to confirm a login at end
        check_string = 'Logout'
    
        ####### rest of the script is logic
        # but you will need to tweak couple things maybe regarding "token" logic
        #   (can be _token or token or _token_ or secret ... etc)
    
        # big thing! you need a referer for most pages! and correct headers are the key
        headers={"Content-Type":"application/x-www-form-urlencoded",
        "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
        "Host":base_url,
        "Origin":https_base_url,
        "Referer":https_base_url}
    
        # initiate the cookie jar (using : http.cookiejar and urllib.request)
        cookie_jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
        urllib.request.install_opener(opener)
    
        # first a simple request, just to get login page and parse out the token
        #       (using : urllib.request)
        request = urllib.request.Request(https_base_url)
        response = urllib.request.urlopen(request)
        contents = response.read()
    
        # parse the page, we look for token eg. on my page it was something like this:
        #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
        #       this can probably be done better with regex and similar
        #       but I'm newb, so bear with me
        html = contents.decode("utf-8")
        # text just before start and just after end of your token string
        mark_start = '<input type="hidden" name="_token" value="'
        mark_end = '">'
        # index of those two points
        start_index = html.find(mark_start) + len(mark_start)
        end_index = html.find(mark_end, start_index)
        # and text between them is our token, store it for second step of actual login
        token = html[start_index:end_index]
    
        # here we craft our payload, it's all the form fields, including HIDDEN fields!
    #   that includes the token we scraped earlier, as that's usually in hidden fields
        #   make sure left side is from "name" attributes of the form,
        #       and right side is what you want to post as "value"
        #   and for hidden fields make sure you replicate the expected answer,
        #       eg. "token" or "yes I agree" checkboxes and such
        payload = {
            '_token':token,
        #    'name':'value',    # make sure this is the format of all additional fields !
            'login':username,
            'password':password
        }
    
        # now we prepare all we need for login
        #   data - with our payload (user/pass/token) urlencoded and encoded as bytes
        data = urllib.parse.urlencode(payload)
        binary_data = data.encode('UTF-8')
        # and put the URL + encoded data + correct headers into our POST request
        #   btw, despite what I thought it is automatically treated as POST
        #   I guess because of byte encoded data field you don't need to say it like this:
        #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
        request = urllib.request.Request(authentication_url, binary_data, headers)
        response = urllib.request.urlopen(request)
        contents = response.read()
    
        # just for kicks, we confirm some element in the page that's secure behind the login
        #   we use a particular string we know only occurs after login,
        #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
        contents = contents.decode("utf-8")
        index = contents.find(check_string)
        # if we find it
        if index != -1:
            print(f"We found '{check_string}' at index position : {index}")
        else:
            print(f"String '{check_string}' was not found! Maybe we did not login ?!")
    
    scraper_login()
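
Because scraper_login() installs the cookie-handling opener globally with install_opener(), any later urllib request in the same process reuses the authenticated session; a minimal continuation (the URL is a placeholder):

    from bs4 import BeautifulSoup
    import urllib.request

    # the opener installed inside scraper_login() still holds the session cookie,
    # so a plain urlopen() call from here on is authenticated
    html = urllib.request.urlopen('https://www.example.com/members').read()
    soup = BeautifulSoup(html.decode('utf-8'), 'html.parser')
    print(soup.title)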
    