How can I scrape a website that requires login using Python and BeautifulSoup?

If I want to scrape a website that requires logging in with a password first, how can I start scraping it with Python using the beautifulsoup4 library? Below is what I do for websites that do not require a login:
from bs4 import BeautifulSoup
import urllib2
url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
How should I change the code to accommodate a login? Let's assume the website I want to scrape is a forum that requires logging in.

For example, you can use mechanize:
import mechanize
from bs4 import BeautifulSoup
import http.cookiejar  # this was cookielib in Python 2

# attach a cookie jar so the session survives across requests
cj = http.cookiejar.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)

br.open("https://id.arduino.cc/auth/login/")
br.select_form(nr=0)  # the login form is the first form on the page
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()

print(br.response().read())
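Once logged in, the same browser object keeps the session cookies, so you can fetch further pages and hand them to BeautifulSoup. A short sketch continuing the snippet above (the URL is a placeholder):

# fetch a page behind the login and parse it
page = br.open("https://id.arduino.cc/")
soup = BeautifulSoup(page.read(), 'html.parser')
print(soup.title)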
Or urllib would do the job as well.

You can also use selenium to log in and retrieve the page source, then pass it to BeautifulSoup to extract the data you want. If you choose selenium, you can do something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By

# if you want to open Chrome
driver = webdriver.Chrome()
# ...or, if you prefer Firefox:
# driver = webdriver.Firefox()

# navigate to the login page first (placeholder URL)
driver.get("http://example.com/login")

username = driver.find_element(By.ID, "username")
password = driver.find_element(By.ID, "password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element(By.ID, "submit_btn").click()
However, if you insist on using only BeautifulSoup, you can achieve this with a library such as requests or urllib. Basically, all you have to do is POST the login data as a payload to the login URL:
import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # POST through the session so the login cookies are kept
    response = s.post(login_url, data=data)
    print(response.text)

    # later requests on the same session are authenticated
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)
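One caveat: many login forms also include a hidden CSRF token that must be echoed back in the POST. A hedged sketch of handling that with requests and BeautifulSoup (the field name csrf_token and the URL are assumptions; check the hidden inputs of your actual form):

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'  # placeholder URL

with requests.Session() as s:
    # fetch the login page first and pull the hidden token out of the form
    login_page = s.get(login_url)
    soup = BeautifulSoup(login_page.text, 'html.parser')
    token_input = soup.find('input', {'name': 'csrf_token'})  # assumed field name

    data = {
        'username': 'your_username',
        'password': 'your_password',
    }
    if token_input is not None:
        data['csrf_token'] = token_input['value']

    response = s.post(login_url, data=data)
    print(response.status_code)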
From my point of view, there is a simpler way that gets you there without selenium or mechanize or other third-party tools, albeit semi-automated.

Basically, when you log in to a site in the normal way, you identify yourself in a unique way using your credentials, and that same identity is then used for every interaction thereafter; it is stored in cookies and headers for a short period of time.

What you need to do is use those same cookies and headers when you make your HTTP requests, and you're in.
To replicate that, follow these steps:

1. Log in to the site in your browser the normal way, open the developer tools, and switch to the Network tab.
2. Reload the page. At this point, you should see a list of requests, the topmost one being the actual site; that will be our focus, because it contains the data with the identity we can use for Python and BeautifulSoup to scrape the page.
3. Right-click that request, hover over copy, then choose copy as > cURL.
4. From the copied cURL command, take the cookies and headers and reuse them in your own HTTP requests to continue scraping.
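As an illustration, a minimal sketch of reusing those copied values with requests (the cookie and header values below are made-up placeholders; yours come out of the copied cURL command):

import requests
from bs4 import BeautifulSoup

# placeholders: paste the real values from your "copy as cURL" output here
cookies = {'sessionid': 'your-session-cookie-value'}
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://example.com'}

response = requests.get('http://example.com/members', cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)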
After logging in this way, use BeautifulSoup as usual, or any other kind of scraping.

Likewise, since no Python version was specified, here is a login done with just the Python 3 standard library. As per StackOverflow guidelines, the entire script is copied below:
# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    # adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at the end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak a couple of things, maybe regarding the "token" logic
    # (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a Referer for most pages! and correct headers are the key
    headers = {"Content-Type": "application/x-www-form-urlencoded",
               "User-agent": "Mozilla/5.0 Chrome/81.0.4044.92",  # Chrome 80+ as per web search
               "Host": base_url,
               "Origin": https_base_url,
               "Referer": https_base_url}

    # initiate the cookie jar (using: http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get the login page and parse out the token
    # (using: urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page; we look for the token, e.g. on my page it was something like this:
    # <input type="hidden" name="_token" value="random1234567890qwertzstring">
    # this can probably be done better with regex and similar,
    # but I'm a newbie, so bear with me
    html = contents.decode("utf-8")
    # text just before the start and just after the end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and the text between them is our token; store it for the second step, the actual login
    token = html[start_index:end_index]

    # here we craft our payload: it's all the form fields, including HIDDEN fields!
    # that includes the token we scraped earlier, as that's usually in hidden fields
    # make sure the left side is from the "name" attributes of the form,
    # and the right side is what you want to post as the "value"
    # and for hidden fields make sure you replicate the expected answer,
    # eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token': token,
        # 'name': 'value',  # make sure this is the format of all additional fields!
        'login': username,
        'password': password
    }

    # now we prepare all we need for login
    # data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')

    # and put the URL + encoded data + correct headers into our POST request
    # btw, despite what I thought, it is automatically treated as POST;
    # I guess because of the byte-encoded data field you don't need to say it like this:
    # urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element on the page that's secured behind the login;
    # we use a particular string we know only occurs after login,
    # like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
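Since the script installs the opener globally, later requests reuse the same cookie jar. So once scraper_login() succeeds, you can keep fetching pages through urllib and feed them to BeautifulSoup. A sketch (the /members path is a placeholder):

from bs4 import BeautifulSoup
import urllib.request

# the installed opener still carries the login cookies
page = urllib.request.urlopen('https://www.example.com/members').read()
soup = BeautifulSoup(page.decode('utf-8'), 'html.parser')
print(soup.title)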