How do I scrape a website that requires login using Python and BeautifulSoup?


If I want to scrape a website that requires logging in with a password first, how can I start scraping it with Python using the beautifulsoup4 library? Below is what I do for websites that do not require a login.

from bs4 import BeautifulSoup
import urllib2  # Python 2; urllib.request in Python 3

url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
How do I change the code to accommodate logging in? Assume the website I want to scrape is a forum that requires login.

You can use mechanize, for example:

import mechanize
from bs4 import BeautifulSoup
import urllib2
import cookielib  # http.cookiejar in Python 3

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

# select the first form on the page and fill in the credentials
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()

print br.response().read()  # Python 2 print statement
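
To go from the final print above to actual parsing, hand the response body to BeautifulSoup; a minimal continuation of the mechanize example (the parser choice here is ours):

# read the logged-in page once and parse it instead of printing the raw HTML
html = br.response().read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)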

Or with urllib and a cookie jar from the standard library.
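
A rough sketch of that route, using the Python 3 module names (urllib2/cookielib in Python 2); the URLs and form field names here are placeholders for whatever the target site actually uses, and a fuller standard-library walkthrough appears in a later answer:

import urllib.parse
import urllib.request
import http.cookiejar

from bs4 import BeautifulSoup

# a cookie-aware opener keeps the session cookie that the login sets
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# POST the credentials; 'username' and 'password' are assumed field names
payload = urllib.parse.urlencode({'username': 'username', 'password': 'password'})
opener.open("https://example.com/login", payload.encode('utf-8'))

# further requests through the same opener carry the session cookie
html = opener.open("https://example.com/forum").read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)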

You can use selenium to log in and retrieve the page source, then pass it to BeautifulSoup to extract the data you want.

If you go with selenium, you can do something like this:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# If you want to open Chrome
driver = webdriver.Chrome()
# If you want to open Firefox, use this instead:
# driver = webdriver.Firefox()

# open the login page first (URL is a placeholder)
driver.get("http://example.com/login")

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element_by_id("submit_btn").click()

# hand the logged-in page source to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, 'html.parser')
However, if you insist on using only BeautifulSoup, you can do it with a library like requests or urllib. Basically, all you have to do is POST your credentials as a payload to the login URL:

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # POST through the session so the login cookie is kept for later requests
    response = s.post(login_url, data=data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)
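
Many login forms also require a hidden CSRF token that must first be scraped from the login page and included in the POST payload; a sketch of that pattern with requests and BeautifulSoup (the field name '_token' is an assumption and varies from site to site):

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'

with requests.Session() as s:
    # fetch the login page first and pull the hidden token out of the form
    login_page = s.get(login_url)
    login_soup = BeautifulSoup(login_page.text, 'html.parser')
    token_input = login_soup.find('input', {'name': '_token'})  # assumed field name

    payload = {
        'username': 'your_username',
        'password': 'your_password',
    }
    if token_input is not None:
        payload['_token'] = token_input.get('value', '')

    response = s.post(login_url, data=payload)
    print(response.status_code)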

From my point of view, there is a simpler way that lets you skip selenium, mechanize, and other third-party tools, although it is semi-automated.

Basically, when you log in to a site the normal way, you identify yourself in a unique way using your credentials, and that same identity is then used for every interaction thereafter; it is stored in cookies and headers for a short period of time.

What you need to do is send those same cookies and headers with your HTTP requests, and you're in.

To replicate this, follow these steps:

  • In your browser, open the developer tools
  • Go to the site and log in
  • Once logged in, go to the Network tab and refresh the page
    At this point you should see a list of requests, with the actual site at the top. That one is our focus, because it contains the data carrying the identity we can use in Python with BeautifulSoup to scrape the site
  • Right-click the site request (the top one), hover over Copy, then choose Copy as cURL
  • Then go to a site that converts cURL commands into Python requests code
  • Take the generated Python code and continue scraping with the generated cookies and headers, as sketched after this list
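
The converter's output is ordinary requests code that replays the copied cookies and headers; a sketch of roughly what it looks like (every value below is a placeholder for what your browser actually sent):

import requests
from bs4 import BeautifulSoup

# placeholders; paste the cookies and headers generated from your copied cURL
cookies = {
    'session_id': 'your-session-cookie-value',
}
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://example.com/',
}

response = requests.get('https://example.com/', cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)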

Since no Python version was specified, here is an approach that uses only the Python 3 standard library. After logging in, scrape with BeautifulSoup, or any other kind of scraping, as usual.

Per Stack Overflow guidelines, the entire script is copied below:

    # Login to website using just Python 3 Standard Library
    import urllib.parse
    import urllib.request
    import http.cookiejar
    
    def scraper_login():
        ####### change variables here, like URL, action URL, user, pass
        # your base URL here, will be used for headers and such, with and without https://
        base_url = 'www.example.com'
        https_base_url = 'https://' + base_url
    
        # here goes URL that's found inside form action='.....'
        #   adjust as needed, can be all kinds of weird stuff
        authentication_url = https_base_url + '/login'
    
        # username and password for login
        username = 'yourusername'
        password = 'SoMePassw0rd!'
    
        # we will use this string to confirm a login at end
        check_string = 'Logout'
    
        ####### rest of the script is logic
        # but you will need to tweak couple things maybe regarding "token" logic
        #   (can be _token or token or _token_ or secret ... etc)
    
        # big thing! you need a referer for most pages! and correct headers are the key
        headers={"Content-Type":"application/x-www-form-urlencoded",
        "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
        "Host":base_url,
        "Origin":https_base_url,
        "Referer":https_base_url}
    
        # initiate the cookie jar (using : http.cookiejar and urllib.request)
        cookie_jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
        urllib.request.install_opener(opener)
    
        # first a simple request, just to get login page and parse out the token
        #       (using : urllib.request)
        request = urllib.request.Request(https_base_url)
        response = urllib.request.urlopen(request)
        contents = response.read()
    
        # parse the page, we look for token eg. on my page it was something like this:
        #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
        #       this can probably be done better with regex and similar
        #       but I'm newb, so bear with me
        html = contents.decode("utf-8")
        # text just before start and just after end of your token string
        mark_start = '<input type="hidden" name="_token" value="'
        mark_end = '">'
        # index of those two points
        start_index = html.find(mark_start) + len(mark_start)
        end_index = html.find(mark_end, start_index)
        # and text between them is our token, store it for second step of actual login
        token = html[start_index:end_index]
    
        # here we craft our payload, it's all the form fields, including HIDDEN fields!
    #   that includes the token we scraped earlier, as that's usually in hidden fields
        #   make sure left side is from "name" attributes of the form,
        #       and right side is what you want to post as "value"
        #   and for hidden fields make sure you replicate the expected answer,
        #       eg. "token" or "yes I agree" checkboxes and such
        payload = {
            '_token':token,
        #    'name':'value',    # make sure this is the format of all additional fields !
            'login':username,
            'password':password
        }
    
        # now we prepare all we need for login
        #   data - with our payload (user/pass/token) urlencoded and encoded as bytes
        data = urllib.parse.urlencode(payload)
        binary_data = data.encode('UTF-8')
        # and put the URL + encoded data + correct headers into our POST request
        #   btw, despite what I thought it is automatically treated as POST
        #   I guess because of byte encoded data field you don't need to say it like this:
        #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
        request = urllib.request.Request(authentication_url, binary_data, headers)
        response = urllib.request.urlopen(request)
        contents = response.read()
    
        # just for kicks, we confirm some element in the page that's secure behind the login
        #   we use a particular string we know only occurs after login,
        #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
        contents = contents.decode("utf-8")
        index = contents.find(check_string)
        # if we find it
        if index != -1:
            print(f"We found '{check_string}' at index position : {index}")
        else:
            print(f"String '{check_string}' was not found! Maybe we did not login ?!")
    
    scraper_login()
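
Because scraper_login() installs the cookie-handling opener globally with install_opener(), any later urllib request in the same process reuses the authenticated session; a minimal continuation (the URL is a placeholder):

    from bs4 import BeautifulSoup
    import urllib.request

    # the opener installed inside scraper_login() still holds the session cookie,
    # so a plain urlopen() call from here on is authenticated
    html = urllib.request.urlopen('https://www.example.com/members').read()
    soup = BeautifulSoup(html.decode('utf-8'), 'html.parser')
    print(soup.title)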
    