How to download large media links from a web page behind a login form in Python?


python web-scraping


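I'm looking for a Python library that can:

a) log in to a website,
b) find all links to some media files (say, those with "download" in their URLs), and
c) download each file efficiently straight to the hard drive, without loading the whole media file into RAM.

Thanks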

You can use the mechanize module to log in to the website, like this:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0)  #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()

Use bs4 to parse this response and find all the hyperlinks in the page, like this:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(result, "lxml")

# Collect the href of every anchor; href=True skips <a> tags that have none
links = []
for link in soup.find_all('a', href=True):
    links.append(link['href'])

You can use re to narrow all the links on the response page down to the ones you need, which in your case are media links such as .mp3, .mp4, .jpg, and so on.
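
As a rough sketch of that filtering step, the snippet below keeps only the links whose path ends in a media extension. The function name `filter_media_links`, the extension list, and the sample links are assumptions made for illustration and are not part of the answer; adjust the pattern to the files you actually need.

```python
import re

# Hypothetical pattern: match URLs whose path ends in a common media extension,
# optionally followed by a query string or fragment.
MEDIA_PATTERN = re.compile(r"\.(mp3|mp4|jpg)(?:[?#].*)?$", re.IGNORECASE)


def filter_media_links(links):
    """Return only the links that look like media files."""
    return [link for link in links if link and MEDIA_PATTERN.search(link)]


# Sample data standing in for the list of hrefs collected with BeautifulSoup.
sample_links = [
    "/files/song.mp3",
    "/about.html",
    None,  # a missing href may show up as None
    "/video/clip.MP4?token=abc",
    "/images/photo.jpg#top",
]
media_links = filter_media_links(sample_links)
print(media_links)
```

Matching on the extension is only a heuristic; a link that serves media from a URL without an extension would need a different test, such as checking for "download" in the URL.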

Finally, use the requests module to stream the media files so that they don't take up too much memory, like this:

import requests

# url is the media URL, target_path is where the file is saved
response = requests.get(url, stream=True)
with open(target_path, "wb") as handle:
    for chunk in response.iter_content(chunk_size=512):
        if chunk:  # filter out keep-alive chunks
            handle.write(chunk)

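When the stream argument of get is set to True, the content is not downloaded into RAM right away. Instead, the response behaves like an iterable that you can loop over, after the get call, in chunks of chunk_size bytes. You write each chunk to disk before moving on to the next one, so the whole file is never held in RAM.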

If you want to download the media for every link in the links list, you have to put this last piece of code inside a loop.
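
A minimal sketch of such a loop follows. It assumes `session` is a logged-in `requests.Session` (or any object with a compatible `get` method); cookies obtained through mechanize are not shared with requests automatically, so the login has to be available to that session. The names `download_all` and `filename_from_url`, the fallback file name, and the 64 KiB chunk size are assumptions made for illustration.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url, default="download.bin"):
    """Use the last path segment of the URL as the local file name."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def download_all(session, urls, dest_dir, chunk_size=64 * 1024):
    """Stream every URL to dest_dir without holding a whole file in RAM."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        target_path = os.path.join(dest_dir, filename_from_url(url))
        # stream=True defers the body download until iter_content is consumed
        with session.get(url, stream=True) as response:
            response.raise_for_status()
            with open(target_path, "wb") as handle:
                for chunk in response.iter_content(chunk_size=chunk_size):
                    if chunk:  # skip empty keep-alive chunks
                        handle.write(chunk)
        saved.append(target_path)
    return saved
```

With requests installed, a call might look like `download_all(requests.Session(), media_links, "downloads")`, where `media_links` is a list of absolute URLs. Two links that end in the same file name would overwrite each other in this sketch.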

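You may need to make some changes to this code to get it working, since I haven't tested it myself against your use case, but hopefully it gives you a blueprint.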

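You can use the widely used requests module (more than 35k stars on GitHub) and BeautifulSoup. The former handles session cookies, redirects, encodings, and compression transparently. The latter finds parts of the HTML code and has an easy-to-remember syntax, e.g. [] for the attributes of HTML tags.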

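A complete example for Python 3.5.2 follows. You can use it for a website that can be scraped without a JavaScript engine; it logs in and then sequentially downloads the links that have download in their URL.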

import shutil
import sys
import requests
from bs4 import BeautifulSoup

""" Requirements: beautifulsoup4, requests """

SCHEMA_DOMAIN = 'https://exmaple.com'
URL = SCHEMA_DOMAIN + '/house.php/' # this is the log-in URL
# These are the name attributes of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
        'login[login]',
        'login[password]']

client = requests.session()

request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
        KEYS[1]: 'my_username',
        KEYS[2]: 'my_password'}
# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
                      data=data,
                      headers=dict(Referer=URL))
soup = BeautifulSoup(request.text, features="html.parser")
# href=True skips anchors without an href, which would raise a KeyError below
generator = ((tag['href'], tag.string)
             for tag in soup.find_all('a', href=True)
             if 'download' in tag['href'])
for url, name in generator:
    with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
        if request.status_code == 200:
            with open(name, 'wb') as output:
                request.raw.decode_content = True
                shutil.copyfileobj(request.raw, output)
        else:
            print('status code was {} for {}'.format(request.status_code,
                                                     name),
                  file=sys.stderr)
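
The example builds each download URL as `SCHEMA_DOMAIN + url`, which assumes every `href` starts with `/`. If a page mixes relative and absolute links, the standard library's `urllib.parse.urljoin` can resolve them against the page URL instead. The sketch below uses made-up `href` values and the hypothetical name `page_url` to show the difference.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were scraped from
page_url = "https://example.com/house.php/user/login"

hrefs = [
    "/files/download/a.mp3",                   # root-relative
    "download/b.mp4",                          # relative to the current page
    "https://cdn.example.com/download/c.jpg",  # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for url in resolved:
    print(url)
# https://example.com/files/download/a.mp3
# https://example.com/house.php/user/download/b.mp4
# https://cdn.example.com/download/c.jpg
```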

You can do it yourself with scrapy or pyspider. There is also Scrapinghub, the company behind scrapy, which you can pay to do the scraping for you.