Python: Downloading a list of URLs to a list of file names with Scrapy, with rate limiting


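I have a large list of URLs (about 400K) that I want to download, and I would like to use Scrapy's concurrent downloading for it. Even the most basic pipeline example I have found is too complicated.

Can you give me a simple example?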

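I have the URLs and the target file names stored in two lists like these, with each URL to be saved under the file name at the same position: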

url_list = ['http://www.example.com/index.html',
            'http://www.something.com/index.html']

file_list = ['../file1.html',
             '../file2.html']

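Rate limiting would be a nice bonus, so that a single slow server does not get overloaded.

Note: it does not have to be Scrapy if there is another way to do this.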

You can adapt this snippet to do what you need:

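import os

import grequests


def exception_handler(request, exception):
    # Called by grequests.map for every request that raises (timeout, DNS error, ...).
    print("Request failed:", request.url, exception)


def chop(seq, size):
    """Chop a sequence into chunks of the given size."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def get_chunk(chunk):
    # Send the whole chunk concurrently; a failed request comes back as None.
    reqs = (grequests.get(u) for u in chunk)
    responses = grequests.map(reqs, exception_handler=exception_handler)
    for r in responses:
        if r is None:
            continue
        # The file name is whatever follows the last '=' in the URL; adapt this to your URLs.
        player_id = r.request.url.split('=')[-1]
        print(r.status_code, player_id, r.request.url, len(r.content))
        # r.content is bytes, so the file is opened in binary mode.
        with open('data/%s.html' % player_id, 'wb') as f:
            f.write(r.content)


# One URL per line; blank lines are skipped.
with open('temp/urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

os.makedirs('data', exist_ok=True)

# At most 150 requests are in flight at any time.
for chunk in chop(urls, 150):
    get_chunk(chunk)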
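
The snippet above derives file names from the URL and has no rate limit. As a rough sketch of the setup in the question (a url_list paired with a file_list, plus a per-server rate limit), the following uses only the Python 3 standard library instead of grequests or Scrapy. The names HostThrottle, fetch and download_all, and the fetcher, workers and min_interval parameters, are made up for this illustration.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlsplit


class HostThrottle:
    """Space out request start times per host by at least min_interval seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = {}  # host -> earliest allowed start time (monotonic clock)

    def wait(self, url):
        host = urlsplit(url).netloc
        with self._lock:
            now = time.monotonic()
            # Reserve the next free slot for this host, then release the lock.
            slot = max(now, self._next_slot.get(host, now))
            self._next_slot[host] = slot + self.min_interval
        # Sleep outside the lock so other hosts are not held up.
        time.sleep(slot - now)


def fetch(url, timeout=30):
    """Return the response body of url as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(url_list, file_list, fetcher=fetch, workers=16, min_interval=1.0):
    """Save url_list[i] to file_list[i]; return a list of (url, error_or_None)."""
    throttle = HostThrottle(min_interval)

    def job(url, path):
        throttle.wait(url)
        try:
            data = fetcher(url)
        except Exception as exc:  # keep going; report the failure to the caller
            return url, exc
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        return url, None

    # Executor.map pairs the two lists element by element and keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, url_list, file_list))
```

Calling download_all(url_list, file_list) with the two lists from the question returns one (url, error) pair per URL, where error is None on success. Threads are used here only to keep the sketch dependency-free; for 400K URLs an asynchronous client may scale better.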

Oh, great! This looks much better than Scrapy, and I think it will work in Python 3 too. Thanks a lot.
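
For anyone who does stay with Scrapy, the rate limiting asked about in the question is mostly a matter of project settings. A possible settings.py fragment is sketched below. The setting names are Scrapy's own, but every value is an illustrative assumption, not a recommendation from this thread.

```python
# settings.py fragment (all values are assumptions; tune them for the target servers)

# Upper bound on requests in flight across all domains.
CONCURRENT_REQUESTS = 32

# Upper bound on simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Base delay, in seconds, between consecutive requests to the same domain
# (Scrapy randomizes it around this value by default).
DOWNLOAD_DELAY = 0.5

# Let the AutoThrottle extension adapt the delay to observed server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```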