Python urllib2 urlopen read timeout/blocking


python multithreading web-crawler


Recently I have been working on a tiny crawler that downloads the images found on a URL.

I use urlopen() from urllib2 together with open()/f.write().

Here is the code snippet:

# the list for the images' urls
imglist = re.findall(regImg,pageHtml)

# iterate to download images
for index in xrange(1,len(imglist)+1):
    img = urllib2.urlopen(imglist[index-1])
    f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
    print('To Read...')

    # potential timeout, may block for a long time
    # so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
    f.write(img.read())
    f.close()
    print('Image %d is ready !' % index)

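In the code above, img.read() can potentially block for a long time, and I would like to retry or re-open the image URL when that happens.

I am also concerned about the efficiency of the code above. If the number of images to download is somewhat large, using a thread pool to download them seems better.

Any suggestions? Thanks in advance.

P.S. I found that the read() method on the img object is what can block, so adding a timeout parameter to urlopen() alone seems useless. But I found that the file object has no version of read() with a timeout. Any suggestions? Thanks very much.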

urllib2.urlopen has a timeout parameter that is used for all blocking operations (connection setup, etc.).
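To illustrate how that timeout interacts with read(), here is a minimal sketch. It assumes Python 3, where urllib.request is the successor of urllib2. The names read_with_deadline, fetch and ReadDeadlineExceeded are hypothetical and not part of any library. The socket timeout bounds each individual receive on the socket, so a server that trickles data slowly can still keep a single read() busy for a long time. Reading in chunks and checking an overall deadline between chunks is one way to bound the total time, although only approximately.

```python
import time
import urllib.request  # assumption: Python 3 successor of urllib2


class ReadDeadlineExceeded(Exception):
    """Hypothetical error: the body was not complete before the deadline."""


def read_with_deadline(response, deadline_seconds, chunk_size=8192):
    """Read a file-like response in chunks, giving up after deadline_seconds."""
    started = time.time()
    chunks = []
    while True:
        # The deadline is only checked between chunks, so it is approximate.
        if time.time() - started > deadline_seconds:
            raise ReadDeadlineExceeded(
                "body not complete after %.1f s" % deadline_seconds)
        # The socket timeout applies to each underlying receive.
        chunk = response.read(chunk_size)
        if not chunk:  # empty bytes means the body is finished
            break
        chunks.append(chunk)
    return b"".join(chunks)


def fetch(url, socket_timeout=10, deadline_seconds=60):
    """Open url with a per-operation timeout and read it under a deadline."""
    response = urllib.request.urlopen(url, timeout=socket_timeout)
    try:
        return read_with_deadline(response, deadline_seconds)
    finally:
        response.close()
```

A caller could catch ReadDeadlineExceeded (or socket.timeout) around fetch() and decide whether to try the same URL again.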

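The snippet below is taken from one of my projects. I use a thread pool to download multiple files at once. It uses urllib.urlretrieve, but the logic is the same.

url_and_path_list is a list of (url, path) tuples, num_concurrent is the number of threads to spawn, and skip_existing skips the download of files that already exist on the file system.
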
import Queue
import threading
import urllib
from os.path import exists

def download_urls(url_and_path_list, num_concurrent, skip_existing):
    # prepare the queue
    queue = Queue.Queue()
    for url_and_path in url_and_path_list:
        queue.put(url_and_path)

    # start the requested number of download threads to download the files
    threads = []
    for _ in range(num_concurrent):
        t = DownloadThread(queue, skip_existing)
        t.daemon = True
        t.start()

    queue.join()

class DownloadThread(threading.Thread):
    def __init__(self, queue, skip_existing):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.skip_existing = skip_existing

    def run(self):
        while True:
            #grabs url from queue
            url, path = self.queue.get()

            if self.skip_existing and exists(path):
                # skip if requested
                self.queue.task_done()
                continue

            try:
                urllib.urlretrieve(url, path)
            except IOError:
                print "Error downloading url '%s'." % url

            #signals to queue job is done
            self.queue.task_done()
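
One thing worth noting about the snippet above: urlretrieve takes no timeout argument, so a stalled download can hold a worker thread for a long time. A possible workaround, sketched here under the assumption that a process-wide setting is acceptable, is to set the global default socket timeout while the downloads run. The helper name default_socket_timeout is hypothetical.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the default timeout for newly created sockets."""
    previous = socket.getdefaulttimeout()
    # Affects every socket created in this process from now on.
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore whatever was configured before.
        socket.setdefaulttimeout(previous)


# Hypothetical usage with the download_urls function shown above:
# with default_socket_timeout(30):
#     download_urls(url_and_path_list, num_concurrent=4, skip_existing=True)
```

Because the setting is global, it also changes the behavior of any other code in the process that opens sockets inside the with block.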

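When you create the connection with urllib2.urlopen, you can supply a timeout parameter.
As the documentation says:
"The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections."
With it you will be able to manage the maximum waiting time and catch the exception that is raised.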
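
Catching the exception makes the retry the question asks about fairly direct. The following is a minimal sketch assuming Python 3 (urllib.request instead of urllib2). The function name fetch_with_retry and its parameters retries, backoff_seconds and opener are hypothetical. The opener parameter exists only so the function can be exercised without a network.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, timeout=10, retries=3, backoff_seconds=1.0,
                     opener=urllib.request.urlopen):
    """Return the body of url, retrying when connecting or reading times out."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except urllib.error.HTTPError:
            # The server answered with an error status: do not retry here.
            raise
        except (socket.timeout, urllib.error.URLError) as exc:
            last_error = exc
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                time.sleep(backoff_seconds * attempt)
    raise last_error
```

In the question's loop, the call `urllib2.urlopen(imglist[index-1])` followed by `img.read()` could be replaced by one call to a function like this, with the result written to the file.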
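My approach to crawling a huge number of documents is to use a batch processor that crawls and dumps constant-sized chunks.
Suppose you need to crawl a known batch of, say, 100K documents. You can have some logic generate constant-sized chunks of 1,000 documents each, which a thread pool then downloads. Once a whole chunk has been crawled, you can do a bulk insert into your database, and then move on to the next 1,000 documents, and so on.
Advantages of this approach:

- You get the benefit of a thread pool speeding up the crawl.
- It is fault tolerant in the sense that you can resume from the chunk where it last failed.
- You can generate the chunks by priority, i.e. crawl the important documents first. If you cannot complete the whole batch, the important documents have still been processed, and the less important ones can be picked up on the next run.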
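
The chunked approach described above could look roughly like this. It is a sketch under stated assumptions, not the answerer's actual code: it assumes Python 3 with concurrent.futures, and the names chunked, crawl_in_batches, fetch, store_batch and start_chunk are all hypothetical. fetch stands for whatever downloads one document, and store_batch for the bulk insert.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive lists of at most size items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def crawl_in_batches(urls, fetch, store_batch, chunk_size=1000,
                     num_workers=8, start_chunk=0):
    """Fetch urls chunk by chunk, handing each finished chunk to store_batch.

    Returns the index of the next chunk to process, so that a later run
    can resume by passing it back as start_chunk.
    """
    def safe_fetch(url):
        try:
            return url, fetch(url)
        except Exception:
            # A failed document is recorded as None instead of aborting.
            return url, None

    next_chunk = start_chunk
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for index, chunk in enumerate(chunked(urls, chunk_size)):
            if index < start_chunk:
                continue  # finished in an earlier run
            # pool.map keeps the results in the same order as the chunk.
            results = list(pool.map(safe_fetch, chunk))
            store_batch(results)  # e.g. one bulk insert per chunk
            next_chunk = index + 1
    return next_chunk
```

To crawl by priority, the caller would sort urls so the important documents come first. Resuming after a crash depends on the caller persisting the chunk index somewhere, which this sketch leaves out.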
An ugly hack that seems to work:
import os, socket, threading, errno

def timeout_http_body_read(response, timeout = 60):
    def murha(resp):
        os.close(resp.fileno())
        resp.close()

    # set a timer to yank the carpet underneath the blocking read() by closing the os file descriptor
    t = threading.Timer(timeout, murha, (response,))
    t.start()
    try:
        body = response.read()
    except socket.error as se:
        if se.errno == errno.EBADF: # murha happened
            return (False, None)
        raise
    finally:
        # always cancel the timer so it cannot close a descriptor later on
        t.cancel()
    return (True, body)
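
A less invasive variant of the same idea, offered only as a sketch, is to run read() in a helper thread and stop waiting after the timeout instead of closing the file descriptor. The function name read_with_thread_timeout is hypothetical. It returns the same (ok, body) tuple as the hack above. The trade-off is that a stalled read() is abandoned, not interrupted: the daemon thread and its socket stay around until the read eventually returns or the process exits.

```python
import threading


def read_with_thread_timeout(response, timeout=60):
    """Call response.read() in a daemon thread and wait at most timeout seconds."""
    result = {}

    def worker():
        try:
            result["body"] = response.read()
        except Exception as exc:
            # Hand the error back to the calling thread.
            result["error"] = exc

    t = threading.Thread(target=worker)
    t.daemon = True  # do not keep the process alive for a stalled read
    t.start()
    t.join(timeout)
    if t.is_alive():
        # Still blocked in read(): give up waiting, the thread is left behind.
        return (False, None)
    if "error" in result:
        raise result["error"]
    return (True, result["body"])
```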

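Thanks a lot, I'll try this. I found that the read() method on the img object is what can block, so adding a timeout parameter to urlopen() alone seems useless. But I found that the file object has no version of read() with a timeout. Any suggestions? Thanks very much.

@destiny1020 Did you ever find a good solution to this? I'm running into a block on read() that causes my script to hang.

If you find this comment before finding an answer to the "read is blocking" problem, see: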