Python requests library is too slow

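I am using the Python requests library to fetch the source of a list of URLs, then applying a regular expression to extract some data with the following code: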

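import re
import requests

# urls: an iterable of page URLs, defined elsewhere
for url in urls:
    print(url)
    page = requests.get(url)
    # Raw string so that \s reaches the regex engine unchanged
    matches = re.findall(r'btn btn-primary font-bold">\s*<span>([^<]*)', page.text)
    for match in matches:
        print(match)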

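This code works, but it is too slow: each request takes more than 5 seconds. Are there any suggestions for speeding it up?

Also, should I add any try/except code to make it more robust?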

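I agree with the comments: profiling is a good way to see what is slowing you down. If it is an option, one obvious way to speed the code up is to parallelize it. Here is a simple suggestion: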

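from multiprocessing.dummy import Pool as Threadpool
import re
import requests


def parallelURL(url):
    """Fetch one URL and print every regex match found in it."""
    print(url)
    page = requests.get(url)
    matches = re.findall(r'btn btn-primary font-bold">\s*<span>([^<]*)', page.text)
    for match in matches:
        print(match)

# multiprocessing.dummy provides a thread-backed Pool with the multiprocessing API
pool = Threadpool(6)  # play around with this number; the best value depends on your machine

# urlList: the list of URLs to fetch, defined elsewhere
pool.map(parallelURL, urlList)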

On my computer, fetching Google this way went from 1.9 seconds to 0.3 seconds, roughly a sixfold speedup.
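On the try/except part of the question, here is a minimal sketch of how the threaded version could tolerate failing URLs. It assumes Python 3 and swaps in concurrent.futures.ThreadPoolExecutor for multiprocessing.dummy. The names fetch_matches and fetch_all, the injectable get parameter, the 10-second timeout and the 6 workers are all illustrative assumptions, not part of the original code.

```python
import re
from concurrent.futures import ThreadPoolExecutor

import requests

# Pattern taken from the question; compiled once and reused by every worker.
PATTERN = re.compile(r'btn btn-primary font-bold">\s*<span>([^<]*)')


def fetch_matches(url, get=requests.get, timeout=10):
    """Return (url, matches, error) so that one bad URL cannot abort the batch.

    `get` is a hypothetical injection point (defaults to requests.get) that
    makes the function easy to exercise without network access.
    """
    try:
        # requests has no default timeout; without one a stalled server can
        # block a worker thread indefinitely.
        page = get(url, timeout=timeout)
        page.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        return url, [], exc
    return url, PATTERN.findall(page.text), None


def fetch_all(urls, workers=6, get=requests.get):
    """Fetch the URLs on a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_matches(u, get=get), urls))
```

Calling fetch_all(urls) returns one tuple per URL. Entries whose third element is not None failed, and can be logged or retried without losing the results of the others.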

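I found that for larger file downloads, downloading the body in chunks is much faster. By default, I believe requests.get(uri, stream=False) uses a chunk size of 1.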

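import io
import requests

# Get the HTTP headers only; stream=True defers downloading the body
# (uri is the URL to download, defined elsewhere)
r = requests.get(uri, stream=True)
# Read the body in 1 KB chunks
http_body_buf = io.BytesIO()
for chunk in r.iter_content(chunk_size=1024):
    http_body_buf.write(chunk)
http_body = http_body_buf.getvalue()
http_body_buf.close()

iter_content yields bytes, so the body is collected in io.BytesIO rather than StringIO, which accepts only text.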

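Have you measured the actual speed? How long does it take to fetch the page content with curl? How long does the regex take to run? Why use a regex to parse HTML at all? For example, why not use BeautifulSoup? You can use Python's cProfile module to see where most of the time is spent.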
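As a rough sketch of that measurement, the regex step can be timed and profiled on its own, separately from the download. The helper names extract, timed and profile, and the SAMPLE string standing in for page.text, are assumptions made up for illustration. Only the pattern comes from the question.

```python
import cProfile
import io
import pstats
import re
import time

# Pattern taken from the question.
PATTERN = re.compile(r'btn btn-primary font-bold">\s*<span>([^<]*)')


def extract(html):
    """The regex step from the question, isolated so it can be timed alone."""
    return PATTERN.findall(html)


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def profile(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical sample markup standing in for page.text.
SAMPLE = '<a class="btn btn-primary font-bold">\n <span>Buy now</span></a>' * 1000

matches, seconds = timed(extract, SAMPLE)
print(f"regex: {len(matches)} matches in {seconds:.6f}s")

_, report = profile(extract, SAMPLE)
print(report)
```

The same timed helper can wrap requests.get with a real URL. If the download accounts for nearly all of the 5 seconds and the regex for almost none, the bottleneck is the network or the server rather than the parsing.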