What could cause a Python web scraper to gradually slow down and eventually stop?

python web-scraping

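I'm scraping the web in Python with urllib2 and BeautifulSoup and continuously saving the scraped content to a file. I've noticed that progress gets slower and slower and eventually stops within 4 to 8 hours, even for something as simple as this: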

import urllib2
from bs4 import BeautifulSoup

def searchBook():
    fb = open(r'filePath', 'a')
    for index in range(3510000,3520000):
        url = 'http://www.qidian.com/Book/' + str(index) + '.aspx'
        try:
            html = urllib2.urlopen(url,'html').read()
            soup = BeautifulSoup(html)
            stats = getBookStats(soup)
            fb.write(str(stats))
            fb.write('\n')                
        except:
            print url + " doesn't exist"
    fb.close()


def getBookStats(soup):                                         # extract book info from script
    stats = {}
    stats['trialStatus'] = soup.find_all('span',{'itemprop':'trialStatus'})[0].string
    stats['totalClick'] = soup.find_all('span',{'itemprop':'totalClick'})[0].string
    stats['monthlyClick'] = soup.find_all('span',{'itemprop':'monthlyClick'})[0].string
    stats['weeklyClick'] = soup.find_all('span',{'itemprop':'weeklyClick'})[0].string
    stats['genre'] = soup.find_all('span',{'itemprop':'genre'})[0].string
    stats['totalRecommend'] = soup.find_all('span',{'itemprop':'totalRecommend'})[0].string
    stats['monthlyRecommend'] = soup.find_all('span',{'itemprop':'monthlyRecommend'})[0].string
    stats['weeklyRecommend'] = soup.find_all('span',{'itemprop':'weeklyRecommend'})[0].string
    stats['updataStatus'] = soup.find_all('span',{'itemprop':'updataStatus'})[0].string
    stats['wordCount'] = soup.find_all('span',{'itemprop':'wordCount'})[0].string
    stats['dateModified'] = soup.find_all('span',{'itemprop':'dateModified'})[0].string
    return stats

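My questions are:

1) What is the bottleneck in this code: urllib2.urlopen() or soup.find_all()?

2) The only way I can tell that the code has stopped is by checking the output file. I then manually restart the process from where it stopped. Is there a more efficient way to tell whether the code has stopped? Is there a way to restart it automatically?

3) Of course, the best option would be to keep the code from slowing down and stopping at all. What are the places I can check?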

I am currently trying out the suggestions from the answers and comments.

1) @davidermann

url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
with urllib2.urlopen(url,'html') as u: html = u.read()
# html = urllib2.urlopen(url,'html').read()
--------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-8b6f635f6bd5> in <module>()
      1 url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
----> 2 with urllib2.urlopen(url,'html') as u: html = u.read()
      3 html = urllib2.urlopen(url,'html').read()
      4 soup = BeautifulSoup(html)

AttributeError: addinfourl instance has no attribute '__exit__'
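The traceback above shows that the object returned by urllib2.urlopen() in Python 2 has no `__exit__` method, so it cannot be used directly in a `with` statement. One common workaround is `contextlib.closing`, which wraps any object that has a `close()` method. The sketch below is illustrative only: `FakeResponse` and `read_and_close` are hypothetical names that stand in for the real response object so the example runs without network access.

```python
from contextlib import closing


class FakeResponse(object):
    """Hypothetical stand-in for a urlopen result: has close() but no __exit__."""

    def __init__(self, body):
        self.body = body
        self.closed = False

    def read(self):
        return self.body

    def close(self):
        self.closed = True


def read_and_close(response):
    """Read the whole body and make sure close() is called, even on error."""
    with closing(response) as u:
        return u.read()


# In the Python 2 code from the question, the equivalent would be roughly:
#     with closing(urllib2.urlopen(url)) as u:
#         html = u.read()
```

In Python 3, the response returned by urllib.request.urlopen() supports `with` directly, so the wrapper is only needed on Python 2.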

2) @Stardustone

After adding sleep() calls in various places, the program still stops.

I suspect the system's load average is too high. Try adding time.sleep(0.5) in the try part of each iteration:

    try:
        html = urllib2.urlopen(url,'html').read()
        soup = BeautifulSoup(html)
        stats = getBookStats(soup)
        fb.write(str(stats))
        fb.write('\n')
        time.sleep(0.5)

Look up how to measure the time a function call takes. Timing the calls will let you determine whether it is urlopen() that is getting slower.
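A minimal sketch of such timing, written for Python 3 (on Python 2, `time.time()` would replace `time.perf_counter()`, which does not exist there). The helper names `timed` and `run_stages` are hypothetical, and the stages passed in are whatever fetch and parse functions the scraper already has.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def run_stages(item, stages, totals):
    """Pass item through each (name, fn) stage, adding each stage's time to totals."""
    for name, fn in stages:
        item, elapsed = timed(fn, item)
        totals[name] = totals.get(name, 0.0) + elapsed
    return item
```

Calling `run_stages(url, [("fetch", fetch_fn), ("parse", parse_fn)], totals)` inside the loop and printing `totals` every few hundred iterations would show whether the fetch time or the parse time is the one that grows.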

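As @halfer says, it is quite likely that the site you are scraping does not want you to scrape that much and is progressively throttling your requests. Check their terms of service, and check whether they offer an API as an alternative to scraping.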

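First, use except Exception as e: and print e, so that you actually get some error information.

Good idea! Thanks a lot.

Also, your scraping target may have detected that you are making many fetches and started throttling them. Maybe add a timer to see where the slowness is. If it is in the fetching, you may need to add some sleeps to keep the throttle from kicking in.

How fast are you fetching (requests/sec)? How many do you do in one session?

@halfer My previous question was about multithreading, i.e. I wanted to fetch URLs in parallel. This one describes the job terminating on its own. How do I calculate how many requests per second I'm fetching? Do I add a timer, or is there a urllib built-in function I can call?

"How do I calculate how many requests per second I'm fetching?" - I don't know, I don't use Python. But I'd say it's an important thing to look into.

Since the several requests you are making don't seem to depend on each other, maybe you could consider making concurrent requests or using a multithreading library?

How would that improve performance? Also, URL fetching and HTML parsing tend to be bound more by IO than by parsing the HTML, so I doubt the machine's CPU load is too high.

@Coeus This is a subroutine of my code; other parts of it have nested URL requests (independent, though) (see [). I believe this is a different problem from multithreading, which I would also like.

@StardustOne Unfortunately, that approach didn't work either.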
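Two points from the comments, printing the actual exception and measuring requests per second, can be sketched together. This is a Python 3 illustration under stated assumptions, not the asker's code: `fetch`, `scrape`, `fetch_fn`, and `report_every` are hypothetical names. It also passes a `timeout` to urlopen so that a stalled connection raises an error instead of hanging, and it omits the `'html'` second argument from the question, because the second positional argument of urlopen is the request body (`data`), not a content type.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Fetch a URL; the timeout makes a stalled connection raise instead of hanging."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape(urls, fetch_fn=fetch, report_every=100):
    """Fetch each URL, print the real error on failure, and report the request rate."""
    ok, failed = 0, 0
    start = time.perf_counter()
    for count, url in enumerate(urls, start=1):
        try:
            fetch_fn(url)
            ok += 1
        except Exception as e:  # show what actually went wrong
            failed += 1
            print("%s failed: %r" % (url, e))
        if count % report_every == 0:
            elapsed = time.perf_counter() - start
            rate = count / elapsed if elapsed > 0 else float("inf")
            print("%d requests, %.2f requests/sec" % (count, rate))
    return ok, failed
```

If the printed rate drops steadily while the errors stay rare, throttling by the site is a plausible explanation; if timeouts start to appear, the earlier "stop" was probably a request that never returned.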