Multithreading: How do I add multithreading to my Python crawler?
Tags: multithreading, python-2.7, beautifulsoup, web-crawler, urllib2

I have written a spider/crawler using BeautifulSoup and urllib2. It parses all links up to 2 levels deep and collects all the HTML pages into a list. I tried to make it multithreaded to speed up the spidering process a bit, but I don't know where to start. Here is the code:
#!/usr/bin/python
from bs4 import BeautifulSoup
import time
import urllib2
import sys

masterList = []
masterList1 = []
htmlList = []
url = "http://www.securitytube.net"
dictList = []

def spidy(url):
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        for links in soup.findAll('a', href=True):
            ele = links['href']
            if ".html" in ele and ("http://" in ele or "https://" in ele):
                htmlList.append(ele)
                print ele
            else:
                masterList.append(ele)
        # Drop mailto: links; iterate over a copy so removal is safe
        for ele in masterList[:]:
            if 'mailto:' in ele:
                masterList.remove(ele)
    except urllib2.URLError:
        print "url %s is not accessible ... Moving on to the next URL .." % (url)

def level():
    # Snapshot the de-duplicated list; removing items from the list
    # being iterated over would silently skip entries
    for url1 in list(set(masterList)):
        print "Running Spidy on : %s" % (url1)
        print "\n########################################################\n"
        spidy(url1)
        print "\n########################################################\n"
        if url1 in masterList:
            masterList.remove(url1)

def main():
    spidy("http://www.securitytube.net")
    level()
    level()
    print "\n\n\n\n\n********************************************************************"
    print htmlList

if __name__ == "__main__":
    main()
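One common place to start is a fixed pool of worker threads pulling URLs from a shared queue, with a lock guarding the shared state. Below is a minimal sketch of that pattern (shown in Python 3 syntax for clarity; on Python 2.7 the module is `Queue` rather than `queue`). The `fetch` callable is a hypothetical stand-in for the fetching-and-parsing done in `spidy`, passed in so the crawl logic stays self-contained:

```python
import threading
from queue import Queue

def crawl_parallel(start_urls, fetch, num_workers=4):
    """Fetch URLs with a pool of worker threads.

    `fetch` is any callable taking a URL and returning a list of
    discovered links (stand-in for urlopen + BeautifulSoup parsing).
    """
    q = Queue()
    seen = set(start_urls)
    results = []
    lock = threading.Lock()          # guards `seen` and `results`

    for u in start_urls:
        q.put(u)

    def worker():
        while True:
            url = q.get()
            try:
                links = fetch(url)
            except Exception:
                links = []           # unreachable URL: move on, keep the worker alive
            with lock:
                results.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        q.put(link)  # enqueue children before marking parent done
            q.task_done()

    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.daemon = True              # let the program exit once main returns
        t.start()

    q.join()                         # block until every queued URL is processed
    return results
```

With `fetch` wrapping `urllib2.urlopen` plus the BeautifulSoup parsing, this replaces the sequential loop in `level()`. The "up to 2 levels" limit would need `(url, depth)` tuples on the queue instead of bare URLs.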
Have you heard of Scrapy? It is very easy to create a simple spider with it, and people have been developing it for years. It is not exactly multithreaded, but it uses Twisted under the hood, so it is fully asynchronous and event-based.

I will look into it. Thanks.