Multithreading: how do I add multithreading to my Python spider?

Tags: multithreading, python-2.7, beautifulsoup, web-crawler, urllib2

I have written a spider/crawler using BeautifulSoup and urllib2. It parses all links up to two levels deep and collects all the HTML pages in a list. I tried to make it multithreaded to speed up the spidering a bit, but I don't know where to start.

Below is the code:

#!/usr/bin/python
from bs4 import BeautifulSoup
import time
import urllib2
import sys

masterList = []
masterList1 = []
htmlList = []
url = "http://www.securitytube.net"
dictList = []

def spidy(url):
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        if soup:
            for links in soup.findAll('a', href=True):
                ele = links['href']
                # collect .html pages, queue everything else for the next level
                if ".html" in ele and ("http://" in ele or "https://" in ele):
                    htmlList.append(ele)
                    print ele
                else:
                    masterList.append(ele)
        # drop mailto: links; rebuild the list instead of removing items
        # from it while iterating over it, which skips elements
        masterList[:] = [ele for ele in masterList if 'mailto:' not in ele]
    except:
        print "url %s is not accessible ... Moving on to the next URL .." % (url)

def level():
    # deduplicate before spidering the next level
    masterList1 = list(set(masterList))
    for url1 in masterList1:
        print "Running Spidy on : %s" % (url1)
        print "\n########################################################\n"
        spidy(url1)
        print "\n########################################################\n"
        masterList.remove(url1)

def main():
    spidy("http://www.securitytube.net")
    level()
    level()
    print "\n\n\n\n\n********************************************************************"
    print htmlList

if __name__ == "__main__":
    main()
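One common way to thread a crawler like this is a fixed pool of worker threads pulling URLs from a shared queue. Below is a minimal sketch (not from the original thread) using the standard library's threading and Queue modules, written to run on both Python 2.7 and 3; `fetch` is a hypothetical stand-in for the urllib2/BeautifulSoup work done in `spidy`:

```python
import threading
try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2.7

def crawl_parallel(urls, fetch, num_workers=4):
    # Distribute URLs over a fixed pool of worker threads.
    # `fetch` is whatever downloads and parses one page.
    tasks = queue.Queue()
    for u in urls:
        tasks.put(u)

    results = []
    lock = threading.Lock()  # protects the shared results list

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, thread exits
            page = fetch(url)
            with lock:
                results.append((url, page))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The lock matters because the original code appends to module-level lists (`htmlList`, `masterList`) from every call; with several threads doing that concurrently, all shared mutation needs to be serialized.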
Have you heard of Scrapy?

It is very easy to create a simple spider with it, and people have been developing it for years. It is not exactly multithreaded, but it uses Twisted under the hood, so it is fully asynchronous and event-based.
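To illustrate the asynchronous, event-driven model described above: the answer refers to Twisted, but the same idea can be sketched with Python 3's standard asyncio; `fetch` here is a hypothetical stand-in for a real non-blocking HTTP request:

```python
import asyncio

async def fetch(url):
    # stand-in for a real non-blocking HTTP request
    await asyncio.sleep(0)
    return url.upper()

async def crawl(urls):
    # every fetch is scheduled at once; the single-threaded event loop
    # interleaves them instead of spawning one OS thread per request
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(crawl(["a", "b", "c"]))
```

This is why such a framework can feel "multithreaded" without threads: while one request waits on the network, the event loop services the others.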


I will look into this. Thanks.