Multithreading: How do I add multithreading to my Python crawler?
Tags: multithreading, python-2.7, beautifulsoup, web-crawler, urllib2

I have written a spider/crawler using BeautifulSoup and urllib2. It parses all links up to 2 levels deep and collects all the HTML pages into a list. I tried to make it multithreaded to speed up the spidering process a bit, but I don't know where to start. Here is the code:
#!/usr/bin/python
from bs4 import BeautifulSoup
import time
import urllib2
import sys

masterList = []
masterList1 = []
htmlList = []
url = "http://www.securitytube.net"
dictList = []

def spidy(url):
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        for links in soup.findAll('a', href=True):
            ele = links['href']
            if ".html" in ele and ("http://" in ele or "https://" in ele):
                htmlList.append(ele)
                print ele
            else:
                masterList.append(ele)
        # Drop mailto: links; iterate over a copy so removal is safe
        for ele in masterList[:]:
            if 'mailto:' in ele:
                masterList.remove(ele)
    except urllib2.URLError:
        print "url %s is not accessible ... Moving on to the next URL .." % (url)

def level():
    # Snapshot the de-duplicated list; removing items from the list
    # being iterated over would silently skip entries
    for url1 in list(set(masterList)):
        print "Running Spidy on : %s" % (url1)
        print "\n########################################################\n"
        spidy(url1)
        print "\n########################################################\n"
        if url1 in masterList:
            masterList.remove(url1)

def main():
    spidy("http://www.securitytube.net")
    level()
    level()
    print "\n\n\n\n\n********************************************************************"
    print htmlList

if __name__ == "__main__":
    main()
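One common place to start is a fixed pool of worker threads pulling URLs from a shared queue, with a lock guarding the shared state. Below is a minimal sketch of that pattern (shown in Python 3 syntax for clarity; on Python 2.7 the module is `Queue` rather than `queue`). The `fetch` callable is a hypothetical stand-in for the fetching-and-parsing done in `spidy`, passed in so the crawl logic stays self-contained:

```python
import threading
from queue import Queue

def crawl_parallel(start_urls, fetch, num_workers=4):
    """Fetch URLs with a pool of worker threads.

    `fetch` is any callable taking a URL and returning a list of
    discovered links (stand-in for urlopen + BeautifulSoup parsing).
    """
    q = Queue()
    seen = set(start_urls)
    results = []
    lock = threading.Lock()          # guards `seen` and `results`

    for u in start_urls:
        q.put(u)

    def worker():
        while True:
            url = q.get()
            try:
                links = fetch(url)
            except Exception:
                links = []           # unreachable URL: move on, keep the worker alive
            with lock:
                results.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        q.put(link)  # enqueue children before marking parent done
            q.task_done()

    for _ in range(num_workers):
        t = threading.Thread(target=worker)
        t.daemon = True              # let the program exit once main returns
        t.start()

    q.join()                         # block until every queued URL is processed
    return results
```

With `fetch` wrapping `urllib2.urlopen` plus the BeautifulSoup parsing, this replaces the sequential loop in `level()`. The "up to 2 levels" limit would need `(url, depth)` tuples on the queue instead of bare URLs.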
Have you heard of Scrapy? It is very easy to create a simple spider with it, and people have been developing it for years. It is not exactly multithreaded, but it uses Twisted under the hood, so it is fully asynchronous and event-based.

I will look into it. Thanks.