Python web crawler threading advice - single list scheduling

python multithreading list web-crawler

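I want to add multithreading to my web crawler, but I can see that the way the spider schedules links may be incompatible with multithreading. The crawler will only ever be active on a handful of news websites, but rather than starting a new thread for each domain, I would prefer to have multiple threads open on the same domain. My crawling code operates through the following function: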

def crawl_links():
    links_to_crawl.append(domain[0])
    while len(links_to_crawl) > 0:
        link = links_to_crawl[0]
        if link in crawled_links or link in ignored_links:
            del links_to_crawl[0]
        else:
            print '\n', link
            try:
                html = get_html(link)
                GetLinks(html)
                SaveFile(html)
                crawled_links.append(links_to_crawl.pop(0))
            except (ValueError, urllib2.URLError, Timeout.Timeout, httplib.IncompleteRead):
                ignored_links.append(links_to_crawl.pop(0))
    print 'Spider finished!'
    print 'Ignored links:\n', ignored_links
    print 'Crawled links:\n', crawled_links
    print 'Relative links\n', relative_links

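If my understanding of how threading works is correct, then if I simply opened multiple threads on this function, they would all crawl the same links (possibly many times), or they would clash in some way. Without necessarily going into specifics, how would you suggest restructuring the scheduling so that it is compatible with multiple threads running at the same time?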

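I have given this some thought, and the only workaround I can come up with is to have the GetLinks() class append links to multiple lists, with a separate list for each thread... but this seems like a rather clumsy workaround.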

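Here is a general scheme that I use when running multi-threaded applications in Python.

The scheme takes a table of input arguments and executes one thread per row, in parallel.

Each of these threads takes one row and executes one thread per item in that row, sequentially.

Each item contains a fixed number of arguments, which are passed to the executed thread.

Input example: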

table = \
[
    [[12,32,34],[11,20,14],[33,67,56],[10,20,45]],
    [[21,21,67],[44,34,74],[23,12,54],[31,23,13]],
    [[31,67,56],[34,22,67],[87,74,52],[87,74,52]],
]

import threading
from MyClass import MyClass # This is for you to implement

def RunThreads(outFileName,errFileName):
    # Create a shared object for saving the output of different threads
    outFile = CriticalSection(outFileName)
    # Create a shared object for saving the errors of different threads
    errFile = CriticalSection(errFileName)
    # Run in parallel one thread for each row in the input table
    RunParallelThreads(outFile,errFile)

def RunParallelThreads(outFile,errFile):
    # Create all the parallel threads
    threads = [threading.Thread(target=RunSequentialThreads,args=(outFile,errFile,row)) for row in table]
    # Start all the parallel threads
    for thread in threads: thread.start()
    # Wait for all the parallel threads to complete
    for thread in threads: thread.join()

def RunSequentialThreads(outFile,errFile,row):
    myObject = MyClass()
    for item in row:
        # Create a thread with the arguments given in the current item
        thread = threading.Thread(target=myObject.Run,args=(outFile,errFile,item[0],item[1],item[2]))
        # Start the thread
        thread.start()
        # Wait for the thread to complete, but only up to 600 seconds
        thread.join(600)
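        # Give up on the thread if it hasn't completed up to this point
        # (_Thread__stop() is a private Python 2 API; it marks the thread as
        # stopped and is not guaranteed to halt the code it is running)
        if thread.isAlive():
            thread._Thread__stop()
            errFile.write('Timeout on arguments: '+str(item[0])+' '+str(item[1])+' '+str(item[2])+'\n')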

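In this example, we will have 3 threads running in parallel, each one executing 4 threads sequentially.

To keep the threads balanced, it is recommended to have the same number of items in each row.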

The threading scheme is the code listed right after the input table above.

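The CriticalSection class used in the scheme implements an object that can be safely shared between different threads running in parallel. It provides a single interface method called write, which allows any thread to update the shared object safely (i.e., no other thread can update it in the middle of the operation).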
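The listing of that class is not reproduced on this page. As a rough sketch only, assuming the shared object is a file on disk protected by a threading.Lock, it might look something like this (the attribute names mutex and fileName are hypothetical, and this is not necessarily how the answer's author wrote it):

```python
import threading


class CriticalSection(object):
    """Hypothetical sketch: a file-backed object guarded by a lock."""

    def __init__(self, fileName):
        # Only one thread at a time can hold this lock
        self.mutex = threading.Lock()
        self.fileName = fileName
        # Assumption: start every run from an empty file
        open(fileName, 'w').close()

    def write(self, data):
        # 'with' acquires the lock and releases it even if the write fails
        with self.mutex:
            with open(self.fileName, 'a') as fileDesc:
                fileDesc.write(data)
```

Because every update goes through write, two threads can never interleave their output inside a single call, although the order of whole lines still depends on thread timing.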

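The above scheme should allow you to control the levels of "parallelism" and "sequentiality" in your application.

For example, you can use a single row for all the items and have your application run in a fully sequential manner.

Conversely, you can place each item in a separate row and have your application run in a fully parallel manner.

And of course, you can choose any combination of the two.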
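To make the three layouts concrete, here is a small sketch of how the same work items could be arranged into a table. The flat items list and the split_into_rows helper are hypothetical names introduced only for this illustration:

```python
# Hypothetical flat list of work items, three arguments each
items = [[12, 32, 34], [11, 20, 14], [33, 67, 56], [10, 20, 45]]

# One row holding every item: a single worker runs them one after another
sequential_table = [items]

# One row per item: every item gets its own parallel worker
parallel_table = [[item] for item in items]


def split_into_rows(items, row_count):
    """Deal the items round-robin into row_count rows of near-equal length."""
    return [items[i::row_count] for i in range(row_count)]


# Two rows of two items: two parallel workers, each handling two items in order
mixed_table = split_into_rows(items, 2)
```

Dealing the items round-robin keeps the rows close to the same length, which matches the advice about keeping the threads balanced.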

Note:

In MyClass, you need to implement the method Run, which takes the outFile and errFile objects as well as the arguments that you have defined for each thread.
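As an illustration of the shape Run is expected to have, here is a minimal sketch. The parameter names a, b and c and the summing of the three arguments are placeholders invented for this example; the only things taken from the scheme are the method name and the fact that outFile and errFile expose a write method:

```python
class MyClass(object):
    """Hypothetical worker; replace the body of Run with the real work."""

    def Run(self, outFile, errFile, a, b, c):
        try:
            # Placeholder computation (an assumption, not part of the scheme)
            result = a + b + c
            outFile.write('%s %s %s -> %s\n' % (a, b, c, result))
        except Exception as error:
            # Report the failure through the shared error object
            # instead of letting the exception end the thread silently
            errFile.write('Error on arguments %s %s %s: %s\n' % (a, b, c, error))
```

In a crawler, the three arguments could be replaced by whatever a single unit of work needs, for example a URL to fetch.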

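Threads share the same memory space. This is both a curse and a feature: it lets you delegate different URLs to different threads (either all threads pull from one global list, or each thread has its own local list, which is hard to manage), but it also means you must make sure that different threads do not corrupt any data structure they share (almost no code is thread-safe unless it was explicitly written to be). A simple way to make sure things work is to use a "lock" or "mutex", which allows only one thread at a time into a critical section of the code (such as access to the list).

Thanks. I have been reading and re-reading this code, and although it is easy to follow and I get some of it, some of the concepts here are quite complicated for a beginner. If possible, could you summarize for a newbie how exactly this works?

Yes. First, I will change the input format: instead of an input file with rows of arguments, I will put the entire input in a matrix. Second, I will explain some multithreading issues that you may not be aware of. Please see the revised answer (in a few minutes)...

The answer has been revised; please take a look and see whether it makes more sense to you at this point.

Thank you very much, it makes a lot of sense and has helped a lot.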
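The lock-based approach from the second answer, applied to the question's single global list of links, could be sketched roughly as follows. Everything here is hypothetical (the LinkFrontier class, its method names, and the fetch_links callable standing in for the question's get_html and GetLinks steps); it is one possible arrangement, not the posters' code:

```python
import threading


class LinkFrontier(object):
    """Hypothetical shared to-do list of links, guarded by a single lock."""

    def __init__(self, start_links):
        self.lock = threading.Lock()
        self.pending = list(start_links)
        self.seen = set(start_links)   # every link ever queued, to avoid repeats
        self.done = []

    def add(self, link):
        with self.lock:
            if link not in self.seen:
                self.seen.add(link)
                self.pending.append(link)

    def take(self):
        """Return the next link, or None if nothing is pending right now."""
        with self.lock:
            return self.pending.pop(0) if self.pending else None

    def mark_done(self, link):
        with self.lock:
            self.done.append(link)


def worker(frontier, fetch_links):
    # fetch_links is an assumed callable: link -> iterable of links on that page
    while True:
        link = frontier.take()
        if link is None:
            break   # simplification: a real crawler would wait for busy workers
        for found in fetch_links(link):
            frontier.add(found)
        frontier.mark_done(link)
```

Because the membership check and the append happen under the same lock, no link is handed to two threads. The standard library's queue module (Queue in Python 2) offers a ready-made thread-safe queue that could replace the pending list.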