C#: Producer/consumer for a web crawler using a queue of unknown size

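I need to crawl a parent web page and its child pages, and I am following the producer/consumer concept. I also use 5 threads to enqueue and dequeue the links.

Since the length of the queue is unknown, is there any suggestion on how to end/join all the threads once they have finished processing the queue?

Here is an idea of how I coded it: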

static void Main(string[] args)
{
    //enqueue parent links here
    ...
    //then start crawling via threading
    ...
}

public void Crawl()
{
   //dequeue
   //get child links
   //enqueue child links
}

You can enqueue a dummy token at the end and have the threads exit when they encounter it. For example:

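// shared by all threads: how many threads currently see an empty queue
static int report = 0;

public void Crawl()
{
   bool reported = false; // is this thread currently counted in report?
   while(true)
   {
       if(!(queue.Count == 0))
       {
          if(reported) { Interlocked.Decrement(ref report); reported = false; }
          //dequeue into token
          if(token == "TERMINATION")
          {
             queue.Enqueue(token); // put it back so the other threads see it too
             return;
          }
          else
             //enqueue child links
       }
       else
       {
          if(report == num_threads) // all threads have signaled empty queue
             queue.Enqueue("TERMINATION");
          else if(!reported) // this thread has found the queue empty
          {
             Interlocked.Increment(ref report);
             reported = true;
          }
       }
    }
}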

Of course, I have omitted the locks around the enqueue/dequeue operations.
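
One way to make the "when do I enqueue the termination signal" decision concrete is to count outstanding work. The following is only a sketch, not the answerer's code: it swaps the string token for `BlockingCollection<T>.CompleteAdding()`, and the names `PendingCountCrawler`, `getChildren` and `pending` are hypothetical. It assumes `getChildren` never throws; if it did, the counter would never reach zero.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class PendingCountCrawler
{
    // getChildren is a hypothetical stand-in for "fetch the page and extract its links".
    public static List<string> Crawl(string root,
        Func<string, IEnumerable<string>> getChildren, int numThreads)
    {
        var queue = new BlockingCollection<string>();
        var seen = new ConcurrentDictionary<string, bool>();
        int pending = 0; // URLs enqueued but not yet fully processed

        void Enqueue(string url)
        {
            if (seen.TryAdd(url, true)) // skip links that were already queued
            {
                Interlocked.Increment(ref pending);
                queue.Add(url);
            }
        }

        void Worker()
        {
            // Blocks while the queue is empty; ends after CompleteAdding once drained.
            foreach (string url in queue.GetConsumingEnumerable())
            {
                foreach (string child in getChildren(url))
                    Enqueue(child);

                // Children were counted before this decrement, so reaching zero
                // means nothing is queued and nobody is still producing.
                if (Interlocked.Decrement(ref pending) == 0)
                    queue.CompleteAdding();
            }
        }

        Enqueue(root);
        var threads = Enumerable.Range(0, numThreads)
            .Select(_ => new Thread(Worker)).ToList();
        threads.ForEach(t => t.Start());
        threads.ForEach(t => t.Join()); // returns once every worker has left its loop
        return seen.Keys.OrderBy(u => u, StringComparer.Ordinal).ToList();
    }
}
```

Because a URL's children are added to the count before the URL itself is subtracted, the counter can only hit zero when the crawl is truly finished, which also copes with child links that point back to a parent.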

The threads could signal that they have finished their work, for example by raising an event or invoking a delegate:
static void Main(string[] args)
{
    //enqueue parent links here
    ...
    //then start crawling via threading
    ...
}

public void X()
{
    //block the threads until all of them are here
}

public void Crawl(Action x)
{
    //dequeue
    //get child links
    //enqueue child links
    //call x()
}
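
The signalling part of this idea can be sketched with a `CountdownEvent`. This is an illustration under simplifying assumptions, not the answerer's implementation: the queue is pre-filled and no child links are added, so a worker can treat "queue empty" as "finished". The names `CompletionSignalDemo`, `workerFinished` and `onUrlProcessed` are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class CompletionSignalDemo
{
    // Returns the number of URLs processed once every worker has signalled.
    public static int Run(IEnumerable<string> urls, int numThreads)
    {
        var queue = new ConcurrentQueue<string>(urls);
        var workers = new List<Thread>();
        int processed = 0;

        using (var allDone = new CountdownEvent(numThreads))
        {
            // Plays the role of x(): each worker invokes it exactly once when it stops.
            Action workerFinished = () => allDone.Signal();
            Action onUrlProcessed = () => Interlocked.Increment(ref processed);

            for (int i = 0; i < numThreads; i++)
            {
                var worker = new Thread(() => Crawl(queue, onUrlProcessed, workerFinished));
                workers.Add(worker);
                worker.Start();
            }

            allDone.Wait();                      // blocks until all workers have called x()
            workers.ForEach(w => w.Join());      // let workers exit before the event is disposed
        }
        return processed;
    }

    private static void Crawl(ConcurrentQueue<string> queue, Action onUrlProcessed, Action x)
    {
        while (queue.TryDequeue(out string url))
        {
            // Fetching the page and enqueueing child links would go here.
            onUrlProcessed();
        }
        x(); // signal "this thread is done"
    }
}
```

With a queue that grows while it is being consumed, an empty queue alone does not prove the work is over, so the signal would have to be combined with one of the end-detection schemes discussed in the other answers.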

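
You are done when all threads are idle (i.e. waiting on the queue) and the queue is empty.
A simple way to handle that is to have the threads use a timeout when trying to access the queue, something like BlockingCollection.TryTake. Whenever TryTake times out, the thread updates a field that says how long it has been idle: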
while (!queue.TryTake(out item, 5000, token))
{
    if (token.IsCancellationRequested)
        break;
    // here, update idle counter
}

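
You can then have a timer that runs every 15 seconds or so and checks the idle counters of all threads. If all threads have been idle for some period of time (maybe a minute), the timer can set the cancellation token. That will kill all the threads. Your main program can monitor the cancellation token as well.
By the way, you can do this without BlockingCollection and cancellation. You just have to create your own cancellation signalling mechanism, and if you are using a lock on the queue, you can replace the lock syntax with Monitor.TryEnter and so on.
There are several other ways to handle this, although they would require some major restructuring of your program.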
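
Put together, the idle-timeout approach might look roughly like the sketch below. It is one possible reading of the description, not the answerer's code: a polling loop on the calling thread stands in for the periodic timer, and the names `IdleTimeoutCrawler`, `getChildren`, `takeTimeoutMs` and `idleLimitMs` are hypothetical. The timeouts are parameters so that they can be far shorter than the 5 seconds and one minute mentioned above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class IdleTimeoutCrawler
{
    // getChildren is a hypothetical stand-in for "fetch the page and extract its links".
    public static int Crawl(string root, Func<string, IEnumerable<string>> getChildren,
                            int numThreads, int takeTimeoutMs, int idleLimitMs)
    {
        var queue = new BlockingCollection<string>();
        var seen = new ConcurrentDictionary<string, bool>();
        var idleMs = new int[numThreads]; // how long each worker has gone without work

        using (var cts = new CancellationTokenSource())
        {
            void Worker(int id)
            {
                try
                {
                    while (true)
                    {
                        if (queue.TryTake(out string url, takeTimeoutMs, cts.Token))
                        {
                            Volatile.Write(ref idleMs[id], 0); // got work: not idle
                            foreach (string child in getChildren(url))
                                if (seen.TryAdd(child, true)) queue.Add(child);
                        }
                        else
                        {
                            // Timed out: add the waiting time to this worker's idle counter.
                            Interlocked.Add(ref idleMs[id], takeTimeoutMs);
                        }
                    }
                }
                catch (OperationCanceledException)
                {
                    // TryTake throws once the token is cancelled; that is the exit signal.
                }
            }

            seen.TryAdd(root, true);
            queue.Add(root);
            var threads = Enumerable.Range(0, numThreads)
                .Select(i => new Thread(() => Worker(i))).ToList();
            threads.ForEach(t => t.Start());

            // Polling loop standing in for the timer: wait until every worker
            // has been idle for at least idleLimitMs.
            while (!Enumerable.Range(0, numThreads)
                       .All(i => Volatile.Read(ref idleMs[i]) >= idleLimitMs))
                Thread.Sleep(takeTimeoutMs);

            cts.Cancel();
            threads.ForEach(t => t.Join());
        }
        return seen.Count; // number of distinct URLs visited
    }
}
```

This is a heuristic: the idle limit has to be comfortably longer than the slowest page fetch, because a worker only resets its counter when it takes an item, not when it finishes processing one.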
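
You could consider switching to the Task Parallel Library (TPL) if you are willing. When a task is created with the AttachedToParent option, the child task is linked to its parent so that the parent does not complete until the child has completed.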
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var task = CrawlAsync("http://stackoverflow.com");
        task.Wait();
    }

    static Task CrawlAsync(string url)
    {
        return Task.Factory.StartNew(
            () =>
            {
                string[] children = ExtractChildren(url);
                foreach (string child in children)
                {
                    CrawlAsync(child);
                }
                ProcessUrl(url);
            }, TaskCreationOptions.AttachedToParent);
    }

    static string[] ExtractChildren(string root)
    {
      // Return all child urls here.
      return new string[0]; // placeholder so the sample compiles
    }

    static void ProcessUrl(string url)
    {
      // Process the url here.
    }
}

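
You could remove some of the explicit task creation logic by using Parallel.ForEach.

Comments:

I don't see how this solves the problem. You have to know where the end is before you can enqueue the dummy token.

@Jim Mischel: There must be a way to know, such as when there are no more child links to process.

My point is that his original question is essentially, "How do I know that I'm at the end?" Your answer is essentially, "When you're at the end, enqueue the end token."

Hmm, determining when the crawler knows there are no more links to process is the main bottleneck. Would setting a timer help?

@user611333 Assuming you are crawling a finite number of pages, you should eventually be able to find the end. If you want to stop crawling before the queue is exhausted, then the question you asked is not relevant to that case.

Possibly, because a child link can also be a parent link, so a thread cannot know for certain whether its work is over.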
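
For the Parallel.ForEach variation mentioned in the TPL answer, a minimal sketch could look like this. It is an assumption about what that variation might be, not code from the thread; `ParallelCrawler`, `Visit` and the `extractChildren` delegate are hypothetical names, and the `seen` set is added here so that links pointing back to an already visited page do not recurse forever.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelCrawler
{
    // extractChildren is a hypothetical stand-in for "fetch the page and extract its links".
    public static int Crawl(string root, Func<string, IEnumerable<string>> extractChildren)
    {
        var seen = new ConcurrentDictionary<string, bool>();
        seen.TryAdd(root, true);
        Visit(root, extractChildren, seen);
        return seen.Count; // number of distinct URLs visited
    }

    private static void Visit(string url,
        Func<string, IEnumerable<string>> extractChildren,
        ConcurrentDictionary<string, bool> seen)
    {
        // Parallel.ForEach returns only after every iteration has finished,
        // including the nested calls, so no explicit join or queue is needed.
        Parallel.ForEach(extractChildren(url), child =>
        {
            if (seen.TryAdd(child, true)) // only the first thread to see a URL visits it
                Visit(child, extractChildren, seen);
        });
        // Processing of the page itself (ProcessUrl in the answer) would go here.
    }
}
```

When `Crawl` returns, the whole tree has been visited. The recursion depth equals the depth of the link graph, so a very deep site may call for one of the queue-based approaches instead.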