C#: Downloading web pages in bulk


c# web-crawler


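My application requires me to download a large number of web pages into memory for further parsing and processing. What is the fastest way to do this? My current method (shown below) seems too slow and occasionally results in timeouts.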

for (int i = 1; i<=pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();

    try
    {    
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch (Exception)
    {
    }
}

You should use parallel programming for this.
There are many ways to achieve what you want; the simplest would be something like this:
var pageList = new List<string>();

for (int i = 1; i <= pages; i++)
{
  pageList.Add(baseurl + "&page=" + i.ToString());
}


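// pageList is a list of URLs
Parallel.ForEach<string>(pageList, (page) =>
{
    try
    {
        // using disposes the client even if the download throws
        using (WebClient client = new WebClient())
        {
            var pagesource = client.DownloadString(page);
            // List<T> is not thread-safe, so guard the shared list
            lock (sourcelist)
            {
                sourcelist.Add(pagesource);
            }
        }
    }
    catch (Exception)
    {
        // Failed downloads are silently skipped, as in the question's loop
    }
});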

I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;

namespace WebClientApp
{
class MainClassApp
{
    private static int requests = 0;
    private static object requests_lock = new object();

    public static void Main() {

        List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org"};
        foreach(var url in urls) {
            ThreadPool.QueueUserWorkItem(GetUrl, url);
        }

        int cur_req = 0;

        while(cur_req<urls.Count) {

            lock(requests_lock) {
                cur_req = requests; 
            }

            Thread.Sleep(1000);
        }

        Console.WriteLine("Done");
    }

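    private static void GetUrl(Object the_url) {

        string url = (string)the_url;

        try {
            string html;

            // Dispose the client, stream and reader once the page has been read
            using (WebClient client = new WebClient())
            using (Stream data = client.OpenRead(url))
            using (StreamReader reader = new StreamReader(data)) {
                html = reader.ReadToEnd();
            }

            // Do something with html, e.g. add it to sourcelist under a lock
            Console.WriteLine(html);
        }
        catch (WebException ex) {
            Console.WriteLine("Failed to download " + url + ": " + ex.Message);
        }
        finally {
            lock(requests_lock) {
                // Count the request even on failure so the wait loop in Main can finish
                requests++;
            }
        }
    }
}
}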

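How you approach this will depend very much on how many pages you want to download and how many sites you are referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it will take a lot longer than downloading 1,000 pages spread across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you will probably end up getting blocked.
So you have to implement a kind of "politeness policy" that inserts a delay between multiple requests to a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect it. If they don't want you accessing more than one page per minute, then that is as fast as you should crawl. If there is no crawl-delay, you should base your delay on how long the site takes to respond. For example, if you can download a page from the site in 500 milliseconds, set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay at 60 seconds (unless crawl-delay is longer), and I would recommend a minimum delay of 5 to 10 seconds.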
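The delay rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the value of X is not given in the answer (the sketch assumes a multiplier of 10, so a 500 ms response yields a 5 second delay), and a crawl-delay value, when present, is simply used as-is.

```csharp
using System;

public static class Politeness
{
    // Bounds taken from the advice above: at least 5 s, at most 60 s.
    private static readonly TimeSpan MinDelay = TimeSpan.FromSeconds(5);
    private static readonly TimeSpan MaxDelay = TimeSpan.FromSeconds(60);

    // responseTime: how long the site took to serve the last page.
    // crawlDelay:   the crawl-delay from robots.txt, or null if there is none.
    // multiplier:   assumed scale factor; it plays the role of "X" in the text.
    public static TimeSpan ComputeDelay(TimeSpan responseTime, TimeSpan? crawlDelay, double multiplier = 10.0)
    {
        // An explicit crawl-delay wins, even when it exceeds the 60 s cap.
        if (crawlDelay.HasValue)
        {
            return crawlDelay.Value;
        }

        // Otherwise scale the delay with the observed response time...
        TimeSpan scaled = TimeSpan.FromMilliseconds(responseTime.TotalMilliseconds * multiplier);

        // ...and clamp it to the [MinDelay, MaxDelay] range.
        if (scaled < MinDelay) return MinDelay;
        if (scaled > MaxDelay) return MaxDelay;
        return scaled;
    }
}
```

A crawler would call ComputeDelay after each response from a site and wait that long before sending its next request to the same site, while requests to other sites proceed independently.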
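I would not recommend using Parallel.ForEach for this. My testing has shown that it does not do a good job: sometimes it over-taxes the connection, and often it does not allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like: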
// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances

// now process urls
foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();
    worker.DownloadStringAsync(url, ...);
}

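Comparing the two patterns: the first allocates a single WebClient instance that is reused for all requests; the second allocates a new WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time will hurt performance. Believe me, I have run into this. You are better off allocating just 10 or 20 WebClient instances (as many as you need for concurrent processing) rather than one per request.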
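To make the fixed-pool idea more concrete, here is a minimal sketch. It assumes that each client is returned to the queue from its DownloadStringCompleted handler, which the snippet above leaves elided; the class name, the poolSize parameter and the decision to skip failed downloads are all hypothetical. It also assumes a console-style application with no synchronization context, since blocking on done.Wait() from a UI thread could deadlock.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading;

public static class PooledDownloader
{
    public static List<string> DownloadAll(IReadOnlyList<Uri> urls, int poolSize = 10)
    {
        var results = new ConcurrentBag<string>();
        var clientQueue = new BlockingCollection<WebClient>();

        using (var done = new CountdownEvent(urls.Count))
        {
            // Create the fixed pool of clients once, up front.
            for (int i = 0; i < poolSize; i++)
            {
                var client = new WebClient();
                client.DownloadStringCompleted += (sender, e) =>
                {
                    // Failed or cancelled downloads are skipped in this sketch.
                    if (e.Error == null && !e.Cancelled)
                    {
                        results.Add(e.Result);
                    }
                    // Hand the client back so the next URL can use it.
                    clientQueue.Add((WebClient)sender);
                    done.Signal();
                };
                clientQueue.Add(client);
            }

            foreach (Uri url in urls)
            {
                // Take() blocks until a client is idle, which caps concurrency at poolSize.
                WebClient worker = clientQueue.Take();
                worker.DownloadStringAsync(url);
            }

            // Wait until every request has completed or failed.
            done.Wait();
        }

        // All clients are back in the queue at this point.
        foreach (WebClient client in clientQueue)
        {
            client.Dispose();
        }
        return new List<string>(results);
    }
}
```

Because results arrive in completion order, the returned list is not aligned with the input order; pairing each page with its URL would need a small change to what the handler stores.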
In addition, here is a slightly cleaner "version" of the Parallel.ForEach approach:
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();

Parallel.ForEach(pages, x =>
{
    using(var client = new WebClient())
    {
        var pagesource = client.DownloadString(x);
        sources.Add(pagesource);
    }
});


Another approach, using async:
static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);

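    foreach (var p in pages)
    {
        // No using block here: it would dispose the client before the
        // asynchronous download has finished.
        var wc = new WebClient();
        wc.DownloadStringCompleted += (x, e) =>
        {
            try
            {
                // e.Result throws if the download failed, so check first
                if (e.Error == null && !e.Cancelled)
                    sources.Add(e.Result);
            }
            finally
            {
                wc.Dispose();
                // Always signal, otherwise latch.Wait() would never return
                latch.Signal();
            }
        };

        wc.DownloadStringAsync(new Uri(p));
    }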

    latch.Wait();

    return sources;
}

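While the other answers are perfectly valid, all of them (at the time of this writing) neglect something very important: calls to the web are I/O-bound, and having a thread wait on an operation like this strains system resources.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) together with the Task Parallel Library's ability to wrap event-based asynchronous operations.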
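First, you would get the URLs that you want to download:
// pages is the page count from the question
IEnumerable<Uri> urls = Enumerable.Range(1, pages).Select(i => new Uri(baseurl + 
    "&page=" + i.ToString(CultureInfo.InvariantCulture)));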

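Then, for each URL, you would use a WebClient instance inside a task to obtain a pair of the URL and its content (the complete listing that builds materializedTasks appears further below). Once the tasks have finished, cycle through the results: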
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
    // pair.Item1 will contain the Uri.
    // pair.Item2 will contain the content.
}


Note that this code has one caveat: it has no error handling.
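One possible way to add error handling is sketched below, assuming that failed downloads should be reported and skipped rather than abort the whole batch. The helper name CollectSuccessful and its onError callback are hypothetical and not part of the original answer; the sketch relies on Task.WaitAll throwing an AggregateException when any task faults or is cancelled.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class TaskResults
{
    // Waits for all tasks, returns the results of those that succeeded
    // (in input order) and reports each failure through onError.
    public static List<T> CollectSuccessful<T>(Task<T>[] tasks, Action<Exception> onError)
    {
        try
        {
            Task.WaitAll(tasks);
        }
        catch (AggregateException)
        {
            // At least one task faulted or was cancelled; each task is inspected below.
        }

        var succeeded = new List<T>();
        foreach (Task<T> task in tasks)
        {
            if (task.Status == TaskStatus.RanToCompletion)
            {
                succeeded.Add(task.Result);
            }
            else if (task.IsFaulted)
            {
                // Reading task.Exception also marks the failure as observed.
                onError(task.Exception.GetBaseException());
            }
            // Cancelled tasks are skipped without a report in this sketch.
        }
        return succeeded;
    }
}
```

With the task array from this answer, the call would look like CollectSuccessful(materializedTasks, ex => Console.WriteLine(ex.Message)), and the foreach over the pairs would then iterate the returned list instead of reading t.Result directly.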
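If you want even more throughput, then instead of waiting for the entire list to finish, you can process the content of each page as soon as it has been downloaded. A Task is meant to be used like a pipe: when one unit of work completes, let it continue to the next rather than waiting for all items to finish (provided the work can be done asynchronously).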
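As a rough illustration of that pipe-like use of tasks, the sketch below chains a processing step onto each download so that a page is handled as soon as it arrives. The names DownloadPipeline, RunAsync, download and process are hypothetical; the download step is passed in as a delegate so the sketch does not tie itself to one HTTP API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DownloadPipeline
{
    // For every URL: download, then immediately process that page's content.
    // The returned task completes when every per-URL pipeline has finished,
    // and its results are in the same order as the input URLs.
    public static Task<TResult[]> RunAsync<TResult>(
        IEnumerable<Uri> urls,
        Func<Uri, Task<string>> download,
        Func<Uri, string, TResult> process)
    {
        IEnumerable<Task<TResult>> pipelines = urls.Select(async url =>
        {
            // No thread is blocked while the download is in flight.
            string content = await download(url).ConfigureAwait(false);

            // Runs as soon as this page is available, independent of the others.
            return process(url, content);
        });

        // WhenAll enumerates the sequence once, which starts every pipeline.
        return Task.WhenAll(pipelines);
    }
}
```

For real downloads, the download delegate could wrap WebClient.DownloadStringTaskAsync inside a using block; this sketch does not limit how many downloads run at once, so a politeness or throttling policy would still have to be layered on top.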
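Why not use a web crawling framework? It can handle all of this for you (multithreading, HTTP requests, parsing links, scheduling, politeness, and so on).
Abot handles all of this for you and is written in C#.
I use an active thread count and an arbit…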
// Full listing for the task-based approach: one task per URL, no blocked threads.
// Requires: using System; using System.Collections.Generic; using System.Linq;
//           using System.Net; using System.Threading.Tasks;
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
    // Create the task completion source.
    var tcs = new TaskCompletionSource<Tuple<Uri, string>>();

    // The web client.
    var wc = new WebClient();

    // Attach to the DownloadStringCompleted event.
    wc.DownloadStringCompleted += (s, e) => {
        // Dispose of the client when done.
        using (wc)
        {
            // If there is an error, set it.
            if (e.Error != null) 
            {
                tcs.SetException(e.Error);
            }
            // Otherwise, set cancelled if cancelled.
            else if (e.Cancelled) 
            {
                tcs.SetCanceled();
            }
            else 
            {
                // Set the result.
                tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
            }
        }
    };

    // Start the process asynchronously, don't burn a thread.
    wc.DownloadStringAsync(url);

    // Return the task.
    return tcs.Task;
});

// Materialize the tasks. Enumerating the query once is what starts every download.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();

// Wait for all to complete.
Task.WaitAll(materializedTasks);

// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
    // pair.Item1 will contain the Uri.
    // pair.Item2 will contain the content.
}

// Throttling with a counter of active downloads.
private static int activeThreads = 0;

public static void RecordData()
{
  var groupSize = 10; // URLs started per batch
  var source = db.ListOfUrls; // Thousands of URLs
  // Round up so the final, partial group is not skipped
  var iterations = (source.Length + groupSize - 1) / groupSize;
  for (int i = 0; i < iterations; i++)
  {
    var subList = source.Skip(groupSize * i).Take(groupSize);
    // RecordUri is async, so ForEach returns as soon as the downloads have started
    Parallel.ForEach(subList, (item) => { RecordUri(item); });
    // Wait here before processing further data, to avoid overload
    while (Volatile.Read(ref activeThreads) > 30) Thread.Sleep(100);
  }
}

private static async Task RecordUri(Uri uri)
{
   Interlocked.Increment(ref activeThreads);
   try
   {
      using (WebClient wc = new WebClient())
      {
         var jsonData = await wc.DownloadStringTaskAsync(uri);
         var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
         // RecordData(RootObject) is an overload that is not shown in this post
         RecordData(root);
      }
   }
   finally
   {
      // Decrement even when the download or the parsing fails
      Interlocked.Decrement(ref activeThreads);
   }
}