
C#: Crawler project

c# asp.net


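Would it be possible to get easy-to-follow code examples for the following:

Use a browser control to launch a request to a target website

Capture the response from the target website

Convert the response into a DOM object

Iterate through the DOM object and capture things like "FirstName", "LastName", etc. if they are part of the response

Thanks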

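You could take a look at an HTML parsing library such as SgmlReader. Here is an example that uses SgmlReader to select all nodes in the DOM that contain some text: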

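using System;
using System.Xml;
using Sgml;

class Program
{
    static void Main()
    {
        using (var reader = new SgmlReader())
        {
            // SgmlReader fetches the page and exposes it as an XML reader.
            reader.Href = "http://www.microsoft.com";
            var doc = new XmlDocument();
            doc.Load(reader);

            // Select every element whose first text node contains "Products".
            var nodes = doc.SelectNodes("//*[contains(text(), 'Products')]");
            foreach (XmlNode node in nodes)
            {
                Console.WriteLine(node.OuterXml);
            }
        }
    }
}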

Here is code that uses a WebRequest object to retrieve the data and capture the response as a stream:

    public static Stream GetExternalData( string url, string postData, int timeout )
    {
        ServicePointManager.ServerCertificateValidationCallback += delegate( object sender,
                                                                                X509Certificate certificate,
                                                                                X509Chain chain,
                                                                                SslPolicyErrors sslPolicyErrors )
        {
            // Accept the certificate only when it validated cleanly. Return true
            // unconditionally only if you trust the callee implicitly.
            return sslPolicyErrors == SslPolicyErrors.None;
        };

        WebRequest request = null;
        HttpWebResponse response = null;

        try
        {
            request = WebRequest.Create( url );
            request.Timeout = timeout; // force a quick timeout

            if( postData != null )
            {
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = postData.Length;

                using( StreamWriter requestStream = new StreamWriter( request.GetRequestStream(), System.Text.Encoding.ASCII ) )
                {
                    requestStream.Write( postData );
                    requestStream.Close();
                }
            }

            response = (HttpWebResponse)request.GetResponse();
        }
        catch( WebException ex )
        {
            Log.LogException( ex );
        }
        finally
        {
            request = null;
        }

        if( response == null || response.StatusCode != HttpStatusCode.OK )
        {
            if( response != null )
            {
                response.Close();
                response = null;
            }

            return null;
        }

        return response.GetResponseStream();
    }

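To manage the response I use a custom XHTML parser, but it is thousands of lines of code. There are several publicly available parsers (see Darin's answer).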
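For the last two steps of the question (turning the response into a DOM and picking out fields), a minimal sketch might look like the following. It assumes the response stream is already well-formed XML with no DOCTYPE, for example after passing through a parser such as SgmlReader. The `FieldExtractor` class, its `Extract` method, and the sample markup are hypothetical names introduced for illustration, and it assumes the fields of interest are `<input>` elements identified by their `name` attribute.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class FieldExtractor
{
    // Loads a well-formed XHTML stream into a DOM and returns the values of
    // the <input> elements whose name attribute matches one of fieldNames.
    public static Dictionary<string, string> Extract(Stream xhtml, params string[] fieldNames)
    {
        var doc = new XmlDocument();
        doc.Load(xhtml);

        var wanted = new HashSet<string>(fieldNames);
        var found = new Dictionary<string, string>();

        // local-name() sidesteps the XHTML namespace if the page declares one.
        foreach (XmlNode node in doc.SelectNodes("//*[local-name()='input'][@name]"))
        {
            string name = node.Attributes["name"].Value;
            if (wanted.Contains(name))
            {
                XmlAttribute value = node.Attributes["value"];
                found[name] = value != null ? value.Value : string.Empty;
            }
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical sample markup standing in for a real response body.
        const string sample =
            "<html><body><form>" +
            "<input name=\"FirstName\" value=\"Ada\" />" +
            "<input name=\"LastName\" value=\"Lovelace\" />" +
            "<input name=\"Email\" value=\"ada@example.com\" />" +
            "</form></body></html>";

        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(sample)))
        {
            foreach (var pair in Extract(stream, "FirstName", "LastName"))
            {
                Console.WriteLine(pair.Key + " = " + pair.Value);
            }
        }
    }
}
```

In a real crawler the `MemoryStream` would be replaced by the stream returned from `GetExternalData`, and the XPath would need adjusting to wherever the target page actually keeps those values.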

EDIT: per the OP's question, headers can be added to the request to emulate a user agent. For example:

request = (HttpWebRequest)WebRequest.Create( url );
request.Accept = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/x-shockwave-flash, */*";
request.Timeout = timeout;
request.Headers.Add( "Cookie", cookies );

// manifest as a standard user agent
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)";
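The same idea can be sketched with `HttpRequestMessage` from `System.Net.Http`, which is a different API from the `HttpWebRequest` used above. This is only an illustration under assumptions: the `RequestBuilder` class, its `Build` method, the URL, and the header values are hypothetical, and the request is built but never sent.

```csharp
using System;
using System.Net.Http;

public static class RequestBuilder
{
    // Builds a GET request that carries a browser-like User-Agent header.
    public static HttpRequestMessage Build(string url, string userAgent)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);

        // TryAddWithoutValidation stores the header text as given instead of
        // rejecting strings that do not match the strict header grammar.
        request.Headers.TryAddWithoutValidation("User-Agent", userAgent);
        request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml");
        return request;
    }

    public static void Main()
    {
        // Hypothetical URL and user agent string, chosen for illustration.
        HttpRequestMessage request = Build(
            "http://www.example.com/",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");

        Console.WriteLine("Has User-Agent: " + request.Headers.Contains("User-Agent"));

        // To actually send it, pass the message to HttpClient.SendAsync.
    }
}
```

Keeping header setup in one helper makes it easy to vary the user agent per request without touching the code that sends it.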

You can also use Selenium to traverse the DOM easily and get the values of fields. It will also open the browser for you automatically.

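You can find a four-part tutorial that covers what you want.

This is the first part; the four-part series is titled "How to write a search engine".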

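If you want a pure C# way of traversing web pages, a good place to look is WatiN. It lets you easily open a web browser and navigate (and act on) web pages from C# code.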

Here is an example that searches Google using the API (taken from their documentation):

using System;
using WatiN.Core;

namespace WatiNGettingStarted
{
  class WatiNConsoleExample
  {
    [STAThread]
    static void Main(string[] args)
    {
      // Open a new Internet Explorer window and
      // go to the Google website.
      IE ie = new IE("http://www.google.com");

      // Find the search text field and type WatiN in it.
      ie.TextField(Find.ByName("q")).TypeText("WatiN");

      // Click the Google search button.
      ie.Button(Find.ByValue("Google Search")).Click();

      // Uncomment the following line if you want to close
      // Internet Explorer and the console window immediately.
      //ie.Close();
    }
  }
}

Why use a browser control instead of just using a WebClient object (or System.Net.WebRequest)?

Do not use the WebBrowser control for this.

Tim, SLaks, what do you suggest besides the WebBrowser control? I want my requests to look like they come from a human on the target website.

@dotnet: see my reply. You can pass any headers you like to make your request look like a particular user agent.

No, this is not homework. It is a project to sharpen my saw.