C#: scraping data from a wiki page (screen scraping)_C#_Screen Scraping_Screen_Html Agility Pack


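I want to scrape a specific wiki page.

My application will let the user enter a vehicle's registration number (for example, SBS8988Z) and display the related information found on the page.

For example, if the user types SBS8988Z into a text field in my application, it should look up this line on the wiki page:

SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

and return that line: SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

So far my code is (copied and edited from various websites):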

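WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";
getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url);

What's wrong with the code? To be blunt, everything. :P
The page is not formatted the way you are reading it, so you can't hope to get the desired content that way.
The content of the page (the part we're interested in) looks something like this: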

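<h2><span id="Deployments">Deployments</span></h2>
<p>
<b>SBS8987B</b>
 (SLBP 192/194*)

<b>SBS8988Z</b>
 (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

<b>SBS8989X</b>
 (SLBP SP)
</p>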

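Basically, we need to find the `b` element that contains the registration number we're looking for. Once we've found that element, we take its text and the text that follows it and put them together to form the result. Here's the code: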
// Requires the HtmlAgilityPack NuGet package and `using System.Linq;` at the top of the file
static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";

    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();

    // Create an HtmlDocument from the contents found at given url
    var doc = web.Load(url);

    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
              + "/following-sibling::p[1]"     // move to the first `p` element (where the actual content is in) after the header
              + "/b";                          // select the `b` elements

    // Get the elements from the specified XPath (SelectNodes returns null when nothing matches)
    var deployments = doc.DocumentNode.SelectNodes(xpath);
    if (deployments == null)
        return null;

    // Create a LINQ query to find the requested registration number and generate a result
    var query =
        from b in deployments                  // from the list of registration numbers
        where b.InnerText == reg               // find the registration we're looking for
        select reg + b.NextSibling?.InnerText; // and create the result combining the registration number with the description (the text following the `b` element, if any)

    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();

    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);

    return decoded;
}
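
To see what each step after loading the document does without hitting the network, the same lookup can be run against an inline HTML string. This is a minimal sketch, not the answer's exact method: the class name `DeploymentLookupDemo`, the `Lookup` helper, and the sample markup are assumptions made for illustration, and the markup is only modelled on the structure the XPath expects. It uses `HtmlDocument.LoadHtml` instead of `HtmlWeb.Load` and `FirstOrDefault` instead of `SingleOrDefault`.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class DeploymentLookupDemo
{
    // Sample markup (an assumption, modelled on the structure the XPath expects)
    public const string SampleHtml =
        "<h2><span id=\"Deployments\">Deployments</span></h2>" +
        "<p>" +
        "<b>SBS8987B</b> (SLBP 192/194*)\n" +
        "<b>SBS8988Z</b> (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)\n" +
        "<b>SBS8989X</b> (SLBP SP)\n" +
        "</p>";

    public static string Lookup(string html, string reg)
    {
        // Parse the markup from a string instead of downloading it
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Header -> first following <p> -> its <b> children (the registration numbers)
        var nodes = doc.DocumentNode.SelectNodes(
            "//h2[span/@id='Deployments']/following-sibling::p[1]/b");
        if (nodes == null)
            return null; // SelectNodes returns null when nothing matches

        // Pick the <b> whose text is the requested registration number
        var match = nodes.FirstOrDefault(b => b.InnerText.Trim() == reg);
        if (match == null)
            return null;

        // The description is the text node that directly follows the <b> element
        var description = match.NextSibling?.InnerText ?? "";

        // Decode entities such as "&amp;" and drop surrounding whitespace
        return WebUtility.HtmlDecode(reg + description).Trim();
    }

    static void Main()
    {
        Console.WriteLine(Lookup(SampleHtml, "SBS8988Z"));
        Console.WriteLine(Lookup(SampleHtml, "SBS0000A") ?? "(not found)");
    }
}
```

Returning null for an unknown registration number mirrors what the method above does through `SingleOrDefault`, so the caller can decide how to report a miss.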

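Please see this link. [1]
Yes, I've seen it. One of those sites is where I got this code from! Of course, I edited it, but it doesn't work :(
Haha! Am I right to assume that this can only find the information between the tags? I'll implement this and give it a try. Thanks a lot, Jeff!
n00b question: what happens after this code? Do I just paste it under my private void getDeployment_Click(object sender, EventArgs e) section? I also get an error: Since 'getDeployment_Click' returns void, a return keyword must not be followed by an object expression. Thanks a lot!
It's just a method. Paste it somewhere in your class and call it wherever you need it.
Thanks, Jeff! This works perfectly for me. If possible, could you explain what the code after var doc = web.Load(url); does? Thanks!
Jeff - this doesn't work for the page when I try to look up SBS1903P and SBS2838M. From reading the code, I deduce that the problem is caused by the extra headings under the Deployments heading. Is there any way to solve this?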
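
One possible way to handle extra headings under the Deployments heading is to search every `p` that belongs to the Deployments section rather than only the first one. The sketch below is untested against the real page and rests on assumptions: that the subheadings are `h3` elements, that each is followed by its own `p`, and that the section ends at the next `h2`. The class name `SectionLookupSketch`, the `Lookup` helper, and the sample markup (including its descriptions) are hypothetical.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class SectionLookupSketch
{
    // Hypothetical markup: the real page layout and descriptions may differ
    public const string SampleHtml =
        "<h2><span id=\"Deployments\">Deployments</span></h2>" +
        "<h3>Hypothetical subheading 1</h3>" +
        "<p><b>SBS1903P</b> (hypothetical description A)</p>" +
        "<h3>Hypothetical subheading 2</h3>" +
        "<p><b>SBS2838M</b> (hypothetical description B)</p>" +
        "<h2><span id=\"Other\">Other section</span></h2>" +
        "<p><b>SBS9999X</b> (outside the Deployments section)</p>";

    public static string Lookup(string html, string reg)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every <p> after the Deployments header whose nearest preceding
        // <h2> sibling is still the Deployments header, then take its <b> children
        const string xpath =
            "//h2[span/@id='Deployments']" +
            "/following-sibling::p[preceding-sibling::h2[1][span/@id='Deployments']]" +
            "/b";

        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return null; // no matching section or no registration numbers

        var match = nodes.FirstOrDefault(b => b.InnerText.Trim() == reg);
        if (match == null)
            return null;

        // Combine the registration number with the text that follows the <b> element
        var description = match.NextSibling?.InnerText ?? "";
        return WebUtility.HtmlDecode(reg + description).Trim();
    }

    static void Main()
    {
        Console.WriteLine(Lookup(SampleHtml, "SBS2838M"));
        Console.WriteLine(Lookup(SampleHtml, "SBS9999X") ?? "(not in Deployments)");
    }
}
```

The `preceding-sibling::h2[1]` test is what keeps the search inside the section: once another `h2` appears, later paragraphs no longer match.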