C#: Scraping data from a wiki page (screen scraping)


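I want to scrape a wiki page. Specifically:

My application will let the user enter a vehicle's registration number (for example, SBS8988Z) and display the related information (which is on the page).

For example, if the user enters SBS8988Z in a text field in my application, it should look up this line on the wiki page:

SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

and return SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).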

My code so far (copied and edited from various websites):

WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";
getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";

string sgwikiresult = getdeployment.DownloadString(url);

What's wrong with the code? Frankly, everything. :P

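The page's markup is not laid out the way you read it in the browser, so you can't expect to get the content you want that way.

The part of the page content we're interested in looks like this: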


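<h2><span id="Deployments">Deployments</span></h2>
<p>
<b>SBS8987B</b> (SLBP 192/194*)
<b>SBS8988Z</b> (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
<b>SBS8989X</b> (SLBP SP)
</p>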

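Basically, we need to find the b element that contains the registration number we're looking for. Once we've found that element, we take the text that follows it and combine the two to form the result. Here's the code: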

// Requires the HtmlAgilityPack NuGet package and `using System.Linq;` at the top of the file
static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";

    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();

    // Create an HtmlDocument from the contents found at given url
    var doc = web.Load(url);

    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
              + "/following-sibling::p[1]"     // move to the first `p` element (where the actual content is in) after the header
              + "/b";                          // select the `b` elements

    // Get the elements from the specified XPath (SelectNodes returns null when nothing matches)
    var deployments = doc.DocumentNode.SelectNodes(xpath);
    if (deployments == null)
        return null;

    // Create a LINQ query to find the  requested registration number and generate a result
    var query =
        from b in deployments                 // from the list of registration numbers
        where b.InnerText == reg              // find the registration we're looking for
        select reg + b.NextSibling?.InnerText; // and create the result combining the registration number with the description (the text following the `b` element, if any)

    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();

    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);

    return decoded;
}
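
To try the XPath without hitting the network, the same lookup can be run against an inline HTML string. This is only a minimal sketch: the markup is a hand-written stand-in inferred from the XPath (the real page may differ), and DeploymentLookupDemo and FindInHtml are hypothetical names, not part of the answer above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class DeploymentLookupDemo
{
    // Hand-written stand-in for the page; the wiki's real markup may differ.
    public const string SampleHtml =
        "<h2><span id=\"Deployments\">Deployments</span></h2>" +
        "<p><b>SBS8987B</b> (SLBP 192/194*)<br/>" +
        "<b>SBS8988Z</b> (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk<br/>" +
        "<b>SBS8989X</b> (SLBP SP)</p>";

    // Hypothetical helper: the same lookup as GetVehicleInfo, but on an HTML string.
    public static string FindInHtml(string html, string reg)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes(
            "//h2[span/@id='Deployments']/following-sibling::p[1]/b");
        if (nodes == null)
            return null; // SelectNodes returns null when nothing matches

        var match = nodes.FirstOrDefault(b => b.InnerText.Trim() == reg);
        if (match == null)
            return null;

        // The description is the text node right after the <b> element.
        var text = reg + (match.NextSibling?.InnerText ?? "");
        return System.Net.WebUtility.HtmlDecode(text).Trim();
    }

    static void Main()
    {
        Console.WriteLine(FindInHtml(SampleHtml, "SBS8988Z"));
        Console.WriteLine(FindInHtml(SampleHtml, "SBS0000X") ?? "(not found)");
    }
}
```

Keeping the parsing separate from the download like this makes it easier to see whether a failure comes from the XPath or from the request itself.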

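See this link.

Yes, I've seen it. One of those sites is where I got this code from! Of course I edited it, but it doesn't work :(

Haha! Am I right to assume that this approach only finds information between tags? I'll implement this and give it a try. Thanks a lot, Jeff!

n00b question: where does this code go? Do I just paste it inside my private void getDeployment_Click(object sender, EventArgs e) section? I also get an error: since getDeployment_Click returns void, a return keyword must not be followed by an object expression. Thanks a lot!

It's just a method. Paste it somewhere in your class and call it wherever you need it.

Thanks Jeff! This works perfectly for me. If possible, could you explain what the code does after the var doc = web.Load(url); part? Thanks!

Jeff, this doesn't work when I try to look up SBS1903P and SBS2838M on this page. From reading the code, I infer that the problem is caused by the extra headings under the Deployments heading. Is there any way to fix this?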
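
On the last question (registrations listed under extra sub-headings inside the Deployments section), one possible approach is to widen the XPath so that it takes every p between the Deployments h2 and the next h2, rather than only the first p. The following is a sketch under assumptions, not a confirmed fix: it assumes the sub-headings are h3 elements and the entries still sit in sibling p elements as b tags. The sample markup, the registrations in it, and the names SectionLookupDemo and FindInSection are all made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class SectionLookupDemo
{
    // Every <b> inside any <p> whose nearest preceding <h2> is the Deployments header.
    public const string SectionXPath =
        "//h2[span/@id='Deployments']" +
        "/following-sibling::p[preceding-sibling::h2[1][span/@id='Deployments']]" +
        "/b";

    // Made-up markup with sub-headings; the real page may be structured differently.
    public const string SampleHtml =
        "<h2><span id=\"Deployments\">Deployments</span></h2>" +
        "<h3>Group A</h3>" +
        "<p><b>REG0001A</b> (sample entry A)</p>" +
        "<h3>Group B</h3>" +
        "<p><b>REG0002B</b> (sample entry B)</p>" +
        "<h2><span id=\"Other\">Other</span></h2>" +
        "<p><b>REG0003C</b> (outside the section)</p>";

    // Hypothetical helper: look up a registration anywhere in the Deployments section.
    public static string FindInSection(string html, string reg)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes(SectionXPath);
        if (nodes == null)
            return null; // nothing matched the XPath

        var match = nodes.FirstOrDefault(b => b.InnerText.Trim() == reg);
        if (match == null)
            return null; // registration not listed in this section

        var text = reg + (match.NextSibling?.InnerText ?? "");
        return System.Net.WebUtility.HtmlDecode(text).Trim();
    }

    static void Main()
    {
        Console.WriteLine(FindInSection(SampleHtml, "REG0002B"));
        Console.WriteLine(FindInSection(SampleHtml, "REG0003C") ?? "(not in Deployments)");
    }
}
```

The predicate on p keeps the search inside the section: a paragraph is accepted only if the closest h2 before it is the Deployments header, so entries under later h2 sections are ignored.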