C#: Scraping data from a wiki page (screen scraping)
I want to scrape a wiki page. Specifically, my application will let the user enter a vehicle's registration number (e.g. SBS8988Z) and display the related information, which is on the page. For example, if the user types SBS8988Z into the text field of my application, it should look for this line on the wiki page:
SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
and return SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).
My code so far (copied and edited from various websites):
WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";
getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url);

What's wrong with the code? Frankly, everything. :P
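The page's markup is not laid out the way you see it rendered in the browser, so you can't expect to get the content you want by reading the downloaded string that way.
The page content (the part we're interested in) looks like this: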
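<h2><span id="Deployments">Deployments</span></h2>
<p>
<b>SBS8987B</b> (SLBP 192/194*)
<b>SBS8988Z</b> (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
<b>SBS8989X</b> (SLBP SP)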
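Basically, we need to find the `b` element that contains the registration number we're looking for. Once we have that element, we take the text that follows it and combine the two to form the result. Here is the code: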
// Needs the HtmlAgilityPack package and "using System.Linq;" at the top of the file
static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";

    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();

    // Create an HtmlDocument from the contents found at the given url
    var doc = web.Load(url);

    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id 'Deployments' (the header)
        + "/following-sibling::p[1]"           // move to the first `p` element after the header (where the actual content is)
        + "/b";                                // select the `b` elements

    // Get the elements from the specified XPath (SelectNodes returns null when nothing matches)
    var deployments = doc.DocumentNode.SelectNodes(xpath);
    if (deployments == null)
        return null;

    // Create a LINQ query to find the requested registration number and generate a result
    var query =
        from b in deployments                 // from the list of registration numbers
        where b.InnerText == reg              // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and combine it with the description (the text following the `b` element)

    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();

    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);
    return decoded;
}
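If you want to try the XPath without hitting the network, here is a minimal sketch that runs the same query against an inline HTML string. The class name `DeploymentLookup`, the method `FindInHtml` and the sample markup are my own assumptions for illustration; the markup is only modelled on the excerpt above, not copied from the live page.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class DeploymentLookup
{
    // Hypothetical helper: same XPath as GetVehicleInfo, but applied to an
    // HTML string instead of a downloaded page.
    public static string FindInHtml(string html, string reg)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var xpath = "//h2[span/@id='Deployments']/following-sibling::p[1]/b";

        // SelectNodes returns null (not an empty collection) when nothing matches
        var deployments = doc.DocumentNode.SelectNodes(xpath);
        if (deployments == null)
            return null;

        // Registration number plus the text node that follows the <b> element
        var content = deployments
            .Where(b => b.InnerText.Trim() == reg)
            .Select(b => reg + (b.NextSibling?.InnerText ?? ""))
            .SingleOrDefault();

        return content == null ? null : WebUtility.HtmlDecode(content).Trim();
    }

    public static void Main()
    {
        // Assumed sample markup, modelled on the excerpt in the answer
        var html =
            "<h2><span id=\"Deployments\">Deployments</span></h2>" +
            "<p><b>SBS8987B</b> (SLBP 192/194*)<br />" +
            "<b>SBS8988Z</b> (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk (2nd Gen)<br />" +
            "<b>SBS8989X</b> (SLBP SP)</p>";

        Console.WriteLine(FindInHtml(html, "SBS8988Z"));
        Console.WriteLine(FindInHtml(html, "SBS0000X") ?? "(not found)");
    }
}
```

Returning null for "not found" mirrors what `SingleOrDefault` does in the original method, so the caller can decide what to show the user.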
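See this link: [1]

Yes, I've seen it. One of those sites is where I got this code from! Of course I edited it, but it doesn't work :(

Haha! Am I right to assume this will only find information between the tags? I'll implement this and give it a try. Thanks a lot, Jeff!

n00b question: what happens after this code? Do I just paste it under my private void getDeployment_Click(object sender, EventArgs e) section? I also get an error: since getDeployment_Click returns void, a return keyword must not be followed by an object expression. Thanks a lot!

It's just a method. Paste it somewhere in your class and call it wherever you need it.

Thanks Jeff! This works perfectly for me. If possible, could you explain what the code after the var doc = web.Load(url) part does? Thanks!

Jeff - this doesn't work for this page when I try to look up SBS1903P and SBS2838M. From reading the code, I gather the problem is caused by the extra headers under the Deployments header. Is there any way to fix this?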
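On the last question: the XPath in the answer only looks at the first `p` after the Deployments header, so anything placed under a sub-header is never searched. One possible workaround, sketched below, is to walk every sibling after the `h2` until the next `h2` and collect all `b` elements along the way. I haven't checked the actual page, so the sub-header layout (`h3` plus `p`), the names `SectionLookup`, `BoldNodesInSection` and `FindInHtml`, and the sample entries are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class SectionLookup
{
    // Hypothetical helper: yields every <b> between the <h2> whose span has
    // the given id and the next <h2>, whatever sub-headers sit in between.
    static IEnumerable<HtmlNode> BoldNodesInSection(HtmlDocument doc, string headerId)
    {
        var header = doc.DocumentNode.SelectSingleNode("//h2[span/@id='" + headerId + "']");
        if (header == null)
            yield break;

        for (var node = header.NextSibling; node != null; node = node.NextSibling)
        {
            if (node.Name == "h2")
                yield break; // reached the next top-level section
            foreach (var b in node.Descendants("b"))
                yield return b;
        }
    }

    public static string FindInHtml(string html, string reg)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var match = BoldNodesInSection(doc, "Deployments")
            .FirstOrDefault(b => b.InnerText.Trim() == reg);
        if (match == null)
            return null;

        // Description is the text node right after the <b> element, if any
        var description = match.NextSibling?.InnerText ?? "";
        return WebUtility.HtmlDecode(reg + description).Trim();
    }

    public static void Main()
    {
        // Made-up markup with sub-headers; the descriptions are placeholders
        var html =
            "<h2><span id=\"Deployments\">Deployments</span></h2>" +
            "<h3>Group A</h3><p><b>SBS8988Z</b> (example A)</p>" +
            "<h3>Group B</h3><p><b>SBS1903P</b> (example B)</p>" +
            "<h2><span id=\"Other\">Other</span></h2>" +
            "<p><b>SBS9999X</b> (outside the section)</p>";

        Console.WriteLine(FindInHtml(html, "SBS1903P"));
        Console.WriteLine(FindInHtml(html, "SBS9999X") ?? "(not found)");
    }
}
```

This uses `FirstOrDefault` rather than `SingleOrDefault`, so a registration listed twice returns the first hit instead of throwing. If the real page nests things differently, the loop condition is the part to adjust.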