C#: Is it possible to search for multiple tag types at once when screen scraping with HtmlAgilityPack?
Tags: c#, linq, screen-scraping, html-agility-pack, linq-to-objects

Although it is still in a malleable state, this code works:
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();

    // Parses the HTML passed in; note that the paragraphs below are
    // selected from the downloaded page ("document"), not from "doc".
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sourceHtml);

    // Download and parse the page to scrape.
    var getHtmlWeb = new HtmlWeb();
    var document = getHtmlWeb.Load("http://www.montereycountyweekly.com/opinion/letters/article_e333a222-942d-11e3-ba9c-001a4bcf6878.html");

    // SelectNodes returns null when nothing matches, hence the null check.
    var pTags = document.DocumentNode.SelectNodes("//p");
    if (pTags != null)
    {
        foreach (var pTag in pTags)
        {
            pars.Add(pTag.InnerText);
            MessageBox.Show(pTag.InnerText);
        }
    }
    MessageBox.Show("done!");
    return pars;
}
…or a LINQified version, such as:
foreach (var par in doc.DocumentNode
    .DescendantNodes()
    .Single(x => x.Id == "body")
    .DescendantNodes()
    .Where(x => x.Name == "h1" || x.Name == "h2" || x.Name == "h3" || x.Name == "p"))
?

I think this might work for you:
doc.DocumentNode.ChildNodes.Where(x => (x.NodeType == HtmlNodeType.Text));
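This will capture all the text elements.

Unfortunately, it does not: on the example page shown above, the message box pops up a few times with an empty string (nothing), and that is all.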
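
The empty strings are most likely because DocumentNode.ChildNodes holds only the direct children of the document root, so the text nodes it finds are just the whitespace between top-level elements. Below is a minimal sketch of two ways to pick up several tag types in one pass: an XPath union passed to SelectNodes, and a LINQ filter over Descendants(). The class and method names (MultiTagScraper, GetTextViaXPath, GetTextViaLinq) and the sample HTML are made up for illustration, and the set of tags (h1, h2, h3, p) is an assumption based on the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class; the names here are illustrative only.
public static class MultiTagScraper
{
    // Option 1: an XPath union ("|") matches several element names in one query.
    public static List<string> GetTextViaXPath(string sourceHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(sourceHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//h1 | //h2 | //h3 | //p");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(text => text.Length > 0)   // skip whitespace-only nodes
            .ToList();
    }

    // Option 2: walk every descendant and keep those whose name is in a set.
    public static List<string> GetTextViaLinq(string sourceHtml)
    {
        var wanted = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "h1", "h2", "h3", "p"
        };

        var doc = new HtmlDocument();
        doc.LoadHtml(sourceHtml);

        return doc.DocumentNode
            .Descendants()                    // all nodes below the root, at any depth
            .Where(n => wanted.Contains(n.Name))
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(text => text.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        // Made-up sample input, not the page from the question.
        const string html =
            "<html><body><h1>Title</h1><p>First</p>" +
            "<div><h2>Sub</h2><p>Second</p></div></body></html>";

        foreach (var text in GetTextViaXPath(html))
        {
            Console.WriteLine(text);
        }

        foreach (var text in GetTextViaLinq(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Either method can be swapped into GetParagraphsListFromHtml in place of the "//p" query. The LINQ variant is easier to extend with extra conditions, while the XPath variant keeps the selection in a single string.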