C#: Parsing HTML with HtmlAgilityPack


I have the HTML below, which I am trying to parse with the HTML Agility Pack.

This is a snippet of the whole file that my code returns:

<div class="story-body fnt-13 p20-b user-gen">
    <p>text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <div  class="gallery clr bdr aln-c js-no-shadow mod  cld">
        <div>
            <ol>
                <li class="fader-item aln-c ">
                    <div class="imageWrap m10-b">
                       &#8203;<img class="http://www.domain.com/picture.png| " src="http://www.domain.com/picture.png" alt="alt text" />
                    </div>
                    <p class="caption">caption text</p>
                </li>
            </ol>
        </div>
    </div >
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
</div>

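The code loops over each p and, for now, appends it to a text box. Everything works except for the div tag with the class gallery clr bdr aln-c js-no-shadow mod cld. For that part of the HTML, the result I get is the &#8203; and the caption text.

What is the best way to omit this from the results?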

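I'm not quite sure what you're asking. I think you're asking how to get only the direct children of a particular div. If so, use ChildNodes rather than Descendants. That is:

.SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))

The problem is that Descendants does a fully recursive traversal of the document tree, so it also returns p elements nested inside inner divs.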
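
To make the ChildNodes suggestion concrete, here is a minimal sketch of the whole query. The HTML string is a shortened, hypothetical stand-in for the page in the question, and matching the outer div with Contains("story-body") is my assumption about how you locate it; adjust both to your real document.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical, shortened version of the markup from the question.
        const string html = @"
<div class=""story-body fnt-13 p20-b user-gen"">
    <p>first paragraph</p>
    <div class=""gallery clr""><ol><li><p class=""caption"">caption text</p></li></ol></div>
    <p>second paragraph</p>
</div>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        var paragraphs = document.DocumentNode
            .Descendants("div")
            // Assumption: the story container is identified by its class attribute.
            .Where(div => div.GetAttributeValue("class", "").Contains("story-body"))
            // ChildNodes is a property holding only the direct children,
            // so the caption <p> nested inside the gallery div is never visited.
            .SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))
            .ToList();

        foreach (var p in paragraphs)
        {
            Console.WriteLine(p.InnerText.Trim());
        }
    }
}
```

With this input the loop should visit only the two top-level paragraphs. The &#8203; from the question sits inside the gallery div rather than in a direct child p, so it should drop out for the same reason.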

XPath is your friend. Try this, and forget that crappy XLinq syntax :-)

HtmlNodeCollection tl = document.DocumentNode.SelectNodes("//p[not(@*)]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Console.WriteLine(node.InnerText.Trim());
}

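This expression selects all p nodes that have no attributes set. Note that SelectNodes returns null when nothing matches, so check for that before looping. For other examples, see here: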

Comments:

Psst... so there are two questions. "What is the best way to omit this from the results?" is one; what's the other? I have no idea what you're talking about :p

Easier would be to use the XPath //p, but that would include the <p class="caption"> caption text.

I am trying not to include anything from the div on lines 4 to 15, only the other <p>s.

@Nathan: No, I don't think it would include those. ChildNodes only gets the direct children of the particular node. If you replace the SelectMany in your LINQ expression with my SelectMany, I think you'll find it works as advertised. My expression uses Where because there is no ChildNodes overload that takes a type, i.e. you can't say ChildNodes("p").

OK, I think I understand what you mean. Like the following? var links = document.DocumentNode.Descendants("div").Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen")).SelectMany(div => div.ChildNodes.Where(n => n.Name == "p")).ToList();

Thanks, that is a nice solution. I'll also look into XPath, as it looks like a better solution!

It does work, but it also includes other p nodes on the page. What I posted at the top is just a snippet.

Just add other filters to the expression, i.e. the stuff between the [ and ] characters. With XPath you can get a node collection from a particular div with a particular class, e.g. SelectNodes("//div[@class='story']") will get all divs, starting from the root, whose class attribute has the value "story".
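
For anyone landing here later, the XPath filtering suggested in the comments above could be sketched roughly as follows. This is only an illustration: the HTML string is a shortened, hypothetical stand-in for the real page, and using contains(@class, 'story-body') to pick the container is my assumption (an exact @class='...' comparison would need the full attribute value).

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical, shortened version of the markup from the question.
        const string html = @"
<p>unrelated paragraph elsewhere on the page</p>
<div class=""story-body fnt-13 p20-b user-gen"">
    <p>first paragraph</p>
    <div class=""gallery clr""><ol><li><p class=""caption"">caption text</p></li></ol></div>
    <p>second paragraph</p>
</div>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // The predicate in [ ] restricts the match to the story container;
        // the single slash in "/p" then takes only its direct child <p> elements.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(
            "//div[contains(@class, 'story-body')]/p");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // DeEntitize turns entities such as &amp; back into plain characters.
                Console.WriteLine(HtmlEntity.DeEntitize(node.InnerText).Trim());
            }
        }
    }
}
```

The difference from //p[not(@*)] is that this version is scoped to one container, so paragraphs elsewhere on the page and the nested caption should both be left out. Keep in mind that contains() is a substring test, so it would also match a class such as story-body-wide.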