C#: Getting the content between two HTML tags using HTML Agility Pack


c# .net


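We created an absolutely massive help document in Word, and it was used to generate an even more massive and unwieldy HTM document. Using C# and this library, I want to grab and display just one section of this file at any point in my application. The sections are split up like this: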

<!--logical section starts here -->
<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section A</a></h1>
</div>
 <div> Lots of unnecessary markup for simple formatting... </div>
 .....
<!--logical section ends here -->

<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section B</a></h1>
</div>

Since I have not found any documentation on this, I don't know how to get from my start node to the next h1 element. Any suggestions would be appreciated.

So the result you actually want is the div around the h1 tag? If so, this should work:

helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(@name, '"+sectionName+"')]/ancestor::div");

Depending on your HTML, you can also use SelectNodes, like this:

helpDocument.DocumentNode.SelectNodes("//h1/a[starts-with(@name,'_Toc')]/ancestor::div");

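Oh, and while testing this I noticed that what did not work for me was the dot in the contains function; once I changed it to the name attribute, everything worked.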
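To see which sections the SelectNodes query above would pick up, a small helper can list each anchor name next to its visible heading text. This is only a sketch: the class name HelpToc and the method ListSections are hypothetical, and it assumes every section heading is an h1 containing an a element whose name starts with "_Toc", as in the sample markup.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HelpToc
{
    // Hypothetical helper: returns (anchor name, heading text) pairs in document order.
    public static List<(string Name, string Title)> ListSections(HtmlDocument helpDocument)
    {
        var result = new List<(string Name, string Title)>();

        HtmlNodeCollection anchors = helpDocument.DocumentNode
            .SelectNodes("//h1/a[starts-with(@name,'_Toc')]");

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        if (anchors == null)
            return result;

        foreach (HtmlNode anchor in anchors)
        {
            string name = anchor.GetAttributeValue("name", "");
            // DeEntitize turns entities such as &nbsp; back into characters before trimming.
            string title = HtmlEntity.DeEntitize(anchor.InnerText).Trim();
            result.Add((name, title));
        }

        return result;
    }
}
```

A list of pairs is used rather than a dictionary because the sample markup repeats the same name value on more than one heading.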

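I think this will do it, although it assumes that H1 tags only appear in section headers. If that is not the case, you can add a Where on the Descendants call to apply further filters to any H1 nodes it finds. Note that this will include all siblings of the div it finds, until it reaches the next div that has a section name.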

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

private List<HtmlNode> GetSection(HtmlDocument helpDocument, string SectionName)
{
    // Find the div whose text is the section name; Trim() ignores the line breaks around the h1.
    HtmlNode startNode = helpDocument.DocumentNode.Descendants("div")
        .FirstOrDefault(d => d.InnerText.Trim().Equals(SectionName, StringComparison.InvariantCultureIgnoreCase));
    if (startNode == null)
        return null; // section not found

    // Collect the following siblings until one contains an h1 (the next section header).
    List<HtmlNode> section = new List<HtmlNode>();
    HtmlNode sibling = startNode.NextSibling;
    while (sibling != null && !sibling.Descendants("h1").Any())
    {
        section.Add(sibling);
        sibling = sibling.NextSibling;
    }

    return section;
}

Not quite. I want the div around the h1 tag, but I also want every div/span that follows it, up to the div around the next h1 tag.

Thanks, this works well. I had to modify the filter slightly because several divs in my document contain the section name. I ended up using

HtmlNode startNode = helpDocument.DocumentNode.Descendants("h1").Where(d => d.InnerText.Contains(SectionName)).FirstOrDefault();

and moving up to the parent node from there. The rest works fine. Thanks.
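Putting the comment's change together with the sibling loop from the answer might look like the sketch below. It is not the poster's actual code: the class name HelpSections and the methods GetSectionWithHeading and ToHtml are hypothetical. It assumes each h1 sits directly inside its wrapper div, as in the sample markup. Unlike the answer's version, it keeps the heading div in the result and returns an empty list instead of null when the section is not found.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HelpSections
{
    // Hypothetical helper: returns the heading div for sectionName followed by
    // every sibling up to (not including) the next sibling that contains an h1.
    public static List<HtmlNode> GetSectionWithHeading(HtmlDocument helpDocument, string sectionName)
    {
        // Search the h1 elements directly, so other divs that merely
        // mention the section name are not matched.
        HtmlNode heading = helpDocument.DocumentNode
            .Descendants("h1")
            .FirstOrDefault(h => h.InnerText.Contains(sectionName));
        if (heading == null)
            return new List<HtmlNode>(); // section not found

        // Assumption: the parent of the h1 is the wrapper div that starts the section.
        HtmlNode startNode = heading.ParentNode;

        var section = new List<HtmlNode> { startNode };
        for (HtmlNode sibling = startNode.NextSibling;
             sibling != null && !sibling.Descendants("h1").Any();
             sibling = sibling.NextSibling)
        {
            section.Add(sibling); // text nodes between elements are kept as well
        }

        return section;
    }

    // Joins the collected nodes back into one HTML string for display.
    public static string ToHtml(IEnumerable<HtmlNode> nodes)
    {
        return string.Concat(nodes.Select(n => n.OuterHtml));
    }
}
```

Calling HelpSections.ToHtml(HelpSections.GetSectionWithHeading(helpDocument, "Section A")) would then give a string that can be handed to whatever control displays the help text.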
