C# Get all nodes and their content using HtmlDocument/HtmlAgilityPack


c# html uwp html-agility-pack


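I need to get all the nodes from an HTML document, then get the text and child nodes of each node, and then do the same for those child nodes. For example, I have the following HTML: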

<p>This <b>is a <a href="">Link</a></b> with <b>bold</b></p>

This is a Link with bold

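So I need a way to get the p node, then the unformatted text ("This"), the bold-only text ("is a"), the bold link ("Link"), and the rest of the formatted and unformatted text.
I know that with HtmlDocument I can select all nodes and child nodes. But how can I get the text before a child node, then the child node itself, and then its own text and child nodes, so that I can build the rendered version of the HTML ("This is a Link with bold")?
Note that the example above is simple. The real HTML will contain more complex content such as lists, frames, numbered lists, triple-formatted text, and so on. Also note that rendering is not the problem. I have already done that, but in another way. What I need is only the part that gets the nodes and their content.
Also, I can't ignore any node, so I can't filter anything out. The main node may be a p, div, frame, ul, etc.
After looking at HtmlDocument and its properties, and thanks to @HungCao's observation, I found a simple way to interpret the HTML code.
My code is a bit too complex to post as an example, so I will post a trimmed-down version of it.
First, the HtmlDocument must be loaded. This can be done in any function: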

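    using HtmlAgilityPack;

    HtmlDocument htmlDoc = new HtmlDocument();
    // Same markup as the example in the question (quotes doubled for the verbatim string).
    string html = @"<p>This <b>is a <a href="""">Link</a></b> with <b>bold</b></p>";
    htmlDoc.LoadHtml(html);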
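Note that there is a property called "NodeType", but in my experience it did not return the type I expected. So I use the "Name" property instead (also note that the Name property of an HtmlNode is not the same thing as the name attribute in HTML).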
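To make the role of the "Name" property concrete, here is a minimal sketch that walks the tree in document order and records each node's Name, plus the text of every "#text" node. The `NodeWalker` class and `Outline` method are hypothetical names introduced for illustration; only `HtmlDocument`, `HtmlNode.Name`, `HtmlNode.InnerText` and `HtmlNode.ChildNodes` come from HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NodeWalker
{
    // Depth-first walk in document order. Each entry is indented by depth and
    // holds the node name, followed by the text when the node is a #text node.
    public static List<string> Outline(HtmlNode node, int depth = 0, List<string> lines = null)
    {
        lines = lines ?? new List<string>();
        string indent = new string(' ', depth * 2);

        if (node.Name == "#text")
            lines.Add(indent + "#text \"" + node.InnerText + "\"");
        else
            lines.Add(indent + node.Name);

        // ChildNodes includes text nodes, so text that appears before, between
        // and after child elements shows up in its original position.
        foreach (HtmlNode child in node.ChildNodes)
            Outline(child, depth + 1, lines);

        return lines;
    }
}
```

Calling `NodeWalker.Outline(htmlDoc.DocumentNode)` on the example HTML should list `#text "This "` before the first `b`, and `#text " with "` between the two `b` elements, which is the "text before the child node" ordering the question asks about.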
Finally, we have the InterNode function, which adds the inlines to the referenced (ref) paragraph:

    using System.Collections.Generic;
    using HtmlAgilityPack;
    using Windows.UI.Text;
    using Windows.UI.Xaml.Documents;

    public bool InterNode(HtmlNode htmlNode, ref Paragraph originalPar)
    {
        string htmlNodeName = htmlNode.Name.ToLower();

        // Collect the names of all ancestors. We need every one of them,
        // because the text could be b(old) and i(talic) at the same time.
        List<string> nodeAttList = new List<string>();
        HtmlNode parentNode = htmlNode.ParentNode;
        while (parentNode != null)
        {
            nodeAttList.Add(parentNode.Name);
            parentNode = parentNode.ParentNode;
        }

        if (htmlNodeName == "#text") // #text means it is a text node, like <#text/>. Thanks @HungCao
        {
            Run newRun = new Run();
            foreach (string nodeAttStr in nodeAttList) // with this we can set all the attributes on the inline
            {
                switch (nodeAttStr)
                {
                    case "b":
                    case "strong":
                        newRun.FontWeight = FontWeights.Bold;
                        break;
                    case "i":
                    case "em":
                        newRun.FontStyle = FontStyle.Italic;
                        break;
                }
            }
            newRun.Text = htmlNode.InnerText;
            // Add the finished inline to the referenced paragraph.
            originalPar.Inlines.Add(newRun);
        }
        else
        {
            // Not a #text node: don't load its InnerText. It is another node, and any
            // text it has will always be in a #text child node, so recurse instead.
            foreach (HtmlNode childNode in htmlNode.ChildNodes)
            {
                InterNode(childNode, ref originalPar);
            }
        }
        return true;
    }
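The same ancestor-walking idea can be tried outside a UWP project. The sketch below is not the author's code: `StyledSpan` and `SpanCollector` are hypothetical stand-ins for the XAML `Run`, so the logic can be run and inspected without `Windows.UI.Xaml`. It assumes the `HtmlEntity.DeEntitize` and `GetAttributeValue` helpers from HtmlAgilityPack to decode entities and read the `href`.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical UI-free stand-in for a XAML Run: the text plus the
// formatting inherited from its ancestors.
public sealed class StyledSpan
{
    public string Text { get; set; }
    public bool Bold { get; set; }
    public bool Italic { get; set; }
    public string Href { get; set; } // null when the text is not inside an <a>
}

public static class SpanCollector
{
    public static List<StyledSpan> Collect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var spans = new List<StyledSpan>();
        Visit(doc.DocumentNode, spans);
        return spans;
    }

    private static void Visit(HtmlNode node, List<StyledSpan> spans)
    {
        if (node.Name == "#text")
        {
            var span = new StyledSpan { Text = HtmlEntity.DeEntitize(node.InnerText) };
            // Walk up the parents and accumulate formatting, as InterNode does.
            for (HtmlNode p = node.ParentNode; p != null; p = p.ParentNode)
            {
                switch (p.Name)
                {
                    case "b": case "strong": span.Bold = true; break;
                    case "i": case "em": span.Italic = true; break;
                    case "a":
                        // Keep the href of the nearest enclosing link only.
                        if (span.Href == null) span.Href = p.GetAttributeValue("href", "");
                        break;
                }
            }
            spans.Add(span);
            return;
        }

        // Element or document node: its text lives in #text children.
        foreach (HtmlNode child in node.ChildNodes)
            Visit(child, spans);
    }
}
```

For the example HTML this should yield five spans in document order: "This ", "is a " (bold), "Link" (bold, inside a link), " with ", and "bold" (bold). Mapping each span to a `Run` or `Hyperlink` is then a separate step.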


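Note: I know I said my app needs to render the HTML in a different way than a WebView does, and I know this sample code produces the same result as a WebView would. But as I said before, this is just a trimmed-down version of my final code. In fact, my original/full code works the way I need it to; this is only the base.

Have you looked at this?

If you are talking about Hasan's comment: yes, I tried a repository that converts HTML to XAML, but sadly my app is not that basic. For example, if there is a div with class X, I need to show an image, or if there is an a whose href points to a specific domain, I need to call some function on click.

As stated in the comments, there is no ready-made solution available. You need to build your own parser and do all the work manually, one by one, based on the HTML tag. Another thing that comes close to your requirement is ...

I know there is no easy way like "htmldoc.toxaml();". What I am asking for is something like a node list, but one that includes the unformatted parts. I mean: you can get the main/root node and its children, from each child you can get its children, and so on. But, if I am right, the children are only the formatted nodes (bold, a, ul, etc.). Again, I am not asking for a 2-line solution, but for the best way to use HtmlDocument/HtmlAgilityPack to improve the actual code (2300 lines, but with a lot of interpretation errors).

By looking for "#text" nodes you can get the text before any node. In your example it would be something similar. If you still don't understand, let me know.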
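One of the comments above mentions needing special handling for a div with a given class and for links to a specific domain. A minimal sketch of how such rules could be checked per node is shown below. Everything named here is an assumption for illustration: the `NodeRules` class, the returned labels, and the placeholder values "gallery" and "example.com" do not come from the original post. Only `GetAttributeValue` is an HtmlAgilityPack call.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NodeRules
{
    // Returns a hypothetical label telling the renderer how to treat the node.
    public static string Classify(HtmlNode node)
    {
        if (node.Name == "div" && HasClass(node, "gallery"))   // placeholder class
            return "image";
        if (node.Name == "a" && LinksTo(node, "example.com"))  // placeholder domain
            return "in-app-link";
        return node.Name == "#text" ? "text" : "container";
    }

    // True when the space-separated class attribute contains className.
    private static bool HasClass(HtmlNode node, string className)
    {
        string classes = node.GetAttributeValue("class", "");
        return classes
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Contains(className);
    }

    // True when href is an absolute URL whose host is the domain or a subdomain of it.
    private static bool LinksTo(HtmlNode node, string domain)
    {
        string href = node.GetAttributeValue("href", "");
        Uri uri;
        return Uri.TryCreate(href, UriKind.Absolute, out uri)
            && (uri.Host.Equals(domain, StringComparison.OrdinalIgnoreCase)
                || uri.Host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase));
    }
}
```

A recursive walker such as InterNode could call `Classify` first and only fall back to the generic text/children handling when the label is "text" or "container".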