C# HTML parser with AngleSharp: text in IElement
I am writing an HTML parser with AngleSharp. It should take HTML like this as input:
<p>
Paragraph Text
<a href="https://www.example.com" class="external text" target="_new" rel="nofollow">Link Text</a>
Paragraph Text 2
</p>
and output it like this:
<p>
Paragraph Text
<a href="https://www.example.com">Link Text</a>
Paragraph Text 2
</p>
I wrote this recursive function to walk through the whole document:
using AngleSharp.Dom;
using AngleSharp.Dom.Html;
using AngleSharp.Extensions;
using AngleSharp.Parser.Html;

private void processHTMLNode(IElement node, IElement targetNode)
{
    switch (node.NodeName.ToLower())
    {
        //...
        case "a":
            if (node.HasAttribute("href") && node.GetAttribute("href").StartsWith("#"))
            {
                break;
            }
            var aNew = outputDocument.CreateElement("a");
            aNew.SetAttribute("href", node.GetAttribute("href"));
            aNew.TextContent = node.TextContent;
            targetNode.AppendChild(aNew);
            break;
        case "p":
            var pNew = outputDocument.CreateElement<IHtmlParagraphElement>();
            foreach (var childNode in node.Children)
            {
                processHTMLNode(childNode, pNew);
            }
            //TODO fix this
            pNew.TextContent = node.TextContent;
            targetNode.AppendChild(pNew);
            break;
    }
    //...
}
The problem is that setting the TextContent property overwrites the a elements that are children of the p node. The order (text -> link -> text) is also lost.
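The effect described above can be reproduced in isolation. The following is a minimal sketch, assuming AngleSharp 0.9.x (the AngleSharp.Parser.Html namespace used in this post; newer releases moved the parser to AngleSharp.Html.Parser and renamed Parse to ParseDocument). The class name TextContentDemo is made up for illustration.

```csharp
using System;
using AngleSharp.Dom;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x namespaces

public static class TextContentDemo
{
    public static void Main()
    {
        var document = new HtmlParser().Parse(
            "<p>Paragraph Text <a href=\"https://www.example.com\">Link Text</a> Paragraph Text 2</p>");
        var p = document.QuerySelector("p");

        // Children contains element nodes only, so the two text runs are skipped.
        Console.WriteLine(p.Children.Length);   // 1 (the <a> element)

        // ChildNodes contains every node in document order: text, <a>, text.
        Console.WriteLine(p.ChildNodes.Length); // 3

        // Assigning TextContent replaces all existing children with one text node,
        // which is why a previously appended <a> disappears.
        p.TextContent = p.TextContent;
        Console.WriteLine(p.Children.Length);   // 0
    }
}
```

Iterating Children never visits text nodes, and assigning TextContent afterwards discards whatever was appended before. Together these two effects would explain both the lost link and the lost ordering.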
How do I implement this correctly?

OK, so I solved my problem with the following code:
using AngleSharp.Dom;
using AngleSharp.Dom.Html;
using AngleSharp.Extensions;
using AngleSharp.Parser.Html;

private void processHTMLNode(INode node, IElement targetElement)
{
    IElement elementNode;
    IText textNode;
    if ((elementNode = node as IElement) != null)
    {
        switch (elementNode.NodeName.ToLower())
        {
            //...
            case "a":
                // Attribute members live on IElement, not INode, so use elementNode here.
                if (elementNode.HasAttribute("href") && elementNode.GetAttribute("href").StartsWith("#"))
                {
                    break;
                }
                var aNew = outputDocument.CreateElement("a");
                aNew.SetAttribute("href", elementNode.GetAttribute("href"));
                // ChildNodes includes text nodes; Children would only yield elements.
                foreach (var childNode in elementNode.ChildNodes)
                {
                    processHTMLNode(childNode, aNew);
                }
                targetElement.AppendChild(aNew);
                break;
            case "p":
                var pNew = outputDocument.CreateElement("p");
                // Walk ChildNodes so text and links are copied in document order.
                foreach (var childNode in elementNode.ChildNodes)
                {
                    processHTMLNode(childNode, pNew);
                }
                targetElement.AppendChild(pNew);
                break;
            //...
        }
    }
    else if ((textNode = node as IText) != null)
    {
        // Text nodes are copied as new text nodes owned by the output document.
        var newTextNode = outputDocument.CreateTextNode(textNode.Text);
        targetElement.AppendChild(newTextNode);
    }
}
An image from the AngleSharp documentation helped me a lot.
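To try the approach end to end, here is a self-contained sketch under a few assumptions. It targets AngleSharp 0.9.x, as the using directives in this post suggest. The class CopyDemo and the method ProcessNode are hypothetical names, and the static outputDocument field stands in for the field the answer's method relies on. It handles only p and a elements and is not the full function from the answer.

```csharp
using System;
using AngleSharp.Dom;
using AngleSharp.Dom.Html;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x namespaces

public static class CopyDemo
{
    // Hypothetical stand-in for the outputDocument field used in the post.
    private static IHtmlDocument outputDocument;

    private static void ProcessNode(INode node, IElement target)
    {
        var element = node as IElement;
        var text = node as IText;

        if (element != null)
        {
            var name = element.NodeName.ToLowerInvariant();
            if (name != "p" && name != "a")
            {
                return; // this sketch ignores every other element
            }

            // Recreate the element without attributes, then copy href only.
            var copy = outputDocument.CreateElement(name);
            if (name == "a" && element.HasAttribute("href"))
            {
                copy.SetAttribute("href", element.GetAttribute("href"));
            }

            // Recurse over ChildNodes so text nodes keep their position.
            foreach (var child in element.ChildNodes)
            {
                ProcessNode(child, copy);
            }
            target.AppendChild(copy);
        }
        else if (text != null)
        {
            target.AppendChild(outputDocument.CreateTextNode(text.Text));
        }
    }

    public static void Main()
    {
        var parser = new HtmlParser();
        var input = parser.Parse(
            "<p>Paragraph Text <a href=\"https://www.example.com\" class=\"external text\" " +
            "target=\"_new\" rel=\"nofollow\">Link Text</a> Paragraph Text 2</p>");
        outputDocument = parser.Parse("<html><body></body></html>");

        foreach (var child in input.Body.ChildNodes)
        {
            ProcessNode(child, outputDocument.Body);
        }

        // Serialize the rebuilt body to inspect the result.
        Console.WriteLine(outputDocument.Body.InnerHtml);
    }
}
```

With the sample input, the printed markup should contain the paragraph with its text, link, and text in the original order, and the link should carry only the href attribute.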