C#: Remove all elements between two elements

Tags: c#, html, xpath, html-agility-pack

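I have about 2,500 HTML files written to various standards, and I need to remove the footer section from each of them. The HTML below is the footer of one of my files; I need to remove the two hr elements and everything between them.

So far I have only tried XPath (with HTML Agility Pack), using SelectSingleNode and DocumentNode.SelectNodes("//hr"), and then iterating over the result with foreach. But I am too inexperienced to use XPath correctly, and I don't know how to select a node and its siblings (?) in order to delete them.

This is what I have so far, with help from the community. :) It finds the paragraph with the text "How to cite this", then selects all nodes from that paragraph down to the hr with the color "ff00ff". However, the nodes that were actually selected are not included in the list to delete, and they need to be removed together with the nodes between them.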
I think this is what you want.
Code:
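using System.Text.RegularExpressions;
string content = System.IO.File.ReadAllText(@"D:\New Text Document.txt");
// Strips every <hr ...> tag; the content between the tags stays in place.
string html = Regex.Replace(content, "<hr.*?>", "", RegexOptions.Singleline);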

Result:
//start element
<p style="text-align : center; color : Red; font-weight : bold;">How to cite this paper:</i></p>
<p style="text-align : left; color : black;">Ekmek&ccedil;ioglu, F. &Ccedil;una, Lynch, Michael F. &amp; Willett, Peter   (1996)&nbsp; &quot;Stemming and N-gram matching for term conflation in Turkish texts&quot;&nbsp;<em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style="text-align : center">&copy; the authors, 1996.</p>
<div align="center">Check for citations, <a href="http://scholar.google.co.uk/scholar?hl=en&amp;q=http://informationr.net/ir/2-2/paper13.html&amp;btnG=Search&amp;as_sdt=2000">using Google Scholar</a></div>

<table border="0" cellpadding="15" cellspacing="0" align="center">
<tr> 
    <td><a href="infres22.html"><h4>Contents</h4></a></td>
    <td align="center" valign="top"><h5 align="center"><IMG SRC="http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13" ALIGN=middle  WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href="http://www.digits.net/ ">Web Counter</a><br>Counting only since 13 December 2002</h5></td>
    <td><a href="http://InformationR.net/ir/"><h4>Home</h4></a></td>
</tr>
</table>
 //end element
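
The pattern above strips only the `<hr>` tags themselves. As a rough sketch of the regular-expression route, the following removes the first `<hr>` with size 3 and color #ff00ff, the next `<hr>` with the same attributes, and everything in between. The class name `FooterRegexSketch` and the method `RemoveFooter` are made up for illustration, and the pattern assumes the marker attributes look like the ones in the sample footer (any order, single, double, or no quotes). Regular expressions are fragile on irregular HTML, so treat this as a starting point rather than a robust solution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FooterRegexSketch
{
    // One marker tag: an <hr> whose attributes include size=3 and color=#ff00ff,
    // in either order. The lookaheads check each attribute without consuming text.
    private const string Marker =
        @"<hr\b" +
        @"(?=[^>]*\bsize\s*=\s*[""']?3[""']?(?=[\s>]))" +
        @"(?=[^>]*\bcolor\s*=\s*[""']?#ff00ff[""']?(?=[\s>]))" +
        @"[^>]*>";

    // Start marker, the shortest possible run of anything, then the end marker.
    private static readonly Regex Footer = new Regex(
        Marker + ".*?" + Marker,
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes the first marker pair and what lies between them.
    // If no complete pair is found, the input is returned unchanged.
    public static string RemoveFooter(string html)
    {
        return Footer.Replace(html, "", 1);
    }
}
```

Because the inner `<hr>` elements in the sample use size 1, they do not match the marker pattern and are removed along with the rest of the footer.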

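Assuming the start and end nodes really are identical, as you mention in the comments above (same tag name, attributes, and attribute values), this is not hard:
1. Select the start node.
2. Iterate over and remove each sibling node up to and including the end node.
3. Remove the start node.
Sample HTML: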
var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>

<div>DO NOT DELETE</div>

<hr color=""#ff00ff"" SIZE='3'> //start element
<p style='text-align : center; color : Red; font-weight : bold;'>How to cite this paper:</i></p>
<p style='text-align : left; color : black;'>Ekmek&ccedil;ioglu, F. &Ccedil;una, Lynch, Michael F. &amp; Willett, Peter   (1996)&nbsp; &quot;Stemming and N-gram matching for term conflation in Turkish texts&quot;&nbsp;<em>Information Research</em>, <strong>1</strong>(1) Available at: http://informationr.net/ir/2-2/paper13.html</p>
<p style='text-align : center'>&copy; the authors, 1996.</p>
<hr color='#ff00ff' size='1'><div align='center'>Check for citations, <a href='http://scholar.google.co.uk/scholar?hl=en&amp;q=http://informationr.net/ir/2-2/paper13.html&amp;btnG=Search&amp;as_sdt=2000'>using Google Scholar</a></div>
                                 <hr color='#ff00ff' size='1'>
<table border='0' cellpadding='15' cellspacing='0' align='center'>
<tr> 
    <td><a href='infres22.html'><h4>Contents</h4></a></td>
    <td align='center' valign='top'><h5 align='center'><IMG SRC='http://counter.digits.net/wc/-d/-z/6/-b/FF0033/paper13' ALIGN=middle  WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 VSPACE=2><br><a href='http://www.digits.net/'>Web Counter</a><br>Counting only since 13 December 2002</h5></td>
    <td><a href='http://InformationR.net/ir/'><h4>Home</h4></a></td>
</tr>
</table>
<hr COLOR='#ff00ff' SIZE=""3""> //end element

<div>DO NOT DELETE</div>
</body>
</html>";

Resulting output:
<!doctype html system 'html.dtd'>
<html><head></head>
<body>

<div>DO NOT DELETE</div>

 //end element

<div>DO NOT DELETE</div>
</body>
</html>


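Comments:

So what happens when you run the code above? Do you get an error? Does it fail to remove the nodes?

Have you considered using a regular expression instead? You could probably create a pattern that matches the end of the file.

@ryan wilson Yes, it does, but as stated above, I need some help removing everything between the two nodes. Unfortunately the two nodes look identical and I don't know how to use XPath properly. Side note: how do I format code correctly when pasting it here? Is there a guide?

There is an hr node between the start and end elements. So does that mean the start and end elements must have the same color and size attribute values?

That is exactly the problem: the start and end hr nodes look identical, but I want to remove them as well. I did have some success with the SelectNodes code above, but it does not remove the nodes it actually selects. I want to remove everything between the start and end nodes, including the start and end nodes themselves. I'm not sure whether that is feasible; maybe I need to split it into separate removal steps?

Yes and no. Yes, I want to remove the hr elements, but I also want to remove all the elements between them, that is, every HTML element between the start and end elements. The document contains more content (not shown here), and I don't want to delete any other elements by mistake. So your solution would delete content in the document that has to be kept.

Do the "//start element" and "//end element" texts need to be cleaned out of the document?

"//start element" and "//end element" are just comments marking where the start and end are in this example. These comment texts do not exist in the original files. The start and end elements are of the same type.

Works great, though I had to modify it along the way. The hr elements in my documents turned out to be tricky: the original authors used old standards and were not very consistent. But your code identifies the hr elements with these exact attributes and removes what is between them. Thanks a lot! :)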

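Code for the sibling-removal answer, run against the sample HTML above (the html variable):
using System.Text.RegularExpressions;
using HtmlAgilityPack;

var document = new HtmlDocument();
document.LoadHtml(html);
// Select the start node: the first <hr> with size 3 and color #ff00ff.
// SelectSingleNode returns null when nothing matches; guard against that in real code.
var startNode = document.DocumentNode.SelectSingleNode("//hr[@size='3'][@color='#ff00ff']");
// account for mismatched quotes in HTML source
var quotesRegex = new Regex("[\"']");
var startNodeNoQuotes = quotesRegex.Replace(startNode.OuterHtml, "");
HtmlNode siblingNode;

// Remove each following sibling, up to and including the end node.
while ((siblingNode = startNode.NextSibling) != null)
{
    siblingNode.Remove();
    // The removed node's markup is compared with the start node's, quotes ignored.
    if (quotesRegex.Replace(siblingNode.OuterHtml, "") == startNodeNoQuotes)
    {
        break;  // end node
    }
}

// Finally remove the start node itself.
startNode.Remove();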
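
Since the question involves roughly 2,500 files with inconsistent markup, one possible variation is to compare attribute values instead of serialized markup, and to confirm that a closing marker exists before deleting anything. The sketch below is not the answerer's code: `FooterRemover`, `IsMarker`, `RemoveFooter`, and `ProcessFolder` are hypothetical names, and the size and color defaults are assumptions taken from the sample footer.

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class FooterRemover
{
    // Hypothetical helper: true for an <hr> element with the given size and color.
    private static bool IsMarker(HtmlNode node, string size, string color)
    {
        return node.NodeType == HtmlNodeType.Element
            && string.Equals(node.Name, "hr", StringComparison.OrdinalIgnoreCase)
            && node.GetAttributeValue("size", "") == size
            && string.Equals(node.GetAttributeValue("color", ""), color,
                             StringComparison.OrdinalIgnoreCase);
    }

    // Removes the first marker <hr>, the next matching sibling <hr>, and every
    // sibling in between. Returns false and changes nothing if no pair is found.
    public static bool RemoveFooter(HtmlDocument document,
                                    string size = "3", string color = "#ff00ff")
    {
        HtmlNode start = document.DocumentNode.Descendants("hr")
            .FirstOrDefault(n => IsMarker(n, size, color));
        if (start == null) return false;

        // Look ahead for the closing marker before deleting anything.
        HtmlNode end = start.NextSibling;
        while (end != null && !IsMarker(end, size, color)) end = end.NextSibling;
        if (end == null) return false;

        HtmlNode current = start;
        while (current != null)
        {
            HtmlNode next = current.NextSibling; // capture before detaching
            bool isEnd = ReferenceEquals(current, end);
            current.Remove();
            if (isEnd) break;
            current = next;
        }
        return true;
    }

    // Hypothetical batch driver: rewrites every *.html file in a folder in place.
    public static void ProcessFolder(string folder)
    {
        foreach (string path in Directory.EnumerateFiles(folder, "*.html"))
        {
            var document = new HtmlDocument();
            document.Load(path);
            if (RemoveFooter(document)) document.Save(path);
            else Console.WriteLine("No footer markers found: " + path);
        }
    }
}
```

Running `ProcessFolder` on a copy of the files first is advisable, since it overwrites each file and older pages may need an explicit encoding when loading and saving.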
