C#: Remove unclosed opening <p> tags from an XHTML document

c# html xml regex

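I have a large XHTML document with a lot of tags. I noticed that some opening paragraph tags are repeated unnecessarily and never closed, and I want to remove them or replace them with whitespace. I only need code that identifies the unclosed paragraph tags and removes them.

Here is a small sample to show what I mean: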

<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>

<p>      <!-- extra tag -->
<p>      <!-- extra tag -->

<hr/>     

<p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
<p><strong>REPORT AND FINANCIAL STATEMENTS </strong></p>

Can someone give me code for a console application that just removes these unclosed paragraph tags?

You have to understand what kind of DOM tree gets built from this markup. It could be interpreted as

<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>

<p>      <!-- extra tag -->
  <p>      <!-- extra tag -->
    <hr/>     
    <p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
    <p><strong>REPORT AND FINANCIAL STATEMENTS </strong></p>
  </p>
</p>

or as a flat sequence in which each extra <p> is closed implicitly and left empty.

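You could try finding the nested p tags, moving their inner content up into the outer p tag, and then deleting the inner p tags that are left empty. Either way, I think you need to analyze the DOM tree first.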
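One way to do that analysis without a full parser is a single pass over the paragraph tags with a stack. The following is only a sketch under the first (nested) interpretation. The class name `UnclosedParagraphFinder` and the method `FindUnclosed` are made up for illustration, and the code assumes that no `<p>` appears inside comments, CDATA sections, or attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnclosedParagraphFinder
{
    // Matches <p>, <p ...attributes...> and </p>, but not <pre>, <param>, etc.
    private static readonly Regex ParagraphTag =
        new Regex(@"<(/?)p(?:\s[^<>]*)?>", RegexOptions.IgnoreCase);

    // Returns the start offsets of every opening <p> that is never matched
    // by a </p>, treating nested paragraphs as allowed.
    public static List<int> FindUnclosed(string xhtml)
    {
        var open = new Stack<int>();
        foreach (Match m in ParagraphTag.Matches(xhtml))
        {
            if (m.Value.EndsWith("/>", StringComparison.Ordinal))
            {
                continue; // self-closing <p />, nothing to match
            }

            if (m.Groups[1].Length == 0)
            {
                open.Push(m.Index); // opening tag
            }
            else if (open.Count > 0)
            {
                open.Pop(); // closing tag matches the innermost open <p>
            }
        }

        // Whatever is still on the stack was never closed.
        var result = new List<int>(open);
        result.Sort();
        return result;
    }
}
```

On the sample from the question, the two extra tags are the ones left on the stack at the end, so their offsets are what the method returns. What to do with those offsets (delete the tag, blank it out, or insert a closing tag) is a separate decision.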

This should work:

using System;
using System.Text;

public static class XHTMLCleanerUpperThingy
{
    private const string p = "<p>";
    private const string closingp = "</p>";

    // Overwrites with spaces every literal "<p>" that is unclosed: either no
    // "</p>" follows it, or another "<p>" opens before the next "</p>".
    // Tags with attributes (e.g. <p class="x">) are not recognized.
    public static string CleanUpXHTML(string xhtml)
    {
        StringBuilder builder = new StringBuilder(xhtml);

        // jump from one <p> to the next instead of rescanning from every index
        int current = xhtml.IndexOf(p, StringComparison.Ordinal);
        while (current != -1)
        {
            int idxofnext = xhtml.IndexOf(p, current + p.Length, StringComparison.Ordinal);
            int idxofclose = xhtml.IndexOf(closingp, current, StringComparison.Ordinal);

            // if there is no closing tag at all, or the next <p> tag comes before it
            if (idxofclose < 0 || (idxofnext >= 0 && idxofnext < idxofclose))
            {
                // same-length replacement keeps every other offset valid
                for (int j = 0; j < p.Length; j++)
                {
                    builder[current + j] = ' ';
                }
            }

            current = idxofnext;
        }

        return builder.ToString();
    }
}

I tested it with your sample and it works. It is a poor formulation for an algorithm, but it should give you a base to start from.
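Since the question asks for a console application, here is a rough sketch of how the same rule could be wrapped for command-line use. Everything named here (`CleanXhtmlTool`, `Clean`, `Run`, the two-argument usage) is an assumption for illustration and not part of the answer above. The rule is expressed as a regular expression instead of index arithmetic: a literal `<p>` counts as unclosed when another `<p>` or the end of the input is reached before any `</p>`.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class CleanXhtmlTool
{
    // The lookahead walks forward over characters that do not start "</p>"
    // and succeeds only if it reaches another "<p>" or the end of the input.
    private static readonly Regex UnclosedP =
        new Regex(@"<p>(?=(?:(?!</p>).)*?(?:<p>|\z))", RegexOptions.Singleline);

    public static string Clean(string xhtml)
    {
        // Three spaces replace the three characters of "<p>",
        // so the document length and all offsets stay the same.
        return UnclosedP.Replace(xhtml, "   ");
    }

    // Call this from Main(string[] args) of a console project.
    public static int Run(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: CleanXhtml <input.xhtml> <output.xhtml>");
            return 1;
        }

        File.WriteAllText(args[1], Clean(File.ReadAllText(args[0])));
        return 0;
    }
}
```

Like the code above, this only recognizes the exact string `<p>`; tags with attributes would need a wider pattern. For very large files the lookahead may rescan long stretches of text, so it is worth timing on real input before relying on it.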
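A <p> is not (officially) allowed to contain another <p>, so the first interpretation is unlikely. When the second <p> is seen, it implies closing the first.
That's a bit tricky. I bet the DOM for that part looks like ../strong>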
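If the implicit-close reading from the comment above is taken as the rule, then a `<p>` that is followed by real content is not an error at all, and only the openers that end up as empty paragraphs are "extra". The sketch below removes just those. The class name `EmptyParagraphOpenerRemover` is made up, and the list of tags that end a paragraph is deliberately cut down to `<p>` and `<hr>` to fit the sample; a real document would probably need more entries.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyParagraphOpenerRemover
{
    private static readonly Regex ExtraOpener = new Regex(
        @"<p(?:\s[^<>]*)?(?<!/)>" +             // an opening <p ...>, but not <p/>
        @"(?=(?:\s|<!--(?:(?!-->).)*-->)*" +    // followed only by whitespace and comments
        @"(?:<p[\s>]|<hr\b|\z))",               // up to the next <p>, an <hr>, or the end
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Deletes opening <p> tags whose paragraph would be empty once it is
    // closed implicitly; comments and whitespace after the tag are kept.
    public static string Remove(string xhtml)
    {
        return ExtraOpener.Replace(xhtml, string.Empty);
    }
}
```

Unlike the index-based approach, this leaves something like `<p>text<p>more</p>` untouched, because the first paragraph has content and is merely closed implicitly.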