C#: Split paragraphs out of an HTML string and remove empty paragraphs

c# html regex

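I have an HTML string. I want to split all of its paragraphs into an ArrayList, but the extracted paragraphs must not be empty. Each extracted paragraph should contain some plain text. If a paragraph contains only HTML markup (tags or entities such as &nbsp;) and no plain text, it should be discarded rather than returned.

Here is an example of how I split the paragraphs out of the HTML string: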

System.Text.RegularExpressions.Match m = System.Text.RegularExpressions.Regex.Match(htmlString, @"<p>\s*(.+?)\s*</p>");
ArrayList groupCollection = new ArrayList();
while (m.Success)
{
   groupCollection.Add(m.Value);
   m = m.NextMatch();
}
ArrayList paragraphs = new ArrayList();
if (groupCollection.Count > 0)
{
   foreach (object item in groupCollection)
   {
      paragraphs.Add(item);
   }
}

The code above splits out every paragraph, but it cannot recognize the empty paragraphs I described.

I found an answer to my own question. Here is my version of the code:

// Requires: using System.Collections; (for ArrayList)
System.Text.RegularExpressions.Match m = System.Text.RegularExpressions.Regex.Match(htmlString, @"<p>\s*(.+?)\s*</p>");
ArrayList groupCollection = new ArrayList();
while (m.Success)
{
    groupCollection.Add(m.Value);
    m = m.NextMatch();
}
ArrayList paragraphs = new ArrayList();
// Matches any HTML tag; created once instead of on every iteration
System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");
foreach (object item in groupCollection)
{
    // Replace all tags with an empty string
    string str = rx.Replace(item.ToString(), "");
    // Drop non-breaking space entities as well
    string str1 = str.Replace("&nbsp;", "");
    // Keep the paragraph only if some visible text remains;
    // IsNullOrWhiteSpace also rejects leftovers that are only spaces
    if (!String.IsNullOrWhiteSpace(str1))
    {
        paragraphs.Add(item.ToString());
    }
}

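In the code above, I first strip all HTML tags from each paragraph and then remove every &nbsp; entity left in the string. If nothing remains afterwards, the paragraph is empty and is skipped.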
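The same idea (strip the tags, then check whether any text is left) can be sketched a little more defensively. This is only a sketch under a few assumptions: the class and method names (`ParagraphSplitter`, `SplitNonEmptyParagraphs`) are made up for illustration, the pattern is widened so that `<p>` tags with attributes and paragraphs spanning several lines also match, and `System.Net.WebUtility.HtmlDecode` is used to turn entities such as `&nbsp;` into characters instead of replacing them by hand. Regex-based matching still assumes reasonably well-formed, non-nested `<p>` elements.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ParagraphSplitter
{
    // Matches <p> or <p ...attributes> up to the nearest </p>.
    // Singleline lets '.' cross line breaks; IgnoreCase also accepts <P>.
    private static readonly Regex ParagraphRegex = new Regex(
        @"<p\b[^>]*>.*?</p>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches any single tag, e.g. <br/> or <span class="x">.
    private static readonly Regex TagRegex = new Regex(@"<[^>]*>");

    // Hypothetical helper: returns only the paragraphs that contain visible text.
    public static List<string> SplitNonEmptyParagraphs(string html)
    {
        var paragraphs = new List<string>();
        foreach (Match match in ParagraphRegex.Matches(html))
        {
            // Remove the markup, then decode entities such as &nbsp; into characters.
            string text = WebUtility.HtmlDecode(TagRegex.Replace(match.Value, ""));

            // The decoded &nbsp; (U+00A0) counts as white space in .NET,
            // so paragraphs holding only spaces, line breaks or &nbsp; are skipped.
            if (!string.IsNullOrWhiteSpace(text))
            {
                paragraphs.Add(match.Value);
            }
        }
        return paragraphs;
    }

    public static void Main()
    {
        string html = "<p>Hello <b>world</b></p><p> <br/>&nbsp;</p><p class=\"x\">Second</p>";
        foreach (string p in SplitNonEmptyParagraphs(html))
        {
            Console.WriteLine(p);
        }
    }
}
```

With the sample string in `Main`, the middle paragraph contains only a `<br/>` tag, a space and `&nbsp;`, so only the first and third paragraphs are returned. For arbitrary real-world HTML, a dedicated HTML parser is usually more robust than regular expressions.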

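What have you tried?

I tried a regular expression to split all the paragraphs out of the HTML string, but I cannot tell whether a paragraph is empty.

Can you post your code with the question?

If I use a regular expression, it only helps me split the paragraphs. If a paragraph contains nothing but markup, I don't know how to remove those empty HTML tags.

This may help you. It is the same as your question...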