C#: Why does Regex.Split add "\r\n" entries to the result? (C#, Regex, Split)

c# regex

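I want to split the body of an article by its HTML div tags, so I wrote a pattern that searches for divs. The problem is that the result also appears to be split on "\r\n". [image: enter image description here]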

string pattern = @"<div[^<>]*>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern,RegexOptions.None);
Response.Write("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
    Response.Write("bodyParagraphs" + i + "= " + bodyParagraphsnew[i]+ Environment.NewLine);
}


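When I debug this code, I see many "\r\n" entries in the bodyParagraphsnew array.
It looks as if the pattern also splits on the string "\r\n".
I tried replacing "\r\n" with an empty string, hoping that the length of bodyParagraphsnew would change, but it did not. Instead of the "\r\n" items, I got items containing "" in the array.
Why?
The image linked above illustrates the problem.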
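What you are seeing is the text between the end of one </div> closing tag and the start of the next <div> opening tag. That is what Regex.Split does: it returns the text found between the regex matches.
What is odd here, though, is that you also get the text between the opening and closing tags. That happens because you put parentheses in the pattern, which forms a capture group, and Regex.Split includes captured text in its result. Consider the following program: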
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string body = "<div>some text</div>\r\n<div>some more text</div>";

        string pattern = @"<div[^>]*?>(.*?)</div>";
        string[] bodyParagraphsnew = Regex.Split(body, pattern, RegexOptions.None);
        Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Length);
        for (int i = 0; i < bodyParagraphsnew.Length; i++)
        {
            Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i]);
        }
    }
}

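What you get from this is:
"" - the empty string taken from before the first <div>
"some text" - the content of the first <div>, because of the capture group
"\r\n" - the text between the end of the first <div> and the start of the last <div>
"some more text" - the content of the second <div>, again because of the capture group
"" - the empty string taken from after the last </div>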
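To see the effect of the capture group in isolation, here is a minimal sketch that runs the same split with the parentheses removed. The helper name SplitWithoutCapture is made up for this illustration; the input string is the one from the program above.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: splits on whole <div>...</div> elements.
    // The pattern has no capture group, so Regex.Split returns only
    // the text that lies between the matches.
    public static string[] SplitWithoutCapture(string body)
    {
        // Same pattern as before, minus the parentheses around .*?
        string pattern = @"<div[^>]*?>.*?</div>";
        return Regex.Split(body, pattern, RegexOptions.None);
    }

    static void Main(string[] args)
    {
        string body = "<div>some text</div>\r\n<div>some more text</div>";
        string[] parts = SplitWithoutCapture(body);

        // Three entries remain: "", "\r\n" and "".
        Console.WriteLine("num of parts =" + parts.Length);
        for (int i = 0; i < parts.Length; i++)
        {
            Console.WriteLine("part {0}: '{1}'", i, parts[i]);
        }
    }
}
```

Without the capture group, the div contents disappear from the array and only the separators between the divs are left. Keeping the original pattern and passing RegexOptions.ExplicitCapture should have the same effect, because unnamed groups are then not captured.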
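What you probably want is the content of the div tags. That can be achieved with the following code, which uses Regex.Matches and reads the capture group of each match: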
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string body = "<div>some text</div>\r\n<div>some more text</div>";

        string pattern = @"<div[^>]*?>(.*?)</div>";
        MatchCollection bodyParagraphsnew = Regex.Matches(body, pattern, RegexOptions.None);
        Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Count);
        for (int i = 0; i < bodyParagraphsnew.Count; i++)
        {
            Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i].Groups[1].Value);
        }
    }
}

Note, however, that in HTML, div tags can be nested inside one another. For example, the following is a valid HTML string:
string test = "<div>Outer div<div>inner div</div>outer div again</div>";

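In that case the regular expression will not work, mainly because HTML is not a regular language. To handle this situation you would need to write a parser (of which regular expressions are only a small part). Personally, though, I would not bother, since plenty of open-source HTML parsers are already available.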
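The failure on nested divs can be reproduced with a short sketch. The helper name DivContents is made up for this illustration; the pattern and the test string are the ones used above.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: returns the group-1 text of every match
    // of the div pattern used earlier in this answer.
    public static string[] DivContents(string html)
    {
        string pattern = @"<div[^>]*?>(.*?)</div>";
        MatchCollection matches = Regex.Matches(html, pattern, RegexOptions.None);

        string[] contents = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            contents[i] = matches[i].Groups[1].Value;
        }
        return contents;
    }

    static void Main(string[] args)
    {
        string test = "<div>Outer div<div>inner div</div>outer div again</div>";

        // The lazy .*? stops at the first </div>, which belongs to the
        // inner div, so the only match is cut off in the middle.
        foreach (string content in DivContents(test))
        {
            Console.WriteLine("'{0}'", content);
        }
        // Prints a single line: 'Outer div<div>inner div'
    }
}
```

The pattern pairs the outer opening tag with the inner closing tag, so the text "outer div again" is never returned and the inner div is not reported as an element of its own.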
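There are two possibilities:
Use a List instead of an array, and call List.Remove
Search the array for "\r\n" and remove it by index
// Requires "using System.Linq;". Intended to run inside the loop over i.
if (bodyParagraphsnew[i] == "\r\n")
{
    // Rebuild the array without any entry equal to the one at index i.
    bodyParagraphsnew = bodyParagraphsnew.Where(w => w != bodyParagraphsnew[i]).ToArray();
}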


Not very elegant, but it may be exactly what you want.
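As a variation on the same idea, the filtering can be done once, right after the split, instead of inside the loop. This is only a sketch: the helper name SplitAndDropBlank is made up, and it assumes that empty and whitespace-only entries are never wanted, which also covers the "" entries mentioned in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: splits with the pattern from the question,
    // then drops every entry that is empty or only whitespace
    // (this removes both "" and "\r\n").
    public static string[] SplitAndDropBlank(string body)
    {
        string pattern = @"<div[^>]*?>(.*?)</div>";
        return Regex.Split(body, pattern, RegexOptions.None)
                    .Where(s => !string.IsNullOrWhiteSpace(s))
                    .ToArray();
    }

    static void Main(string[] args)
    {
        string body = "<div>some text</div>\r\n<div>some more text</div>";
        string[] paragraphs = SplitAndDropBlank(body);

        // Two entries remain: "some text" and "some more text".
        Console.WriteLine("num of paragraph =" + paragraphs.Length);
        for (int i = 0; i < paragraphs.Length; i++)
        {
            Console.WriteLine("bodyParagraphs {0}: '{1}'", i, paragraphs[i]);
        }
    }
}
```

One side effect to keep in mind: a div whose content is itself empty or whitespace-only is dropped as well.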
Can you show us an example of a body string that causes this problem?