C#: Need a regex to return the first paragraph or the first n words

c# regex

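I'm looking for a regex that returns the first [n] words in a paragraph or, if the paragraph contains fewer than [n] words, the complete paragraph.

For example, assuming I need at most the first 7 words: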

<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>

I would get:

one two <tag>three</tag> four five, six seven

Using the same regex on a paragraph containing fewer than the requested number of words:

<p>one two <tag>three</tag> four five.</p><p>ignore</p>

would simply return:

one two <tag>three</tag> four five.

My attempt at this problem produced the following regex:

^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)

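However, this returns only the first word, "one". It doesn't work. I think the .*? (after the \w+\b) is causing the problem.

Where am I going wrong? Can anyone provide a regex that works?

FYI, I'm using the .NET 3.5 regex engine (via C#).

Many thanks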

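1. Use an HTML parser to get the first paragraph, flattening its structure (i.e. removing decorative HTML tags inside the paragraph).

2. Search for the position of the nth whitespace character.

3. Take the substring from 0 to that position.

Edit: I removed the regex suggestion for steps 2 and 3 because it was wrong (thanks to the commenter). Also, the HTML structure needs to be flattened.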
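Steps 2 and 3 above could be sketched roughly as follows, assuming the paragraph text has already been extracted and flattened by an HTML parser. The class and method names (`WordPrefixSketch`, `FirstWords`) are hypothetical, and the sketch treats a run of consecutive whitespace characters as a single separator, which is a slight variation on "the nth whitespace character".

```csharp
using System;

public static class WordPrefixSketch
{
    // Returns the text up to (not including) the whitespace that follows the
    // n-th word, or the whole trimmed text if it has fewer than n words.
    // Assumes flatText contains no HTML tags (hypothetical helper, not from the answer).
    public static string FirstWords(string flatText, int n)
    {
        string text = flatText.Trim();
        if (n < 1) return string.Empty;

        int wordsSeen = 0;
        bool inWord = false;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsWhiteSpace(text[i]))
            {
                // A whitespace character right after a word ends that word.
                if (inWord && ++wordsSeen == n)
                    return text.Substring(0, i);
                inWord = false;
            }
            else
            {
                inWord = true;
            }
        }
        // Fewer than n words: return everything.
        return text;
    }
}
```

Called as `WordPrefixSketch.FirstWords(flattenedParagraph, 7)`, this keeps punctuation attached to the words it follows, so "five," counts as one word.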

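OK, complete re-edit to acknowledge the new "spec":

I'm pretty sure you can't do this with a single regex. The best tool is definitely an HTML parser. The closest I can get with regexes is a two-step approach.

First, isolate each paragraph's contents with: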

<p>(.*?)</p>

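The second step then matches the first seven items separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters.

But it will treat a tag that is separated by spaces as one of these items, i.e. in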

One, two three <br\> four five six seven

it will only match up to "six". I guess that, regex-wise, there is no way around this.
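A minimal sketch of such a two-step approach is shown below. The first pattern is the one given above; the second pattern, the `RegexOptions.Singleline` flag, and the names `FirstWordsSketch` and `FirstItems` are assumptions made for illustration, not the answerer's original code. Unlike the description above, this second pattern keeps trailing punctuation on the last item.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstWordsSketch
{
    // Step 1: paragraph pattern from the answer; Singleline lets '.' span line breaks.
    private static readonly Regex Paragraph =
        new Regex(@"<p>(.*?)</p>", RegexOptions.Singleline);

    // Returns up to n whitespace-separated items from the first <p>...</p>.
    public static string FirstItems(string html, int n)
    {
        if (n < 1) return string.Empty;

        Match p = Paragraph.Match(html);
        if (!p.Success) return string.Empty;

        // Step 2 (assumed pattern): at most n-1 "item + whitespace" pairs,
        // followed by one final item.
        string pattern = @"^\s*((?:\S+\s+){0," + (n - 1) + @"}\S+)";
        Match m = Regex.Match(p.Groups[1].Value, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

As noted above, a tag surrounded by spaces (such as the `<br\>` in the example) is counted as an item, so the result can contain fewer real words than requested.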

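I had the same problem and combined a few Stack Overflow answers into this class. It uses HtmlAgilityPack, which is a better tool for the job. Call:

 Words(string html, int n)

to get the first n words.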

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;


namespace UmbracoUtilities
{
    public class Text
    {
      /// <summary>
      /// Return the first n words in the html
      /// </summary>
      /// <param name="html"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string Words(string html, int n)
      {
        string words = html, n_words;

        words = StripHtml(html);
        n_words = GetNWords(words, n);

        return n_words;
      }


      /// <summary>
      /// Returns the first n words in text
      /// Assumes text is not a html string
      /// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
      /// </summary>
      /// <param name="text"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string GetNWords(string text, int n)
      {
        // Collapse runs of whitespace into single spaces and trim the ends,
        // so that splitting on ' ' yields no empty entries.
        // http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
        string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ").Trim();

        // Take exactly n words (fewer if the text is shorter).
        IEnumerable<string> words = cleanedString.Split(' ').Take(n);

        return string.Join(" ", words);
      }


      /// <summary>
      /// Returns a string of html with tags removed
      /// </summary>
      /// <param name="html"></param>
      /// <returns></returns>
      public static string StripHtml(string html)
      {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        var root = document.DocumentNode;
        var stringBuilder = new StringBuilder();

        foreach (var node in root.DescendantsAndSelf())
        {
          if (!node.HasChildNodes)
          {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
              stringBuilder.Append(" " + text.Trim());
          }
        }

        return stringBuilder.ToString();
      }



    }
}


Merry Christmas

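This is perfect, cheers! I know there will never be nested p tags, so regex is a good choice. Thanks for your effort, I really appreciate it (and thanks for pointing out the oversight in my original "spec").

In a character class, \b matches the backspace character. Also, the definition of the problem seems to have changed since you posted this; \w and \W won't cut it.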
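The remark about \b can be checked with a small sketch. The class and method names here (`BackspaceDemo`, `BoundaryCount`, `HasBackspace`) are hypothetical; the sketch only contrasts how the .NET regex engine reads \b outside and inside a character class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackspaceDemo
{
    // Outside a character class, \b is a zero-width word boundary.
    public static int BoundaryCount(string s)
    {
        return Regex.Matches(s, @"\b").Count;
    }

    // Inside a character class, [\b] matches the backspace character (U+0008).
    public static bool HasBackspace(string s)
    {
        return Regex.IsMatch(s, @"[\b]");
    }
}
```

So a pattern such as `[\w\b]` does not mean "word characters or a word boundary"; it means "word characters or a backspace".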
