C#: How do I split text into words using a regular expression?


I have an .srt file with a fixed text structure. For example:

1
00:00:01,514 --> 00:00:04,185
I'm investigating
Saturday night's shootings.

2
00:00:04,219 --> 00:00:05,754
What's to investigate?
Innocent people

I want to get the individual words, such as "I'm", "investigating", "Saturday", and so on.

I wrote this pattern:

@"[a-zA-Z']"

This pattern is almost right. But the .srt file also contains some useless markup tags like this:

<i>

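I want to get rid of them.

How can I build a pattern that splits the text word by word and removes all text between "<" and ">" (including the brackets themselves)?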

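It is hard to do this with a single regex (at least for me), but you can do it in two steps.

First remove the HTML tags from the string, then extract the words from what is left.

See below: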

var text = "00:00:01,514 --> 00:00:04,185 I'm investigating Saturday night's shootings.<i>";

// remove every HTML-style tag (anything between '<' and '>'), leaving the text around it intact
var noHtml = Regex.Replace(text, @"<[^>]*>", "");

// now match the words with @"[a-zA-Z']+" on noHtml (the '+' makes it match whole words instead of single characters).
// You should get "I'm", "investigating", "Saturday", "night's", "shootings".
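
For reference, the two steps could be put together as a small complete program. This is only a sketch under a few assumptions: the class name `SrtWords` and the method `Extract` are made up for illustration, the sample text is adapted from the question with `<i>` tags added, and tags are replaced with a space rather than an empty string so that words on either side of a tag do not get glued together.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SrtWords
{
    // Hypothetical helper (not part of the original answer).
    public static List<string> Extract(string text)
    {
        // Step 1: replace each <...> tag with a space, keeping the surrounding text.
        string noTags = Regex.Replace(text, @"<[^>]*>", " ");

        // Step 2: collect runs of letters and apostrophes.
        // Digits and punctuation (index lines, timestamps) never match.
        var words = new List<string>();
        foreach (Match m in Regex.Matches(noTags, @"[a-zA-Z']+"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        string text = "1\n00:00:01,514 --> 00:00:04,185\n<i>I'm investigating</i>\nSaturday night's shootings.";
        // prints: I'm|investigating|Saturday|night's|shootings
        Console.WriteLine(string.Join("|", Extract(text)));
    }
}
```

Because the character class contains no digits, the index and timestamp lines of the .srt file drop out without any extra handling.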


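You could use negative lookarounds to assert that the word is not preceded by a "<" followed by a run of non-">" characters, and is not followed by a run of non-"<" characters and then a ">":
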
using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string input = @"
<garbage>
Hello world, <rubbish>it's a wonderful day.



<trash>
";
        foreach (Match match in Regex.Matches(input, @"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)"))
        {
            Console.WriteLine(match.Value);
        }
    }
}


Hello
world
it's
a
wonderful
day