Regex: remove repeated words from an input string, using VBA regular expressions

regex vba excel

Language: VBA. Environment: Excel 2007. Tool: RegExp object.

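Hi! I am trying to remove repeated words from input strings that represent addresses. I receive an Excel worksheet column containing address segments that have been combined in some way. The real data is in Cyrillic, but expressed in English it would look like this: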

125424, RepeatedName, RepeatedName, and some words, 75
194044, Repeated-dashedName, Repeated-dashedName, other Uniques, 3
300911, Normal non-repeated, names, dashed and non-Dashed, 123

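The text is case-insensitive and may contain digits, punctuation and spaces. It is known that repeated words can only occur one right after another, with nothing between the repeated instances other than commas and spaces. I need to remove the duplicates of both dashed and non-dashed words, keeping just one instance of each repeated word: a single "RepeatedName", and likewise a single "Repeated-dashedName". So the ideal result would look like this: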

125424, RepeatedName, and some words, 75
194044, Repeated-dashedName, other Uniques, 3
300911, Normal non-repeated, names, dashed and non-Dashed, 123

I have tried several variants of code to solve this, but a working one keeps eluding me. My best guess was:

Option Explicit
' New RegExp below needs a reference to "Microsoft VBScript Regular Expressions 5.5"
Dim strIn As String, strPattern As String, strReplace As String, strResult As String
Dim regex As Object

strIn = fnGetInputString()
strPattern = ".*\b((\w+)\b.*\1).*"
strReplace = "$1"

If regex Is Nothing Then Set regex = New RegExp

With regex
  .MultiLine = False
  .Global = True
  .IgnoreCase = True
  .Pattern = strPattern
End With

strResult = regex.Replace(strIn, strReplace)

But the only result I get is the following:

75
3
123

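So I am not capturing and reusing the repeated group correctly in my regex. Any help would be appreciated.

I am new to regex. I have read documentation, articles, discussions and StackOverflow questions, but have not found a working answer.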

This regex works for the examples given:

\b([a-zA-Z-]+)[^a-zA-Z-]+\1\b

Basically, it works like this:

\b([a-zA-Z-]+)[^a-zA-Z-]+\1\b
 ^                          ^      assert a word boundary
   ^   ^   ^ ^                     capture a 'word' series of characters
                ^                  separated by non 'word' characters
                         ^         where the captured word is then repeated
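
As a rough illustration, the pattern above could be dropped into a small VBA function like the one below. This is only a sketch: the function name RemoveAdjacentDuplicates is hypothetical, the "$1" replacement string is borrowed from the question's own code, and late binding through CreateObject("VBScript.RegExp") is an assumption made so that no project reference is needed.

```vba
' Hypothetical helper (name is illustrative, not from the thread):
' collapses "Word, Word" into "Word" for ASCII letter-and-dash words.
Public Function RemoveAdjacentDuplicates(ByVal strIn As String) As String
    Dim regex As Object
    ' Late binding is an assumption; it avoids setting a project reference
    Set regex = CreateObject("VBScript.RegExp")

    With regex
        .Global = True        ' replace every repeated pair in the string
        .IgnoreCase = True    ' the question says the text is case-insensitive
        .Pattern = "\b([a-zA-Z-]+)[^a-zA-Z-]+\1\b"
    End With

    ' Keep only the captured first instance of each repeated word
    RemoveAdjacentDuplicates = regex.Replace(strIn, "$1")
End Function
```

Called on the first sample line, RemoveAdjacentDuplicates("125424, RepeatedName, RepeatedName, and some words, 75") should give "125424, RepeatedName, and some words, 75", while a line with no repeats, such as the third sample, should come back unchanged.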

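The character class [a-zA-Z-] is limited to "letters" in the ASCII sense, plus the dash. (Make sure the - is at the end of the character class, otherwise you will be defining a range.)

For non-Latin or non-ASCII character sets, you can use \p{L} in more modern regex engines, or you can invert the meaning of a "word" by excluding everything that is not part of a word: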

\b([^ ,]+)[ ,]+\1\b

     ^                    a word is not a space or a comma...
            ^             a word delimiter is a space or comma...
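
A minimal sketch of the inverted pattern in VBA might look like the following. The function name RemoveRepeatedSegments is hypothetical, and late binding is again an assumption. The sample input is plain ASCII; how the pattern behaves on non-ASCII text depends on how the engine defines \b.

```vba
' Hypothetical helper: a "word" is any run of characters other than a
' space or a comma; a delimiter is any run of spaces and commas.
Public Function RemoveRepeatedSegments(ByVal strIn As String) As String
    Dim regex As Object
    ' Late binding is an assumption; it avoids setting a project reference
    Set regex = CreateObject("VBScript.RegExp")

    With regex
        .Global = True
        .IgnoreCase = True
        .Pattern = "\b([^ ,]+)[ ,]+\1\b"
    End With

    ' Keep only the first instance of each repeated segment
    RemoveRepeatedSegments = regex.Replace(strIn, "$1")
End Function
```

With this version, RemoveRepeatedSegments("194044, Repeated-dashedName, Repeated-dashedName, other Uniques, 3") should give "194044, Repeated-dashedName, other Uniques, 3". One thing to keep in mind is that digits now count as word characters too, so a number repeated back to back (for example "5, 5") would also be collapsed to one instance.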

This works even in a basic regex engine such as sed.

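What about a case like 125424, RepeatedName, someOtherWord, RepeatedName, and some words, 75, where the second instance of RepeatedName can be anywhere in the string (i.e., not only right after the first instance)?

The possibility is not zero, but it is close to zero. We can ignore those cases.

Dude, thanks for your answer. I understand the logic and the mechanism. I have tested it on input strings that mix English and Cyrillic letters. The search and replace only works for English duplicates, even though I entered the entire alphabet in upper and lower case in addition to the English letter range you gave. So I suspect the problem may be related to encoding or something else.

If Excel supports it, use the \p{L} metacharacter for "letter".

I have tried strPattern = "\b([\p{L}-]+)[^\p{L}-]+\1\b", but regex.Test(strIn) returns False. So I assume the Excel VBA regex does not support the \p{L} metacharacter.

You can also use \b([^ ,]+)[ ,]+\1\b to capture non-space, non-comma characters followed by spaces or commas. This will pick up non-ASCII characters. The general scheme is the same: X+ is a word, [^X]+ is not, then X+ is repeated. Just use the set of characters you want to capture, negate it for the set that delimits, then go back to the captured set.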
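
The general scheme can be sketched as a function that builds the pattern from a delimiter set. Note that this is a swapped-in variant rather than the pattern from the thread: it replaces \b with explicit delimiter checks, on the assumption (not confirmed in the thread) that \b in the VBScript engine is defined in terms of ASCII word characters and therefore may not fire next to Cyrillic letters. The function name RemoveRepeatsByDelimiters and its strDelims parameter are hypothetical, and strDelims is assumed to contain no characters that are special inside a character class (such as ], ^, \ or -).

```vba
' Hypothetical helper: the explicit-delimiter anchoring is a variant of
' the thread's pattern, written under the assumptions stated above.
Public Function RemoveRepeatsByDelimiters(ByVal strIn As String, _
        Optional ByVal strDelims As String = " ,") As String
    Dim regex As Object
    Dim strWord As String, strSep As String

    strWord = "[^" & strDelims & "]+"   ' X+ : a run of non-delimiters
    strSep = "[" & strDelims & "]"      ' a single delimiter character

    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Global = True
        .IgnoreCase = True
        ' (start or delimiter)(word) delimiters, the same word again,
        ' then a lookahead for a delimiter or the end of the string
        .Pattern = "(^|" & strSep & ")(" & strWord & ")" & _
                   strSep & "+\2(?=" & strSep & "|$)"
    End With

    ' $1 restores the leading delimiter, $2 is the single kept word
    RemoveRepeatsByDelimiters = regex.Replace(strIn, "$1$2")
End Function
```

On the first sample line this should give the same result as the original pattern, "125424, RepeatedName, and some words, 75". Because word boundaries are expressed only through the delimiter set, the match no longer depends on which characters the engine treats as letters.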