Python: replace all consecutive repeated letters while ignoring specific words

python regex text

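I have seen many suggestions to use re (regex) or .join to remove consecutive repeated letters from a sentence, but I want to make an exception for certain words.

For example, I want this sentence:

sentence = 'hello, join this meeting heere using thiis lllink'

to become:

'hello, join this meeting here using this link'

given this list of words to keep as they are, skipping the repeated-letter check:

keepWord = ['Hello', 'meeting']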

The two scripts I found useful are:

Using .join:

import itertools

sentence = ''.join(c[0] for c in itertools.groupby(sentence))

Using regex:

import re

sentence = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)
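To make the problem concrete, here is a small sketch that runs both snippets on the sample sentence. It only illustrates why an exception list is needed: both approaches collapse every run, including the double letters in the words that should be kept. The variable names via_groupby and via_regex are chosen here for illustration.

```python
import itertools
import re

sentence = 'hello, join this meeting heere using thiis lllink'

# Collapse every run of identical characters, with no exceptions.
via_groupby = ''.join(key for key, _ in itertools.groupby(sentence))
via_regex = re.sub(r'(.)\1+', r'\1', sentence)

print(via_groupby)
# => helo, join this meting here using this link
print(via_regex)
# => helo, join this meting here using this link
```

Note that 'hello' and 'meeting' lose their legitimate double letters, which is exactly what the keep list is meant to prevent.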

I have a solution, but I think there must be a more compact and efficient way. My current solution is:

import itertools

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']

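new_sentence = ''

# Note: split() keeps punctuation attached, so 'hello,' is not found in keepWord.
for word in sentence.split():
    if word not in keepWord:
        # Collapse each run of identical characters into a single character.
        new_word = ''.join(c[0] for c in itertools.groupby(word))
        new_sentence = new_sentence + " " + new_word
    else:
        new_sentence = new_sentence + " " + word

new_sentence = new_sentence.strip()  # drop the leading space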

Any suggestions?

You can match whole words from the `keepWord` list and, everywhere else, replace only runs of two or more identical letters:

import re
sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
print(new_sentence)
# => hello, join this meeting here using this link

The regex looks like `\b(?:hello|meeting)\b|([^\W\d_])\1+`. If Group 1 matched, its value is returned; otherwise the full match (a word to keep) is returned unchanged.

Pattern details

- `\b(?:hello|meeting)\b` - `hello` or `meeting`, enclosed in word boundaries
- `|` - or
- `([^\W\d_])` - Group 1: any Unicode letter
- `\1+` - one or more backreferences to the Group 1 value
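The pattern above interpolates the keep words directly into the regex. As a hedged sketch, the same idea can be wrapped in a helper that escapes each word first. The name collapse_repeats and the fallback for an empty list are assumptions made for illustration, not part of the answer.

```python
import re

def collapse_repeats(text, keep_words):
    """Collapse runs of a repeated letter, leaving whole keep_words untouched."""
    run = r"([^\W\d_])\1+"  # group 1: a letter followed by one or more copies of itself
    if keep_words:
        # Escape each word so regex metacharacters in it are matched literally.
        keep = "|".join(re.escape(word) for word in keep_words)
        pattern = rf"\b(?:{keep})\b|{run}"
    else:
        # Assumption: with no keep words, only the repeated-letter rule applies.
        pattern = run
    # Group 1 is set only for a repeated-letter run; otherwise a keep word
    # matched and is returned unchanged.
    return re.sub(pattern, lambda m: m.group(1) or m.group(), text)

print(collapse_repeats('hello, join this meeting heere using thiis lllink',
                       ['hello', 'meeting']))
# => hello, join this meeting here using this link
```

Keep words are matched only as whole words, so a misspelling such as 'hellooo' is not protected and is still collapsed to 'helo'.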

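Although not especially compact, here is a fairly simple example using regular expressions: the function `subst` replaces each run of repeated characters with a single character, and `re.sub` calls it for every word it finds.

This assumes that, because the `keepWord` list in your example (where it is first mentioned) has `Hello` in title case while the text has `hello` in lower case, you want a case-insensitive comparison against the list. So it works equally well whether your sentence contains `Hello` or `hello`.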

import re

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['Hello','meeting']

keepWord_s = set(word.lower() for word in keepWord)

def subst(match):
    word = match.group(0)
    return word if word.lower() in keepWord_s else re.sub(r'(.)\1+', r'\1', word)

print(re.sub(r'\b.+?\b', subst, sentence))

This gives:

hello, join this meeting here using this link

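What would you expect if `Helllo` occurs? My suggestion does not handle that case; it could be solved by ignoring the first occurrence of the letter in the `else` branch.

Great, this produces exactly the expected output. Thank you.

@Aisha If you need a case-insensitive search, add `(?i)` at the start of the regex pattern, or pass `flags=re.I` to `re.sub` as a keyword argument (the fourth positional parameter of `re.sub` is `count`): `re.sub(..., ..., sentence, flags=re.I)`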
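As a rough sketch of the case-insensitive suggestion, the example below applies `flags=re.I` to the regex from the first answer. One caveat: a global flag also makes the backreference case-insensitive, so a run such as "Aa" is collapsed too. The scoped inline flag `(?i:...)` (Python 3.6+) is shown as an alternative that ignores case only for the keep words. That alternative and the helper name keep_or_collapse are additions for illustration, not something proposed in the thread.

```python
import re

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['Hello', 'meeting']
alternatives = '|'.join(re.escape(word) for word in keepWord)

def keep_or_collapse(match):
    # Group 1 is set only when a repeated-letter run matched.
    return match.group(1) or match.group()

# Option 1: global flag, the whole pattern ignores case.
global_pattern = re.compile(rf"\b(?:{alternatives})\b|([^\W\d_])\1+", flags=re.I)

# Option 2: scoped flag, only the keep-word alternation ignores case.
scoped_pattern = re.compile(rf"\b(?i:{alternatives})\b|([^\W\d_])\1+")

print(global_pattern.sub(keep_or_collapse, sentence))
# => hello, join this meeting here using this link
print(scoped_pattern.sub(keep_or_collapse, sentence))
# => hello, join this meeting here using this link

# The two options differ on mixed-case runs such as "Aa".
print(global_pattern.sub(keep_or_collapse, 'Aardvark HELLO'))
# => Ardvark HELLO
print(scoped_pattern.sub(keep_or_collapse, 'Aardvark HELLO'))
# => Aardvark HELLO
```

Which option fits depends on whether "Aa" should count as a repeated letter in your data.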