How to locate a string and its substring in a sentence in Python

python regex

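I am trying to locate items in sentences with a regex (one item is a substring of another), but it always locates the substring. For example, there are two items ["The Duke", "The Duke of A"] and some sentences:

The Duke

The Duke is a movie.

How is the movie The Duke?

The Duke of A

The Duke of A is a movie.

How is the movie The Duke of A?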

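After locating the items, what I want is:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke_of_A

The_Duke_of_A is a movie.

How is the movie The_Duke_of_A?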

The code I tried is:

for sent in sentences:
    for item in ["The Duke", "The Duke of A"]:
        find = re.search(r'{0}'.format(item), sent)
        if find:
            sent = sent.replace(sent[find.start():find.end()], item.replace(" ", "_"))

But I got:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke of A

The_Duke of A is a movie.

How is the movie The_Duke of A?

Changing the order of the items in the list does not suit my case, because my list is large (more than 10,000 items).

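Your loop looks for "The Duke" first. If re finds a match, you replace it with "The_Duke". On the second pass of the loop you look for "The Duke of A", but there is nothing left to match, because the earlier replacement already changed that text.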
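A minimal sketch of that failure, using one of the sample sentences. Plain str.replace stands in here for the search-and-replace step of the loop in the question:

```python
import re

sent = "The Duke of A is a movie."

# First pass: the shorter item matches inside the longer one.
sent = sent.replace("The Duke", "The_Duke")
print(sent)  # The_Duke of A is a movie.

# Second pass: the longer item is no longer present, so nothing matches.
found = re.search("The Duke of A", sent)
print(found)  # None
```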

Putting the longer item first should work:

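import re

for sent in sentences:  # sentences: the iterable of input strings from the question
    # Longer item first, so "The Duke" cannot match inside "The Duke of A"
    for item in ["The Duke of A", "The Duke"]:
        find = re.search(r'{0}'.format(item), sent)
        if find:
            sent = sent.replace(sent[find.start():find.end()], item.replace(" ", "_"))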
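With a list of more than 10,000 items the order does not have to be maintained by hand. As a sketch, assuming the only requirement is that an item is tried before any of its substrings, sorting by length in descending order is enough. The names `items`, `ordered`, and `result` below are illustrative, not from the question:

```python
sentences = ["The Duke is a movie.", "How is the movie The Duke of A?"]
items = ["The Duke", "The Duke of A"]  # original, unsorted order

# Longest first: a string can never be a proper substring of a shorter one,
# so "The Duke of A" is always tried before "The Duke".
ordered = sorted(items, key=len, reverse=True)

result = []
for sent in sentences:
    for item in ordered:
        sent = sent.replace(item, item.replace(" ", "_"))
    result.append(sent)

print(result)
# ['The_Duke is a movie.', 'How is the movie The_Duke_of_A?']
```

This still scans every sentence once per item, so it may be slow for very large lists.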

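If you cannot change the order of the items in the list, you can try this version. In the first pass we collect all the matches, and in the second pass we do the replacements. This works because replacing a space with an underscore keeps the string length unchanged, so the collected positions stay valid: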

data = '''The Duke
The Duke is a movie.
How is the movie The Duke?
The Duke of A
The Duke of A is a movie.
How is the movie The Duke of A?'''

terms = ["The Duke", "The Duke of A"]

import re

to_change = []
for t in terms:
    for g in re.finditer(t, data):
        to_change.append((g.start(), g.end()))

for (start, end) in to_change:
    data = data[:start] + re.sub(r'\s', r'_', data[start:end]) + data[end:]

print(data)

Prints:

The_Duke
The_Duke is a movie.
How is the movie The_Duke?
The_Duke_of_A
The_Duke_of_A is a movie.
How is the movie The_Duke_of_A?

Swap the positions of "The Duke of A" and "The Duke":

for item in ["The Duke", "The Duke of A"]:

becomes

for item in ["The Duke of A", "The Duke"]:

You can use re.sub. Its repl argument can be a function, so you only need to replace the spaces in each match:

import re

with open("filename.txt") as sentences:
    for line in sentences:
        print(re.sub(r"The Duke of A|The Duke",
                     lambda s: s[0].replace(' ', '_'),
                     line))

This results in:

The_Duke

The_Duke is a movie.

How is the movie The_Duke?

The_Duke_of_A

The_Duke_of_A is a movie.

How is the movie The_Duke_of_A?

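What if a sentence contains both substrings, "The Duke" and "The Duke of A"?

@bharatk There is no such sentence in my data.

Look at your sample sentence "The Duke of A is a movie." It contains both substrings, "The Duke" and "The Duke of A", and it always matches the "The Duke" substring first.

Thanks @Jab, but I still have a big list. It is not possible to list all the items in the regex.

If by "big list" you mean a big list of replacements, then just use

re.sub("(" + "|".join(replacements) + ")", …)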
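A rough sketch of how that suggestion might be applied to a large list. The helper names `build_pattern` and `underscore` are hypothetical, and two details are assumptions added here rather than part of the comment: the terms are sorted longest first, because regex alternation takes the first alternative that matches, and each term is passed through re.escape in case it contains regex metacharacters:

```python
import re

def build_pattern(terms):
    """Compile one alternation pattern covering every term."""
    # Longest first, so "The Duke of A" wins over "The Duke" at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in ordered))

def underscore(match):
    """Replacement function: swap spaces for underscores in the matched text."""
    return match.group(0).replace(" ", "_")

terms = ["The Duke", "The Duke of A"]
pattern = build_pattern(terms)

lines = ["The Duke is a movie.", "How is the movie The Duke of A?"]
result = [pattern.sub(underscore, line) for line in lines]
print(result)
# ['The_Duke is a movie.', 'How is the movie The_Duke_of_A?']
```

The pattern is compiled once and each line is scanned a single time, regardless of how many terms the list holds.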