Python regex: searching a text for similar country names


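I want to determine which countries from a predefined list appear in a text. The problem is that some names are very similar, so when one country appears in the text, another one is detected as well. For example: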

text1 = "The disease has spread to three countries: Guinea, Guinea-Bassau and Equatorial Guinea."

text2 = "Only Guinea-Bassau and Equatorial Guinea contained strains of the virus."

list_of_countries = ['Guinea', 'Guinea-Bassau', 'Equatorial Guinea']

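I have not found code that returns all three list items for text1 but only 'Guinea-Bassau' and 'Equatorial Guinea' for text2.

This is just one specific example. I could of course build an ad hoc solution for the three African Guineas, but the same problem comes back with 'Republic of the Congo' and 'Democratic Republic of the Congo', and so on.

Edit: I think one way to solve this would be to remove or discard any instance from the text as soon as it has matched the longest possible country name.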

You can use

import re

text1 = "The disease has spread to three countries: Guinea, Guinea-Bassau and Equatorial Guinea."
text2 = "Only Guinea-Bassau and Equatorial Guinea contained strains of the virus."
list_of_countries = ['Guinea', 'Guinea-Bassau', 'Equatorial Guinea']

# Sort the list by length in descending order
list_of_countries = sorted(list_of_countries, key=len, reverse=True)
# Build the alternation-based regex with \b to match each item as a whole word
rx = r'\b(?:{})\b'.format("|".join(list_of_countries))
print(re.findall(rx, text1))
# => ['Guinea', 'Guinea-Bassau', 'Equatorial Guinea']
print(re.findall(rx, text2))
# => ['Guinea-Bassau', 'Equatorial Guinea']


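Note that sorting the list of countries by length in descending order is important, because the items in the list may contain whitespace and may start at the same position in the string.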
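The effect of the sort order can be checked directly. The following is a minimal sketch that reuses `text2` and the country list from the question; the helper name `build_pattern` is an assumption introduced here for illustration only.

```python
import re

text2 = "Only Guinea-Bassau and Equatorial Guinea contained strains of the virus."
list_of_countries = ['Guinea', 'Guinea-Bassau', 'Equatorial Guinea']


def build_pattern(names):
    # Hypothetical helper: join the names into one whole-word alternation.
    return r'\b(?:{})\b'.format("|".join(names))


# Original order: 'Guinea' is tried first, and because '-' is a non-word
# character, \b succeeds right after it, so the longer name is never tried.
unsorted_hits = re.findall(build_pattern(list_of_countries), text2)

# Longest names first: 'Guinea-Bassau' is tried before 'Guinea'.
by_length = sorted(list_of_countries, key=len, reverse=True)
sorted_hits = re.findall(build_pattern(by_length), text2)

print(unsorted_hits)  # ['Guinea', 'Equatorial Guinea']
print(sorted_hits)    # ['Guinea-Bassau', 'Equatorial Guinea']
```

Python's `re` alternation takes the first alternative that lets the overall match succeed, not the longest one, which is why the order of the names matters.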

The regex is

\b(?:Equatorial Guinea|Guinea-Bassau|Guinea)\b


Details

- `\b` - a word boundary
- `(?:` - start of a non-capturing group, so that the word boundaries apply to each alternative
  - `Equatorial Guinea`
  - `|` - or
  - `Guinea-Bassau`
  - `|` - or
  - `Guinea`
- `)` - end of the group
- `\b` - a word boundary
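To see why the group matters, compare the pattern with and without it. This is an illustrative sketch; the sample sentence and the names `Niger` and `Chad` are assumptions chosen for the demonstration and are not part of the original question.

```python
import re

text = "Nigeria borders Niger and Chad."

# Without a group, | splits the whole pattern into `\bNiger` and `Chad\b`,
# so each alternative is anchored on one side only.
ungrouped = re.findall(r'\bNiger|Chad\b', text)

# With a non-capturing group, both \b anchors apply to every alternative.
grouped = re.findall(r'\b(?:Niger|Chad)\b', text)

print(ungrouped)  # ['Niger', 'Niger', 'Chad']
print(grouped)    # ['Niger', 'Chad']
```

The ungrouped pattern picks up `Niger` inside `Nigeria` because nothing requires a boundary after it. A capturing group would also work with `re.findall` here, since it returns the group's text; the non-capturing form simply avoids creating a group that is not needed.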

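@jdehesa OK, whitespace boundaries won't work here. @不顾一切 Try:

list_of_countries=sorted(list_of_countries,key=len,reverse=True)
rx=r'\b(?:{})\b'.format("|".join(list_of_countries))
print(re.findall(rx, text1))

@WiktorStribiż I see, awesome. It seems to work without sorting the list? It even works in more complex cases.

A well-constructed regex solution will return the longest match and will not return overlapping matches anyway. If you tried to solve this with a regex, please show us your regexp. Edit it into the question, not into a comment.

@WiktorStribiżew Your solution works great, thanks. I didn't know about the (?:..) option in regex.

@Knowname I posted one below.

FYI: consider an example with Guinea and Guinea Bassau in both the country list and the text (I removed the hyphen to show the problem). Note how Bassau is missing from the result list.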
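The failure mode described in the last comment can be reproduced, and avoided, with a small helper. This is a sketch under stated assumptions: `find_countries` is a hypothetical name, the sample text is made up to match the comment's description, and the `re.escape` call is an added precaution that does not appear in the original answer.

```python
import re


def find_countries(text, countries):
    """Return country names found in text, preferring the longest name."""
    # Longest names first, so 'Guinea Bassau' is tried before 'Guinea'.
    ordered = sorted(countries, key=len, reverse=True)
    # re.escape guards against regex metacharacters in names (an added
    # precaution, not part of the original answer).
    pattern = r'\b(?:{})\b'.format("|".join(re.escape(name) for name in ordered))
    return re.findall(pattern, text)


countries = ['Guinea', 'Guinea Bassau']
text = "Guinea and Guinea Bassau"

# Unsorted alternation for comparison: 'Guinea' wins at both positions.
naive = re.findall(r'\b(?:{})\b'.format("|".join(countries)), text)

print(naive)                            # ['Guinea', 'Guinea']
print(find_countries(text, countries))  # ['Guinea', 'Guinea Bassau']
```

Note that `\b` assumes each name begins and ends with a word character; a name ending in punctuation would need a different boundary check.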