Python: Differentiate between countries and cities in spaCy NER
I am trying to extract countries from organisation addresses using spaCy NER. However, it tags countries and cities with the same label, GPE. Is there any way to tell them apart?
For example:
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
# print every entity that the model labels as GPE
for ent in doc.ents:
    if ent.label_ == 'GPE':
        print(ent.text)
This returns:
Tempe
AZ
United States
United States
Tempe
AZ
United States
Tempe
AZ
United States
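As mentioned, the GPE entity covers countries, cities and states, so you will not be able to detect only country entities with the given model.
I would suggest simply creating a list of countries and then checking whether each GPE entity is in that list.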
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
# create a list of country names that possibly appear in the text
countries = ['US', 'USA', 'United States']
for ent in doc.ents:
    if ent.label_ == 'GPE':
        # check if the value is in the list of countries
        if ent.text in countries:
            print(ent.text, '-- Country')
        else:
            print(ent.text, '-- City or State')
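This will output the following:
Tempe -- City or State
United States -- Country
Monterey -- City or State
United States -- Country
Tempe -- City or State
United States -- Country
United States -- Country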
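The hard-coded list above only matches exact spellings. A minimal sketch of a slightly more forgiving lookup follows. The alias table and the helper name `canonical_country` are hypothetical and would need to be extended for real address data.

```python
import re

# Hypothetical alias table: normalized spelling -> canonical country name.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def canonical_country(text, aliases=COUNTRY_ALIASES):
    """Return the canonical country name for an entity text, or None."""
    # Drop periods ("U.S.A." -> "USA"), collapse whitespace, ignore case.
    key = re.sub(r"\s+", " ", text.replace(".", "")).strip().casefold()
    return aliases.get(key)


for candidate in ["United States", "U.S.A.", "Tempe", "AZ"]:
    country = canonical_country(candidate)
    label = "Country" if country else "City or State"
    print(candidate, "--", label)
```

In the loop above, the test `ent.text in countries` could then be swapped for `canonical_country(ent.text) is not None`, so that variants such as `U.S.A.` are also recognized.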
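As the other answers mention, GPE in the pretrained spaCy models covers countries, cities and states. However, there is a workaround, and I believe a couple of approaches can work.

One approach: you can add custom labels to the model. There is a good article that can help you do this. Gathering training data for it can be a hassle, because you need to label each city/country according to its position in the sentence. To quote another answer:

"spaCy NER model training includes the extraction of other 'implicit' features, such as POS and surrounding words. When you attempt to train on single words, it is unable to get generalized enough features to detect those entities."

An easier workaround is the following. Install geonamescache, then use the following code to get the lists of countries and cities: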
import geonamescache
gc = geonamescache.GeonamesCache()
# gets nested dictionary for countries
countries = gc.get_countries()
# gets nested dictionary for cities
cities = gc.get_cities()
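The documentation states that you can get many other kinds of location data as well.
Use the following function to collect all values stored under a key with a given name in a nested dictionary: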
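The function body does not appear in this text. A minimal sketch of a recursive generator with the same name and call shape as used below, `gen_dict_extract(data, key)`, might look like this. The body is an assumption and not necessarily the answer author's exact code, and the sample data is hand-written rather than real geonamescache output.

```python
def gen_dict_extract(var, key):
    """Yield every value stored under `key` anywhere in a nested dict/list."""
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            # Recurse into nested containers to find deeper matches.
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for item in var:
            yield from gen_dict_extract(item, key)


# Tiny hand-written sample shaped like a nested lookup table.
sample = {
    "US": {"name": "United States", "iso": "US"},
    "FR": {"name": "France", "iso": "FR"},
}
print(list(gen_dict_extract(sample, "name")))
```

Because it is a generator, the result has to be materialized (for example with `list(...)` or `[*...]`) before it can be reused for repeated membership tests.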
Load the cities and the countries into two separate lists:
cities = [*gen_dict_extract(cities, 'name')]
countries = [*gen_dict_extract(countries, 'name')]
Then use the following code to tell them apart:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Resilience Engineering Institute, Tempe, AZ, United States; Naval Postgraduate School, Department of Operations Research, Monterey, CA, United States; Arizona State University, School of Sustainable Engineering and the Built Environment, Tempe, AZ, United States; Arizona State University, School for the Future of Innovation in Society, Tempe, AZ, United States')
# `countries` and `cities` are the name lists built above
for ent in doc.ents:
    if ent.label_ == 'GPE':
        if ent.text in countries:
            print(f"Country : {ent.text}")
        elif ent.text in cities:
            print(f"City : {ent.text}")
        else:
            print(f"Other GPE : {ent.text}")
Output:
City : Tempe
Other GPE : AZ
Country : United States
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
City : Tempe
Other GPE : AZ
Country : United States
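In the output above, `AZ` ends up as `Other GPE` because it is a state abbreviation, not a city or country name. A sketch of a three-way lookup using sets (which give faster membership tests than lists) is shown below. The function name `classify_gpe` and the small hand-written sets are illustrative assumptions. In practice the sets could be built from the `cities` and `countries` lists above plus a list of state names and codes.

```python
def classify_gpe(text, countries, states, cities):
    """Label a GPE entity text as Country, State, City or Other GPE."""
    name = text.strip()
    # Check countries first so "United States" is never reported as a city.
    if name in countries:
        return "Country"
    if name in states:
        return "State"
    if name in cities:
        return "City"
    return "Other GPE"


# Illustrative data only; real sets would come from a gazetteer.
countries = {"United States", "US", "USA"}
states = {"AZ", "Arizona", "CA", "California"}
cities = {"Tempe", "Monterey"}

for text in ["Tempe", "AZ", "United States", "Springfield"]:
    print(f"{classify_gpe(text, countries, states, cities)} : {text}")
```

The order of the checks is a design choice: some names are ambiguous (for example, Georgia is both a country and a US state), so a plain lookup like this can only pick one label per name.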
spaCy's documentation states that the GPE entity type covers countries, cities and states. So is there any workaround?