Python: extracting nationalities and countries from text

Tags: python, nlp, nltk, pos-tagger

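I want to use NLTK to extract all countries and nationalities from a text. I POS-tag the tokens and then collect every chunk labelled GPE, but the results are not satisfactory.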

import nltk

abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

# Tokenize, POS-tag, then run NLTK's named-entity chunker
sent = nltk.tokenize.wordpunct_tokenize(abstract)
pos_tag = nltk.pos_tag(sent)
nes = nltk.ne_chunk(pos_tag)

# Keep only the chunks labelled GPE (geo-political entity)
places = []
for ne in nes:
    if isinstance(ne, nltk.Tree) and ne.label() == 'GPE':
        places.append(' '.join(token for token, tag in ne.leaves()))

# Fall back to a placeholder only after the whole text has been scanned
if not places:
    places.append("N/A")
The result is:

['Thyroid', 'Australian', 'Caucasian', 'Graves']
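Some of these are nationalities, while others are just nouns.

So what am I doing wrong, or is there another way to extract this information?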

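If you want to extract country names, what you need is an NER tagger, not a POS tagger.

Named-entity recognition (NER) is a subtask of information extraction that locates elements in text and classifies them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages.

Take a look at the Stanford NER tagger: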

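from nltk.tag import StanfordNERTagger  # named NERTagger in older NLTK releases

# Paths to the serialized model and the Stanford NER jar (Java must be installed)
st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
# `text` is the string to analyse; tag() returns a list of (token, tag) pairs
tagging = st.tag(text.split())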
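The tagger returns one tag per token, so multi-word names such as "New Zealand" come back as separate tokens. The following is a minimal sketch of one way to join them, assuming the model uses a `LOCATION` tag (as the common Stanford English models do). The helper name `merge_tagged_tokens` and the sample data are hypothetical and hand-written, not real tagger output.

```python
from itertools import groupby


def merge_tagged_tokens(tagged, wanted="LOCATION"):
    """Join runs of consecutive tokens that carry the wanted tag."""
    phrases = []
    # groupby yields one group per run of adjacent pairs with the same tag
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag == wanted:
            phrases.append(" ".join(token for token, _ in run))
    return phrases


# Hand-written sample in the (token, tag) shape that st.tag() returns
sample = [
    ("She", "O"), ("moved", "O"), ("from", "O"),
    ("New", "LOCATION"), ("Zealand", "LOCATION"),
    ("to", "O"), ("Australia", "LOCATION"), (".", "O"),
]
print(merge_tagged_tokens(sample))  # ['New Zealand', 'Australia']
```

One limitation of this approach: two different locations that appear directly next to each other, with no other token between them, would be merged into a single phrase.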
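Here is an example using geograpy, which relies on NLTK to perform entity extraction. It stores places and locations in a gazetteer, then looks the extracted names up in that gazetteer to find the relevant places. See the documentation for more usage details: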

from geograpy import extraction

# Adjacent string literals inside parentheses are joined into one string
text = (
    "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital "
    "inflammation that can lead to disfigurement and blindness. Multiple "
    "genetic loci have been associated with Graves' disease, but the genetic "
    "basis for TO is largely unknown. This study aimed to identify loci "
    "associated with TO in individuals with Graves' disease, using a "
    "genome-wide association scan (GWAS) for the first time to our knowledge "
    "in TO.Genome-wide association scan was performed on pooled DNA from an "
    "Australian Caucasian discovery cohort of 265 participants with Graves' "
    "disease and TO (cases) and 147 patients with Graves' disease without TO "
    "(controls)."
)

e = extraction.Extractor(text=text)
e.find_entities()
# places is a list attribute filled in by find_entities(), not a method
print(e.places)

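So, after these helpful comments, I dug deeper into different NER tools to find the best way to identify mentions of nationalities and countries, and found that spaCy has a NORP entity type (nationalities or religious or political groups) that extracts nationalities effectively.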
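As a minimal sketch of that approach, the snippet below keeps only the `GPE` and `NORP` entities. The helper names `group_by_label` and `extract_with_spacy` are hypothetical, and the sketch assumes the `en_core_web_sm` model is installed. The grouping step is kept separate from spaCy so that it can be checked on plain `(text, label)` pairs.

```python
def group_by_label(entities, wanted=("GPE", "NORP")):
    """Collect unique entity texts per label, keeping first-seen order."""
    grouped = {label: [] for label in wanted}
    for text, label in entities:
        if label in grouped and text not in grouped[label]:
            grouped[label].append(text)
    return grouped


def extract_with_spacy(text, model="en_core_web_sm"):
    """Run spaCy NER on text and return its GPE and NORP entities."""
    import spacy  # imported here so group_by_label works without spaCy

    nlp = spacy.load(model)
    doc = nlp(text)
    return group_by_label((ent.text, ent.label_) for ent in doc.ents)
```

Calling `extract_with_spacy(abstract)` would return a dict with one list for countries, cities, and states (`GPE`) and one for nationalities and similar groups (`NORP`). What the model actually finds depends on the model version, so no output is shown here.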

You can use spaCy for NER. It gives better results than NLTK:

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Apple is opening its first big office in San Francisco and California.")
print([(ent.text, ent.label_) for ent in doc.ents])

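Comments:

He has already done entity extraction, perhaps without realizing it. Your answer just gives him a list of tagged words; it doesn't even give him a list of GPEs. Please edit your answer.

You're not doing anything wrong. You performed entity extraction, then took the entity chunks and searched them for the GPE label. The reason you're unhappy with the NLTK results is that NLTK generally performs poorly at classifying entities. There are lookup tables available for GPEs, and they are quite comprehensive and efficient. Use them instead of relying on NLTK.

Thanks, can you give me an example of these lookup tables? I actually tried to install geograpy but it failed, which is why I rely on NLTK.

Same problem, I can't install geograpy :(

Please install NLTK before installing geograpy, or install both together with
pip install geograpy nltk
which worked for me. The package is old, though; for Python 3 use pip3 install geograpy3.

spaCy is great and very powerful. I'd also suggest the Alchemy API, although for large volumes of data spaCy is the better choice because it imposes no per-query transaction cost.

As we know, spaCy tags locations as GPE. In my case I have two locations tagged as GPE (e.g. India and Delhi). My goal now is to determine which one is the city and which is the country. Please comment, @Renaud.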
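To illustrate what a lookup table (gazetteer) for this task might look like, here is a minimal sketch. The `GAZETTEER` dict is a tiny hand-written sample and `lookup_places` is a hypothetical helper; a real table would be loaded from a much larger data source. Because each entry records its kind, the same table can also tell a city from a country once a name has been tagged as GPE.

```python
import re

# Hypothetical sample: lowercase surface form -> (kind, country)
GAZETTEER = {
    "australia": ("country", "Australia"),
    "australian": ("nationality", "Australia"),
    "india": ("country", "India"),
    "indian": ("nationality", "India"),
    "delhi": ("city", "India"),
}


def lookup_places(text, gazetteer=GAZETTEER):
    """Return (word, kind, country) for each word found in the gazetteer."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group()
        entry = gazetteer.get(word.lower())
        if entry:
            kind, country = entry
            hits.append((word, kind, country))
    return hits


print(lookup_places("an Australian Caucasian discovery cohort"))
# [('Australian', 'nationality', 'Australia')]
```

This sketch matches single words only; multi-word names such as "New Zealand" would need phrase matching, and ambiguous words would still need context to resolve.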