Python: find all locations/cities/places in a text

If I have a text containing a newspaper article in Catalan, how can I find all the cities in that text?

I have been looking at the nltk package for python, and I have downloaded the corpus for Catalan (nltk.corpus.cess_cat).

What I have right now:
I have installed everything necessary from nltk.download(). An example of what I have so far:

te = nltk.word_tokenize('Tots els gats son de Sant Cugat del Valles.')

nltk.pos_tag(te)
The city here is "Sant Cugat del Valles". What I get from the output is:

[('Tots', 'NNS'),
 ('els', 'NNS'),
 ('gats', 'NNS'),
 ('son', 'VBP'),
 ('de', 'IN'),
 ('Sant', 'NNP'),
 ('Cugat', 'NNP'),
 ('del', 'NN'),
 ('Valles', 'NNP')]
NNP seems to indicate nouns whose first letter is uppercase. Is there a way to get only places or cities, rather than all proper nouns?
Thanks
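
(For context: NLTK's built-in named-entity chunker can already do this for English, labelling geo-political entities as GPE, but no comparable model ships for Catalan, which is why the answers below build their own lookup. A minimal sketch for English, assuming the usual NLTK resources such as punkt, averaged_perceptron_tagger, maxent_ne_chunker and words have been downloaded:)

    import nltk

    sent = "London is the capital of the United Kingdom."
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
    # GPE (geo-political entity) subtrees cover cities and countries
    for subtree in tree.subtrees(lambda t: t.label() == 'GPE'):
        print(' '.join(word for word, tag in subtree.leaves()))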

You don't need to use NLTK for this. Instead, do the following:

  1. Split the text into a list containing all its words.
  2. Put the cities into a dictionary, e.g. {"Sant Cugat del Valles": ["Sant", "Cugat", "del", "Valles"]}. It should be easy to find a list of all the cities in the region online or from the local government.
  3. Iterate over the elements of the text as a list.

     3.1. Iterate over the cities; if the first word of a city matches the current element of the text, then check the next element.

  4. Here is a runnable code example:

    text = 'Tots els gats son de Sant Cugat del Valles.'
    # Prepare your text: remove "." (and other unnecessary marks),
    # then split it into a list of words.
    text = text.replace('.', '').split(' ')

    # Insert the cities you want to search for.
    cities = {"Sant Cugat del Valles": ["Sant", "Cugat", "del", "Valles"]}

    city_test = ''       # city words matched so far
    found_match = False
    for word in text:
        if not found_match:
            city_test = ''   # the match sequence broke, so start over
        found_match = False
        for city in cities:
            if word in cities[city]:
                city_test += word + ' '
                found_match = True
            # all of the city's words matched in a row -> found it
            if city_test.split(' ')[0:-1] == city.split(' '):
                print(city)  # print if it found a city
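
Running this prints Sant Cugat del Valles. If the city list is small and matching word by word feels like overkill, a plain substring search over the cleaned text is a simpler variant (a sketch, not from the original answer; the city_names list is hypothetical):

    text = 'Tots els gats son de Sant Cugat del Valles.'
    cleaned = text.replace('.', '')          # strip punctuation first
    city_names = ["Sant Cugat del Valles", "Barcelona"]
    print([c for c in city_names if c in cleaned])  # ['Sant Cugat del Valles']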
    
Either you do that, or you make your own gazetteer.

I made a simple gazetteer and used it for tasks like yours:

    # -*- coding: utf-8 -*-
    import codecs
    import os
    import re
    
    from nltk.chunk.util import conlltags2tree
    from nltk.chunk import ChunkParserI
    from nltk.tag import pos_tag
    from nltk.tokenize import wordpunct_tokenize
    
    
    def sub_leaves(tree, node):
        # return the leaves of every subtree labelled `node` (e.g. "LOC")
        return [t.leaves() for t in tree.subtrees(lambda s: s.label() == node)]
    
    
    class Gazetteer(ChunkParserI):
        """
        Find and annotate words that match the given patterns.
        Patterns are given as a list of tuples; each tuple holds a regular
        expression and the IOB tag to apply to its matches.
        Part-of-speech tagging should be performed before the gazetteer
        words are applied, so you have to pass your tagger as a parameter.
        Example:
            >>> patterns = [(u"Αθήνα[ς]?", "LOC"), (u"Νομική[ς]? [Σσ]χολή[ς]?", "ORG")]
            >>> gazetteer = Gazetteer(patterns, nltk.pos_tag, nltk.wordpunct_tokenize)
            >>> text = u"Η Νομική σχολή της Αθήνας"
            >>> t = gazetteer.parse(text)
            >>> print(t)
            ... (S Η/DT (ORG Νομική/NN σχολή/NN) της/DT (LOC Αθήνας/NN))
        """
    
        def __init__(self, patterns, pos_tagger, tokenizer):
            """
            Initialize the class.
    
            :param patterns:
                A list of tuples, each with a regular expression to search
                for in the text and the tag to apply to its matches
            :param pos_tagger:
                The tagger to use for applying part of speech to the text
            :param tokenizer:
                The tokenizer to use for tokenizing the text
            """
            self.patterns = patterns
            self.pos_tag = pos_tagger
            self.tokenize = tokenizer
            self.lookahead = 0  # how many extra words a gazetteer entry may span
            self.words = []  # the words found by applying the regular expressions
            self.iobtags = []  # for each set of words, the corresponding tag
    
        def iob_tags(self, tagged_sent):
            """
            Search the tagged sentences for gazetteer words and apply their iob tags.
    
            :param tagged_sent:
                A tokenized text with part of speech tags
            :type tagged_sent: list
            :return:
                yields each word with its IOB tag, e.g. B-LOCATION
            :rtype:
            """
            i = 0
            l = len(tagged_sent)
            inside = False  # marks the I- tag
            iobs = []
    
            while i < l:
                word, pos_tag = tagged_sent[i]
                j = i + 1  # the next word
                k = j + self.lookahead  # how many words in a row we may search
                nextwords, nexttags = [], []  # for now, just the ith word
                add_tag = False  # no tag, this is O
    
                while j <= k:
                    words = ' '.join([word] + nextwords)  # expand our word list
                    if words in self.words:  # search for words
                        index = self.words.index(words)  # keep index to use for iob tags
                        if inside:
                            iobs.append((word, pos_tag, 'I-' + self.iobtags[index]))  # use the index tag
                        else:
                            iobs.append((word, pos_tag, 'B-' + self.iobtags[index]))
    
                        for nword, ntag in zip(nextwords, nexttags):  # there was more than one word
                            iobs.append((nword, ntag, 'I-' + self.iobtags[index]))  # apply I- tag to all of them
    
                        add_tag, inside = True, True
                        i = j  # skip tagged words
                        break
    
                    if j < l:  # we haven't reached the end of the tagged sentence
                        nextword, nexttag = tagged_sent[j]  # get the next word and its tag
                        nextwords.append(nextword)
                        nexttags.append(nexttag)
                        j += 1
                    else:
                        break
    
                if not add_tag:  # unknown word
                    inside = False
                    i += 1
                    iobs.append((word, pos_tag, 'O'))  # it's an Outsider
    
            return iobs
    
        def parse(self, text, conlltags=True):
            """
            Given a text, applies tokenization, part of speech tagging and the
            gazetteer words with their tags. Returns a CoNLL tree.
    
            :param text: The text to parse
            :type text: str
            :param conlltags:
            :type conlltags:
            :return: A CoNLL tree
            :rtype:
            """
            # apply the regular expressions and find all the
            # gazetteer words in text
            for pattern, tag in self.patterns:
                words_found = set(re.findall(pattern, text))  # keep the unique words
                if len(words_found) > 0:
                    for word in words_found:  # words_found may be more than one
                        self.words.append(word)  # keep the words
                        self.iobtags.append(tag)  # and their tag
    
            # find the pattern with the maximum words.
            # this will be the look ahead variable
            for word in self.words:  # don't care about tags now
                nwords = word.count(' ')
                if nwords > self.lookahead:
                    self.lookahead = nwords
    
            # tokenize and apply part of speech tagging
            tagged_sent = self.pos_tag(self.tokenize(text))
            # find the iob tags
            iobs = self.iob_tags(tagged_sent)
    
            if conlltags:
                return conlltags2tree(iobs)
            else:
                return iobs
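
    # For illustration (not part of the original answer): the IOB triples that
    # iob_tags() yields for the docstring example would look roughly like
    #   [('Η', 'DT', 'O'),
    #    ('Νομική', 'NN', 'B-ORG'), ('σχολή', 'NN', 'I-ORG'),
    #    ('της', 'DT', 'O'),
    #    ('Αθήνας', 'NN', 'B-LOC')]
    # conlltags2tree() then folds the B-/I- runs into ORG and LOC subtrees.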
    
    
    if __name__ == "__main__":
        patterns = [(u"Αθήνα[ς]?", "LOC"), (u"Νομική[ς]? [Σσ]χολή[ς]?", "ORG")]
        g = Gazetteer(patterns, pos_tag, wordpunct_tokenize)
        text = u"Η Νομική σχολή της Αθήνας"
        t = g.parse(text)
        print(t)
    
    
        dir_with_lists = "Lists"
        patterns = []
        tags = []
        for root, dirs, files in os.walk(dir_with_lists):
            for f in files:
                lines = codecs.open(os.path.join(root, f), 'r', 'utf-8').readlines()
                tag = os.path.splitext(f)[0]
                for l in lines[1:]:
                    patterns.append((l.rstrip(), tag))
                    tags.append(tag)
    
        text = codecs.open("sample.txt", 'r', "utf-8").read()
        g = Gazetteer(patterns, pos_tag, wordpunct_tokenize)  # rebuild with the file-based patterns
        t = g.parse(text.lower())
        print(t)
    
        for tag in set(tags):
            for gaz_word in sub_leaves(t, tag):
                print(gaz_word[0][0], tag)
    

In the latter part of the code, I read files from a directory named Lists (put it in the folder where the code above lives). The name of each file becomes the gazetteer's tag: so, create a file like LOC.txt with patterns for locations (the LOC tag), PERSON.txt for persons, and so on.
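
A minimal sketch of what such a list file could contain (a hypothetical Lists/LOC.txt; note that the loader above skips the first line via lines[1:], so it can serve as a header):

    locations            (header line, skipped by the loader)
    Αθήνα[ς]?
    Sant Cugat del Valles

Each remaining line is used as a regular expression and tagged with the file's base name, here LOC.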

You can use the geotext python library for the same task:

    pip install geotext
    
Then all it takes is:

    from geotext import GeoText
    places = GeoText("London is a great city")
    print(places.cities)  # ['London']
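
Besides cities, the same GeoText object exposes countries and country_mentions (this follows geotext's documented API; a quick sketch worth checking against the version you install):

    from geotext import GeoText

    places = GeoText("Paris and London are capitals, and so is Madrid.")
    print(places.cities)            # ['Paris', 'London', 'Madrid']
    print(places.country_mentions)  # e.g. OrderedDict([('FR', 1), ('GB', 1), ('ES', 1)])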