Python nltk:如何防止专有名词词干的产生

Python nltk:如何防止专有名词词干的产生,python,nlp,nltk,stanford-nlp,stemming,Python,Nlp,Nltk,Stanford Nlp,Stemming,我正试图用Stanford POS taggers和NER编写一个关键词提取程序。对于关键词提取,我只对专有名词感兴趣。以下是基本方法 通过删除除字母以外的任何内容来清理数据 删除停止字 词干 确定每个单词的词性标记 如果POS标签是一个名词,则将其输入NER 然后,NER将确定该词是个人、组织还是地点 示例代码 docText="'Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in Londo

我正试图用Stanford POS taggers和NER编写一个关键词提取程序。对于关键词提取,我只对专有名词感兴趣。以下是基本方法

  • 通过删除除字母以外的任何内容来清理数据
  • 删除停止字
  • 词干
  • 确定每个单词的词性标记
  • 如果POS标签是一个名词,则将其输入NER
  • 然后,NER将确定该词是个人、组织还是地点
  • 示例代码

    docText="'Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London"
    
    words = re.split("\W+",docText) 
    
    stops = set(stopwords.words("english"))
    
    #remove stop words from the list
    words = [w for w in words if w not in stops and len(w) > 2]
    
    # Stemming
    pstem = PorterStemmer()
    
    words = [pstem.stem(w) for w in words]    
    
    nounsWeWant = set(['NN' ,'NNS', 'NNP', 'NNPS'])
    
    finalWords = []
    
    stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') 
    stp = StanfordPOSTagger('english-bidirectional-distsim.tagger') 
    
    for w in words:
        if stp.tag([w.lower()])[0][1] not in nounsWeWant:
            finalWords.append(w.lower())
        else:
            finalWords.append(w)
    
    finalString = " ".join(finalWords)
    print finalString
    
    tagged = stn.tag(finalWords)
    print tagged
    
    这让我

    Jack Frost work Boe Compani manag aircraft crew London
    [(u'Jack', u'PERSON'), (u'Frost', u'PERSON'), (u'work', u'O'), (u'Boe', u'O'), (u'Compani', u'O'), (u'manag', u'O'), (u'aircraft', u'O'), (u'crew', u'O'), (u'London', u'LOCATION')]
    
    所以很明显,我不希望波音公司被阻止。也不是公司。我需要对单词进行词干处理,因为我的输入可能包含
    Performing
    之类的术语。我已经看到,像
    Performing
    这样的词将被NER作为专有名词使用,因此可以归类为
    组织
    。因此,首先,我将所有单词词干转换成小写。然后我检查这个词的词性标签是否是名词。如果是这样,我会保持原样。如果没有,我将该单词转换为小写,并将其添加到最终的单词列表中,该列表将传递给NER


    有没有关于如何避免词干专有名词的想法?

    使用完整的斯坦福CoreNLP管道来处理NLP工具链。避免使用您自己的标记器、清洁器、POS标记器等。它不会很好地使用NER工具

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip
    unzip http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip
    cd stanford-corenlp-full-2015-12-09
    echo "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London" > test.txt
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt
    cat test.txt.out 
    
    [out]:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
    <root>
      <document>
        <sentences>
          <sentence id="1">
            <tokens>
              <token id="1">
                <word>Jack</word>
                <lemma>Jack</lemma>
                <CharacterOffsetBegin>0</CharacterOffsetBegin>
                <CharacterOffsetEnd>4</CharacterOffsetEnd>
                <POS>NNP</POS>
                <NER>PERSON</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="2">
                <word>Frost</word>
                <lemma>Frost</lemma>
                <CharacterOffsetBegin>5</CharacterOffsetBegin>
                <CharacterOffsetEnd>10</CharacterOffsetEnd>
                <POS>NNP</POS>
                <NER>PERSON</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="3">
                <word>works</word>
                <lemma>work</lemma>
                <CharacterOffsetBegin>11</CharacterOffsetBegin>
                <CharacterOffsetEnd>16</CharacterOffsetEnd>
                <POS>VBZ</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="4">
                <word>for</word>
                <lemma>for</lemma>
                <CharacterOffsetBegin>17</CharacterOffsetBegin>
                <CharacterOffsetEnd>20</CharacterOffsetEnd>
                <POS>IN</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="5">
                <word>Boeing</word>
                <lemma>Boeing</lemma>
                <CharacterOffsetBegin>21</CharacterOffsetBegin>
                <CharacterOffsetEnd>27</CharacterOffsetEnd>
                <POS>NNP</POS>
                <NER>ORGANIZATION</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="6">
                <word>Company</word>
                <lemma>Company</lemma>
                <CharacterOffsetBegin>28</CharacterOffsetBegin>
                <CharacterOffsetEnd>35</CharacterOffsetEnd>
                <POS>NNP</POS>
                <NER>ORGANIZATION</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="7">
                <word>.</word>
                <lemma>.</lemma>
                <CharacterOffsetBegin>35</CharacterOffsetBegin>
                <CharacterOffsetEnd>36</CharacterOffsetEnd>
                <POS>.</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
            </tokens>
            <parse>(ROOT (S (NP (NNP Jack) (NNP Frost)) (VP (VBZ works) (PP (IN for) (NP (NNP Boeing) (NNP Company)))) (. .))) </parse>
            <dependencies type="basic-dependencies">
              <dep type="root">
                <governor idx="0">ROOT</governor>
                <dependent idx="3">works</dependent>
              </dep>
              <dep type="compound">
                <governor idx="2">Frost</governor>
                <dependent idx="1">Jack</dependent>
              </dep>
              <dep type="nsubj">
                <governor idx="3">works</governor>
                <dependent idx="2">Frost</dependent>
              </dep>
              <dep type="case">
                <governor idx="6">Company</governor>
                <dependent idx="4">for</dependent>
              </dep>
              <dep type="compound">
                <governor idx="6">Company</governor>
                <dependent idx="5">Boeing</dependent>
              </dep>
              <dep type="nmod">
                <governor idx="3">works</governor>
                <dependent idx="6">Company</dependent>
              </dep>
              <dep type="punct">
                <governor idx="3">works</governor>
                <dependent idx="7">.</dependent>
              </dep>
            </dependencies>
            <dependencies type="collapsed-dependencies">
              <dep type="root">
                <governor idx="0">ROOT</governor>
                <dependent idx="3">works</dependent>
              </dep>
              <dep type="compound">
                <governor idx="2">Frost</governor>
                <dependent idx="1">Jack</dependent>
              </dep>
              <dep type="nsubj">
                <governor idx="3">works</governor>
                <dependent idx="2">Frost</dependent>
              </dep>
              <dep type="case">
                <governor idx="6">Company</governor>
                <dependent idx="4">for</dependent>
              </dep>
              <dep type="compound">
                <governor idx="6">Company</governor>
                <dependent idx="5">Boeing</dependent>
              </dep>
              <dep type="nmod:for">
                <governor idx="3">works</governor>
                <dependent idx="6">Company</dependent>
              </dep>
              <dep type="punct">
                <governor idx="3">works</governor>
                <dependent idx="7">.</dependent>
              </dep>
            </dependencies>
            <dependencies type="collapsed-ccprocessed-dependencies">
              <dep type="root">
                <governor idx="0">ROOT</governor>
                <dependent idx="3">works</dependent>
              </dep>
              <dep type="compound">
                <governor idx="2">Frost</governor>
                <dependent idx="1">Jack</dependent>
              </dep>
              <dep type="nsubj">
                <governor idx="3">works</governor>
                <dependent idx="2">Frost</dependent>
              </dep>
              <dep type="case">
                <governor idx="6">Company</governor>
                <dependent idx="4">for</dependent>
              </dep>
              <dep type="compound">
                <governor idx="6">Company</governor>
                <dependent idx="5">Boeing</dependent>
              </dep>
              <dep type="nmod:for">
                <governor idx="3">works</governor>
                <dependent idx="6">Company</dependent>
              </dep>
              <dep type="punct">
                <governor idx="3">works</governor>
                <dependent idx="7">.</dependent>
              </dep>
            </dependencies>
          </sentence>
          <sentence id="2">
            <tokens>
              <token id="1">
                <word>He</word>
                <lemma>he</lemma>
                <CharacterOffsetBegin>37</CharacterOffsetBegin>
                <CharacterOffsetEnd>39</CharacterOffsetEnd>
                <POS>PRP</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="2">
                <word>manages</word>
                <lemma>manage</lemma>
                <CharacterOffsetBegin>40</CharacterOffsetBegin>
                <CharacterOffsetEnd>47</CharacterOffsetEnd>
                <POS>VBZ</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="3">
                <word>5</word>
                <lemma>5</lemma>
                <CharacterOffsetBegin>48</CharacterOffsetBegin>
                <CharacterOffsetEnd>49</CharacterOffsetEnd>
                <POS>CD</POS>
                <NER>NUMBER</NER>
                <NormalizedNER>5.0</NormalizedNER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="4">
                <word>aircraft</word>
                <lemma>aircraft</lemma>
                <CharacterOffsetBegin>50</CharacterOffsetBegin>
                <CharacterOffsetEnd>58</CharacterOffsetEnd>
                <POS>NN</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="5">
                <word>and</word>
                <lemma>and</lemma>
                <CharacterOffsetBegin>59</CharacterOffsetBegin>
                <CharacterOffsetEnd>62</CharacterOffsetEnd>
                <POS>CC</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="6">
                <word>their</word>
                <lemma>they</lemma>
                <CharacterOffsetBegin>63</CharacterOffsetBegin>
                <CharacterOffsetEnd>68</CharacterOffsetEnd>
                <POS>PRP$</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="7">
                <word>crew</word>
                <lemma>crew</lemma>
                <CharacterOffsetBegin>69</CharacterOffsetBegin>
                <CharacterOffsetEnd>73</CharacterOffsetEnd>
                <POS>NN</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="8">
                <word>in</word>
                <lemma>in</lemma>
                <CharacterOffsetBegin>74</CharacterOffsetBegin>
                <CharacterOffsetEnd>76</CharacterOffsetEnd>
                <POS>IN</POS>
                <NER>O</NER>
                <Speaker>PER0</Speaker>
              </token>
              <token id="9">
                <word>London</word>
                <lemma>London</lemma>
                <CharacterOffsetBegin>77</CharacterOffsetBegin>
                <CharacterOffsetEnd>83</CharacterOffsetEnd>
                <POS>NNP</POS>
                <NER>LOCATION</NER>
                <Speaker>PER0</Speaker>
              </token>
            </tokens>
            <parse>(ROOT (S (NP (PRP He)) (VP (VBZ manages) (NP (NP (CD 5) (NN aircraft)) (CC and) (NP (NP (PRP$ their) (NN crew)) (PP (IN in) (NP (NNP London)))))))) </parse>
            <dependencies type="basic-dependencies">
              <dep type="root">
                <governor idx="0">ROOT</governor>
                <dependent idx="2">manages</dependent>
              </dep>
              <dep type="nsubj">
                <governor idx="2">manages</governor>
                <dependent idx="1">He</dependent>
              </dep>
              <dep type="nummod">
                <governor idx="4">aircraft</governor>
                <dependent idx="3">5</dependent>
              </dep>
              <dep type="dobj">
                <governor idx="2">manages</governor>
                <dependent idx="4">aircraft</dependent>
              </dep>
              <dep type="cc">
                <governor idx="4">aircraft</governor>
                <dependent idx="5">and</dependent>
              </dep>
              <dep type="nmod:poss">
                <governor idx="7">crew</governor>
                <dependent idx="6">their</dependent>
              </dep>
              <dep type="conj">
                <governor idx="4">aircraft</governor>
                <dependent idx="7">crew</dependent>
              </dep>
              <dep type="case">
                <governor idx="9">London</governor>
                <dependent idx="8">in</dependent>
              </dep>
              <dep type="nmod">
                <governor idx="7">crew</governor>
                <dependent idx="9">London</dependent>
              </dep>
            </dependencies>
            <dependencies type="collapsed-dependencies">
              <dep type="root">
                <governor idx="0">ROOT</governor>
                <dependent idx="2">manages</dependent>
              </dep>
              <dep type="nsubj">
                <governor idx="2">manages</governor>
                <dependent idx="1">He</dependent>
              </dep>
              <dep type="nummod">
                <governor idx="4">aircraft</governor>
                <dependent idx="3">5</dependent>
              </dep>
              <dep type="dobj">
                <governor idx="2">manages</governor>
                <dependent idx="4">aircraft</dependent>
              </dep>
              <dep type="cc">
                <governor idx="4">aircraft</governor>
                <dependent idx="5">and</dependent>
              </dep>
              <dep type="nmod:poss">
                <governor idx="7">crew</governor>
                <dependent idx="6">their</dependent>
              </dep>
              <dep type="conj:and">
                <governor idx="4">aircraft</governor>
                <dependent idx="7">crew</dependent>
              </dep>
              <dep type="case">
                <governor idx="9">London</governor>
                <dependent idx="8">in</dependent>
              </dep>
              <dep type="nmod:in">
                <governor idx="7">crew</governor>
                <dependent idx="9">London</dependent>
              </dep>
            </dependencies>
            <dependencies type="collapsed-ccprocessed-dependencies">
              <dep type="root">
                <governor idx="0">ROOT</governor>
                <dependent idx="2">manages</dependent>
              </dep>
              <dep type="nsubj">
                <governor idx="2">manages</governor>
                <dependent idx="1">He</dependent>
              </dep>
              <dep type="nummod">
                <governor idx="4">aircraft</governor>
                <dependent idx="3">5</dependent>
              </dep>
              <dep type="dobj">
                <governor idx="2">manages</governor>
                <dependent idx="4">aircraft</dependent>
              </dep>
              <dep type="cc">
                <governor idx="4">aircraft</governor>
                <dependent idx="5">and</dependent>
              </dep>
              <dep type="nmod:poss">
                <governor idx="7">crew</governor>
                <dependent idx="6">their</dependent>
              </dep>
              <dep type="dobj" extra="true">
                <governor idx="2">manages</governor>
                <dependent idx="7">crew</dependent>
              </dep>
              <dep type="conj:and">
                <governor idx="4">aircraft</governor>
                <dependent idx="7">crew</dependent>
              </dep>
              <dep type="case">
                <governor idx="9">London</governor>
                <dependent idx="8">in</dependent>
              </dep>
              <dep type="nmod:in">
                <governor idx="7">crew</governor>
                <dependent idx="9">London</dependent>
              </dep>
            </dependencies>
          </sentence>
        </sentences>
        <coreference>
          <coreference>
            <mention representative="true">
              <sentence>1</sentence>
              <start>1</start>
              <end>3</end>
              <head>2</head>
              <text>Jack Frost</text>
            </mention>
            <mention>
              <sentence>2</sentence>
              <start>1</start>
              <end>2</end>
              <head>1</head>
              <text>He</text>
            </mention>
          </coreference>
        </coreference>
      </document>
    </root>
    
    如果您确实需要python包装器,请参阅

    可能这就是您想要的输出:

    >>> " ".join(token['lemma'] for token in annotated_sent0['tokens'])
    Jack Frost work for Boeing Company
    >>> " ".join(token['word'] for token in annotated_sent0['tokens'])
    Jack Frost works for Boeing Company
    

    如果您想要一个NLTK附带的包装器,那么您只需稍等片刻,直到问题得到解决;P

    一个办法是检查英语词典中的所有名词。专有名词不太可能出现在词典中。在NLP中,这个问题通常被称为错误传播。为什么要对词干词进行NER?通常,这会给你错误的输出。请注意,lemmatizer(语言学上)会比你想要应用的词干分析器更好。从计算上来说,它稍微贵一点,但它仍然是工具链的一部分,不妨使用它。
    $ cd stanford-corenlp-full-2015-12-09
    $ export CLASSPATH=protobuf.jar:joda-time.jar:jollyday.jar:xom-1.2.10.jar:stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:slf4j-api.jar 
    $ java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer &
    cd
    $ git clone https://github.com/smilli/py-corenlp.git
    $ cd py-corenlp
    $ python
    >>> from corenlp import StanfordCoreNLP
    >>> nlp = StanfordCoreNLP('http://localhost:9000')
    >>> text = ("Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London")
    >>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner',  'outputFormat': 'json'})
    >>> output
    {u'sentences': [{u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 0, u'tokens': [{u'index': 1, u'word': u'Jack', u'lemma': u'Jack', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 4, u'characterOffsetBegin': 0, u'originalText': u'Jack', u'ner': u'PERSON', u'before': u''}, {u'index': 2, u'word': u'Frost', u'lemma': u'Frost', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 10, u'characterOffsetBegin': 5, u'originalText': u'Frost', u'ner': u'PERSON', u'before': u' '}, {u'index': 3, u'word': u'works', u'lemma': u'work', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 16, u'characterOffsetBegin': 11, u'originalText': u'works', u'ner': u'O', u'before': u' '}, {u'index': 4, u'word': u'for', u'lemma': u'for', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 20, u'characterOffsetBegin': 17, u'originalText': u'for', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'Boeing', u'lemma': u'Boeing', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 27, u'characterOffsetBegin': 21, u'originalText': u'Boeing', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 6, u'word': u'Company', u'lemma': u'Company', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 35, u'characterOffsetBegin': 28, u'originalText': u'Company', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 7, u'word': u'.', u'lemma': u'.', u'after': u' ', u'pos': u'.', u'characterOffsetEnd': 36, u'characterOffsetBegin': 35, u'originalText': u'.', u'ner': u'O', u'before': u''}]}, {u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 1, u'tokens': [{u'index': 1, u'word': u'He', u'lemma': u'he', u'after': u' ', u'pos': u'PRP', u'characterOffsetEnd': 39, u'characterOffsetBegin': 37, u'originalText': u'He', u'ner': u'O', u'before': u' '}, {u'index': 2, u'word': u'manages', u'lemma': u'manage', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 47, u'characterOffsetBegin': 40, u'originalText': u'manages', u'ner': u'O', u'before': u' '}, {u'index': 3, u'after': u' ', u'word': u'5', u'lemma': u'5', u'normalizedNER': u'5.0', u'pos': u'CD', u'characterOffsetEnd': 49, u'characterOffsetBegin': 48, u'originalText': u'5', u'ner': u'NUMBER', u'before': u' '}, {u'index': 4, u'word': u'aircraft', u'lemma': u'aircraft', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 58, u'characterOffsetBegin': 50, u'originalText': u'aircraft', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'and', u'lemma': u'and', u'after': u' ', u'pos': u'CC', u'characterOffsetEnd': 62, u'characterOffsetBegin': 59, u'originalText': u'and', u'ner': u'O', u'before': u' '}, {u'index': 6, u'word': u'their', u'lemma': u'they', u'after': u' ', u'pos': u'PRP$', u'characterOffsetEnd': 68, u'characterOffsetBegin': 63, u'originalText': u'their', u'ner': u'O', u'before': u' '}, {u'index': 7, u'word': u'crew', u'lemma': u'crew', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 73, u'characterOffsetBegin': 69, u'originalText': u'crew', u'ner': u'O', u'before': u' '}, {u'index': 8, u'word': u'in', u'lemma': u'in', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 76, u'characterOffsetBegin': 74, u'originalText': u'in', u'ner': u'O', u'before': u' '}, {u'index': 9, u'word': u'London', u'lemma': u'London', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 83, u'characterOffsetBegin': 77, u'originalText': u'London', u'ner': u'LOCATION', u'before': u' '}]}]}
    >>> annotated_sent0 = output['sentences'][0]
    >>> for token in annotated_sent0['tokens']:
    ...     print token['word'], token['lemma'], token['pos'], token['ner']
    ... 
    Jack Jack NNP PERSON
    Frost Frost NNP PERSON
    works work VBZ O
    for for IN O
    Boeing Boeing NNP ORGANIZATION
    Company Company NNP ORGANIZATION
    . . . O
    
    >>> " ".join(token['lemma'] for token in annotated_sent0['tokens'])
    Jack Frost work for Boeing Company
    >>> " ".join(token['word'] for token in annotated_sent0['tokens'])
    Jack Frost works for Boeing Company