How does the Python spaCy lemmatizer work?

Tags: python, nlp, wordnet, spacy, lemmatization

For lemmatization, spaCy has lists of words: adjectives, adverbs, verbs... and also lists for exceptions: adverbs... For the regular words there is a set of rules.

Let's take the word "wider" as an example.

Since it is an adjective, the rule for lemmatization should be taken from this list:

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
] 
As far as I understand, the process goes like this:

1) Get the POS tag of the word to know whether it is a noun, a verb, etc.
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.

Now, how is it decided to use "er" -> "e" instead of "er" -> "" to get "wide" and not "wid"?


Short answer:

spaCy checks whether the lemma it is trying to generate is in the known list of words or exceptions for that part of speech.

Long answer:

Check out that file, specifically the lemmatize function at the bottom:

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # Irregular words: look up the string in the exceptions table first
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            # Rewrite the matching suffix, e.g. "wider" -("er" -> "e")-> "wide"
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                # Keep the form only if it is a known word (or non-alphabetic)
                forms.append(form)
            else:
                oov_forms.append(form)
    # Fall back to out-of-vocabulary forms, then to the word itself
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
For English adjectives, for example, it takes in the string we are evaluating, the index of known adjectives, the exceptions, and the rules, as you have quoted them (for the English model).

The first thing we do in lemmatize, after lowercasing the string, is check whether the string is in our list of known exceptions, which includes lemma rules for words like "worse" -> "bad".

Then we go through our rules and apply each one to the string if it is applicable. For the word wider, we would apply the following rules:

["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
and we would output the following forms:

["wid", "wide"]

Then, we check whether each form is in our index of known adjectives. If it is, we append it to forms. Otherwise, we add it to oov_forms, which I would guess is short for out of vocabulary. wide is in the index, so it gets added; wid is added to oov_forms.

Finally, we return a set of the lemmas found, or any lemmas that matched rules but were not in our index, or just the word itself.
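As a minimal sketch of this walkthrough, here is the lemmatize function quoted earlier (reproduced so the example runs standalone), called with a toy index that knows only "wide":

```python
# The lemmatize function quoted above, reproduced for a standalone example.
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
index = {"wide"}  # toy index of known adjectives

# "er" -> ""  produces "wid"  (not in the index -> oov_forms)
# "er" -> "e" produces "wide" (in the index     -> forms)
print(lemmatize("wider", index, {}, ADJECTIVE_RULES))  # {'wide'}
```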


The lemmatize call you posted above works for wide, because wide is in the word index. Try something like "He is blandier than me." spaCy will tag blandier (a word I made up) as an adjective, but it is not in the index, so it will just return blandier as the lemma.

Each word type (adjective, noun, verb, adverb) has its own set of rules and its own set of known words. The mapping, and the loading of the correct index, rules and exc (I believe exc stands for exceptions, e.g. irregular examples), happens in lemmatizer.py:

lemmas = lemmatize(string, self.index.get(univ_pos, {}),
                   self.exc.get(univ_pos, {}),
                   self.rules.get(univ_pos, []))
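With hypothetical toy tables keyed by universal POS tag (the names and contents here are invented for illustration), the lookup pattern above amounts to:

```python
# Hypothetical toy tables keyed by universal POS tag (contents invented).
INDEX = {"adj": {"wide", "bad"}, "noun": {"dog"}}
EXC = {"adj": {"worse": ["bad"]}, "noun": {}}
RULES = {
    "adj": [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]],
    "noun": [["s", ""]],
}

univ_pos = "adj"
index = INDEX.get(univ_pos, {})   # known words for this POS
exc = EXC.get(univ_pos, {})       # irregular forms for this POS
rules = RULES.get(univ_pos, [])   # suffix rules for this POS

print(sorted(index))  # ['bad', 'wide']
```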
All of the remaining logic is in the lemmatize function, and it is surprisingly short. We do the following:

1. If there is an exception (i.e. the word is irregular) for the provided string, use it and add it to the lemmatized forms

2. For each rule, in the order given for the selected word type, check whether it matches the given word. If it does, try to apply it.

   2a. If after applying the rule the word is in the list of known words (i.e. the index), add it to the lemmatized forms of the word

   2b. Otherwise, add the word to a separate list called oov_forms (here I believe oov stands for "out of vocabulary")

3. If we found at least one form using the rules above, we return the list of forms found; otherwise we return the oov_forms list.
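A quick sketch exercising these steps with toy data (the lemmatize function is the one quoted earlier, reproduced so the sketch runs standalone):

```python
# The quoted lemmatize function, annotated with the step numbers above.
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))        # step 1: exceptions
    oov_forms = []
    for old, new in rules:                          # step 2: suffix rules
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)                  # step 2a: known word
            else:
                oov_forms.append(form)              # step 2b: out of vocabulary
    if not forms:                                   # step 3: fall back
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

rules = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
print(lemmatize("worse", {"bad"}, {"worse": ["bad"]}, rules))  # {'bad'}  (step 1)
print(lemmatize("blue", {"bad"}, {}, rules))                   # {'blue'} (step 3)
```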
Let's start with the class definition:

The class

It starts by initializing 3 variables:

    class Lemmatizer(object):
        @classmethod
        def load(cls, path, index=None, exc=None, rules=None):
            return cls(index or {}, exc or {}, rules or {})
    
        def __init__(self, index, exceptions, rules):
            self.index = index
            self.exc = exceptions
            self.rules = rules
    
Now, looking at self.exc for English, we see that it points to tables where the files are loaded from the directory.

Why doesn't spaCy just read a file? Most probably because declaring strings in code is faster than streaming strings through I/O.


Where do these index, exceptions and rules come from? Looking closely, they all seem to come from the original Princeton WordNet.

Rules

Looking even closer, the rules are similar to the _morphy rules from nltk.

These rules originally come from the Morphy software.

Additionally, spacy includes some punctuation rules that are not from Princeton's Morphy:

    PUNCT_RULES = [
        ["“", "\""],
        ["”", "\""],
        ["\u2018", "'"],
        ["\u2019", "'"]
    ]
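As a sketch, these punctuation rules can be applied with the same suffix-rewriting scheme the lemmatize function uses (the helper name is mine):

```python
# The PUNCT_RULES quoted above, applied suffix-style as in lemmatize.
PUNCT_RULES = [
    ["\u201c", "\""],   # curly open double quote  -> straight "
    ["\u201d", "\""],   # curly close double quote -> straight "
    ["\u2018", "'"],    # curly open single quote  -> straight '
    ["\u2019", "'"],    # curly close single quote -> straight '
]

def normalize_punct(token):
    """Hypothetical helper: rewrite a matching suffix, else keep the token."""
    for old, new in PUNCT_RULES:
        if token.endswith(old):
            return token[:len(token) - len(old)] + new
    return token

print(normalize_punct("\u201c"), normalize_punct("\u2019"))
```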
    
Exceptions

As for the exceptions, they are stored in the *_irreg.py files in spacy, and it looks like they also come from the Princeton WordNet.

Evidently, if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. the copy you get if you download the wordnet package from nltk), we see the same lists:

    alvas@ubi:~/nltk_data/corpora/wordnet$ ls
    adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
    adv.exc       data.adj     data.verb  index.noun   lexnames    README
    citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
    alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
    1490 adj.exc
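Each line of a WordNet .exc file is an irregular inflected form followed by one or more lemmas. A rough sketch of loading such lines into the exceptions dict that lemmatize consumes (the sample entries here are illustrative):

```python
# Illustrative .exc-style lines: inflected form, then one or more lemmas.
sample_adj_exc = """\
better good
best good
worse bad
worst bad
"""

exceptions = {}
for line in sample_adj_exc.splitlines():
    parts = line.split()
    if parts:
        # First field is the inflected form; the rest are its lemmas.
        exceptions[parts[0]] = parts[1:]

print(exceptions["worse"])  # ['bad']
```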
    
Index

If we look at the spacy lemmatizer's index, we see that it also comes from WordNet, e.g. the re-distributed copy of WordNet in nltk:

    alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 
    
      1 This software and database is being provided to you, the LICENSEE, by  
      2 Princeton University under the following license.  By obtaining, using  
      3 and/or copying this software and database, you agree that you have  
      4 read, understood, and will comply with these terms and conditions.:  
      5   
      6 Permission to use, copy, modify and distribute this software and  
      7 database and its documentation for any purpose and without fee or  
      8 royalty is hereby granted, provided that you agree to comply with  
      9 the following copyright notice and statements, including the disclaimer,  
      10 and that the same appear on ALL copies of the software, database and  
      11 documentation, including modifications that you make for internal  
      12 use or for distribution.  
      13   
      14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
      15   
      16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
      17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
      18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
      19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
      20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
      21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
      22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
      23 OTHER RIGHTS.  
      24   
      25 The name of Princeton University or Princeton may not be used in  
      26 advertising or publicity pertaining to distribution of the software  
      27 and/or database.  Title to copyright in this software, database and  
      28 any associated documentation shall at all times remain with  
      29 Princeton University and LICENSEE agrees to preserve same.  
    00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
    00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
    00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
    00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
    00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
    00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
    00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
    00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
    00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
    00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
    00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels  
    

Given that the dictionaries, exceptions and rules the spacy lemmatizer uses are largely from Princeton WordNet and its Morphy software, we can move on to the actual implementation of how spacy applies the rules using the index and exceptions.

We go back to lemmatizer.py.

The main action comes from the lemmatize function rather than the Lemmatizer class:

    def lemmatize(string, index, exceptions, rules):
        string = string.lower()
        forms = []
        # TODO: Is this correct? See discussion in Issue #435.
        #if string in index:
        #    forms.append(string)
        forms.extend(exceptions.get(string, []))
        oov_forms = []
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    pass
                elif form in index or not form.isalpha():
                    forms.append(form)
                else:
                    oov_forms.append(form)
        if not forms:
            forms.extend(oov_forms)
        if not forms:
            forms.append(string)
        return set(forms)
    
Why is the lemmatize method outside of the Lemmatizer class? I am not exactly sure, but perhaps it is to ensure that the lemmatization function can be called outside of a class instance; but given that @staticmethod and @classmethod exist, perhaps there are other considerations as to why the function and the class have been decoupled.

Morphy vs spacy

Compare the spacy lemmatize() function against the morphy() function in nltk (which originally comes from software created more than a decade ago):
    
    
    >>> from nltk.stem import WordNetLemmatizer
    >>> wnl = WordNetLemmatizer()
    >>> wnl.lemmatize('alvations')
    'alvations'
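Running the lemmatize function quoted above on the same made-up word shows the same fallback behavior: no exception matches, no suffix rule matches, so the word itself is returned, just like with nltk's WordNetLemmatizer:

```python
# The lemmatize function quoted above, reproduced for a standalone example.
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

rules = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
# No exception, no matching rule -> the word itself comes back.
print(lemmatize("alvations", {"wide"}, {}, rules))  # {'alvations'}
```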