Why doesn't spaCy preserve intra-word hyphens during tokenization the way Stanford CoreNLP does?

python-3.x nlp

spaCy version: 2.0.11

Python version: 3.6.5

OS: Ubuntu 16.04

My sample sentences:

Marketing-Representative- won't die in car accident.

or

Out-of-box implementation

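Expected tokens:

["Marketing-Representative", "-", "wo", "n't", "die", "in", "car", "accident", "."]

["Out-of-box", "implementation"]

spaCy tokens (default tokenizer):

["Marketing", "-", "Representative-", "wo", "n't", "die", "in", "car", "accident", "."]

["Out", "-", "of", "-", "box", "implementation"]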

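I tried creating a custom tokenizer, but it does not handle all the edge cases that spaCy covers through its tokenizer_exceptions (code below):

Output: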

Marketing-Representative-
won
'
t
die
in
car
accident
.

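I need someone to point me to an appropriate way of doing this.

Either a change to the regex above or any other method would do. I also tried spaCy's rule-based Matcher, but I could not write a rule that handles hyphens between more than two words, e.g. "out-of-box", so that the matches could be used with span.merge().

Either way, I need words containing intra-word hyphens to become single tokens, the way Stanford CoreNLP handles them.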

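Although it is not documented in the spaCy docs, it looks like we only need to add a regex for the *fix we are working with, in this case the infix.

It also appears that we can extend nlp.Defaults.prefixes with custom regexes: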

infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")

This gives you the desired result. There is no need to set defaults for the prefix and suffix, since we are not working with those.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

# 'en' is the spaCy 2.x shortcut link for the English model
nlp = spacy.load('en')

# Reuse the default prefix patterns as infixes and append three custom ones
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    # Only infix handling is configured: no prefix/suffix search, no exception rules
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in (s1, s2):
    doc = nlp(s)
    print([token.text for token in doc])

Result:
$python3 /tmp/nlp.py  
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
['Out-of-box', 'implementation']  

You may want to refine the added regexes so that they are more robust against other kinds of tokens that come close to matching them.
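To make the role of each added pattern easier to see, here is a minimal sketch that imitates the infix step with the standard re module only. It is not spaCy's implementation: the names EXTRA_PATTERNS, split_on_infixes, and tokenize are hypothetical, the default prefix patterns are left out, and joining the patterns with "|" is an assumption about how they end up in one regex.

```python
import re

# The three patterns the answer appends; spaCy's default prefix patterns are omitted here.
EXTRA_PATTERNS = (r"[./]", r"[-]~", r"(.'.)")

# Assumption: the patterns are combined into a single alternation.
infix_re = re.compile("|".join(EXTRA_PATTERNS))

def split_on_infixes(chunk):
    """Cut a whitespace-free chunk at every infix match, keeping each match as a token."""
    pieces = []
    start = 0
    for match in infix_re.finditer(chunk):
        if match.start() > start:
            pieces.append(chunk[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(chunk):
        pieces.append(chunk[start:])
    return pieces

def tokenize(text):
    """Split on whitespace first, then apply the infix cut to each chunk."""
    return [piece for chunk in text.split() for piece in split_on_infixes(chunk)]

print(tokenize("Marketing-Representative- won't die in car accident."))
print(tokenize("Out-of-box implementation"))
```

Read this way, r"[./]" is what separates the final period, r"(.'.)" peels off the apostrophe together with its two neighbours (giving "wo" and "n't"), and r"[-]~" only matches a hyphen followed by a tilde, so an ordinary hyphen never triggers a split.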
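I also wanted to modify spaCy's tokenizer so that it matches CoreNLP's semantics more closely. Pasted below is what I came up with; it solves the hyphen issue from this thread (including trailing hyphens) and adds a few extra fixes. I had to copy the default infix expressions and modify them, but the new suffix expression could simply be appended: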

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS

def initializeTokenizer(nlp):

    prefixes = nlp.Defaults.prefixes 
    
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r'(?<=[0-9])[+\-\*^](?=[0-9-])',
            r'(?<=[{al}{q}])\.(?=[{au}{q}])'.format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            # REMOVE: commented out regex that splits on hyphens between letters:
            #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            # EDIT: remove split on slash between letters, and add comma
            #r'(?<=[{a}0-9])[:<>=/](?=[{a}])'.format(a=ALPHA),
            r'(?<=[{a}0-9])[:<>=,](?=[{a}])'.format(a=ALPHA),
            # ADD: ampersand as an infix character except for dual upper FOO&FOO variant
            r'(?<=[{a}0-9])[&](?=[{al}0-9])'.format(a=ALPHA, al=ALPHA_LOWER),
            r'(?<=[{al}0-9])[&](?=[{a}0-9])'.format(a=ALPHA, al=ALPHA_LOWER),
        ]
    )

    # ADD: add suffix to split on trailing hyphen
    custom_suffixes = [r'[-]']
    suffixes = nlp.Defaults.suffixes
    suffixes = tuple(list(suffixes) + custom_suffixes)

    infix_re = spacy.util.compile_infix_regex(infixes)
    suffix_re = spacy.util.compile_suffix_regex(suffixes)

    nlp.tokenizer.suffix_search = suffix_re.search
    nlp.tokenizer.infix_finditer = infix_re.finditer
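The two hyphen-related changes above can be checked in isolation with plain re. This is only a sketch under stated assumptions: ALPHA_ASCII is a simplified stand-in for spaCy's ALPHA class, a literal "-" stands in for the HYPHENS alternation, strip_trailing_hyphen is a hypothetical helper, and the "$" anchor is my assumption about how a suffix pattern is applied to the end of a chunk.

```python
import re

# Simplified stand-in for spaCy's ALPHA character class (ASCII letters only).
ALPHA_ASCII = "a-zA-Z"

# Shape of the infix rule the answer comments out: a hyphen with a letter on both sides.
intra_word_hyphen = re.compile(r"(?<=[{a}])-(?=[{a}])".format(a=ALPHA_ASCII))

# The added suffix rule, anchored to the end of the chunk (assumption, see lead-in).
trailing_hyphen = re.compile(r"[-]$")

def strip_trailing_hyphen(chunk):
    """Split a trailing hyphen off a chunk and leave inner hyphens untouched."""
    match = trailing_hyphen.search(chunk)
    if match and match.start() > 0:
        return [chunk[:match.start()], chunk[match.start():]]
    return [chunk]

chunk = "Marketing-Representative-"

# Positions where the removed infix rule would have split the chunk.
inner_hits = [m.start() for m in intra_word_hyphen.finditer(chunk)]
pieces = strip_trailing_hyphen(chunk)

print(inner_hits)
print(pieces)
```

The inner hyphen is the only place the removed infix rule would have fired, because the trailing hyphen has no letter after it. Dropping that rule keeps "Marketing-Representative" whole, while the suffix rule separates the final "-".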


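Thank you for your response. Your solution works fine, although I still cannot fix the trailing hyphen in the token generated by the custom tokenizer ("Marketing-Representative-"). I am working on it. Why did you write infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)") instead of just infixes = nlp.Defaults.prefixes + (r"[-]~")? What are the first and last patterns for?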
