Why do CoreNLP ner tagger and ner tagger join the separated numbers together?

python nlp stanford-nlp

Here is the code snippet:

In [390]: t
Out[390]: ['my', 'phone', 'number', 'is', '1111', '1111', '1111']

In [391]: ner_tagger.tag(t)
Out[391]: 
[('my', 'O'),
 ('phone', 'O'),
 ('number', 'O'),
 ('is', 'O'),
 ('1111\xa01111\xa01111', 'NUMBER')]

What I expected is:

Out[391]: 
[('my', 'O'),
 ('phone', 'O'),
 ('number', 'O'),
 ('is', 'O'),
 ('1111', 'NUMBER'),
 ('1111', 'NUMBER'),
 ('1111', 'NUMBER')]

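As you can see, the made-up phone number is joined by \xa0, which is a non-breaking space. Can I keep the numbers separate by configuring CoreNLP, without changing its other default rules?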

ner_tagger is defined as follows:

ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')

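TL;DR: NLTK joins the list of tokens into a single string and passes that string to the CoreNLP server. CoreNLP then re-tokenizes the input and joins the number-like tokens with \xa0 (a non-breaking space).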
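If patching NLTK or changing the server properties is not an option, one possible workaround is to post-process the tagger output and split any token that contains a non-breaking space. This is a minimal sketch, assuming every piece of a merged token should inherit the tag of the merged token; `split_nbsp_tokens` is a hypothetical helper, not part of NLTK or CoreNLP.

```python
def split_nbsp_tokens(tagged):
    """Split (word, tag) pairs whose word contains a non-breaking space."""
    result = []
    for word, tag in tagged:
        # '\xa0' is the non-breaking space seen between the joined pieces.
        for piece in word.split('\xa0'):
            # Assumption: each piece keeps the tag of the merged token.
            result.append((piece, tag))
    return result


tagged = [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
          ('1111\xa01111\xa01111', 'NUMBER')]
print(split_nbsp_tokens(tagged))
```

The tags produced this way are those of the merged token, so they may differ from what the tagger would assign to each piece on its own.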

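In long: let's walk through the code. If we look at the tag() function in CoreNLPParser, we see that it calls the tag_sents() function, which converts the input list of strings into a single string before calling raw_tag_sents(). That conversion is what allows CoreNLPParser to re-tokenize the input.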

When called, raw_tag_sents() passes the input to the server using api_call().
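To make that flow concrete, here is a rough sketch of what happens to the token list before it reaches the server. It does not contact a server. `build_request` is a hypothetical helper, and the request layout (properties sent as a JSON-encoded query parameter, text sent as the UTF-8 request body) is an assumption about how api_call() talks to CoreNLP, so check the NLTK source for your version.

```python
import json


def build_request(tokens, tagtype='ner', properties=None):
    """Return the (params, body) pair that would be sent for one sentence."""
    # tag_sents() joins the tokens, so the original boundaries are lost here.
    sentence = ' '.join(tokens)

    # Same defaults and merge order as raw_tag_sents().
    default_properties = {'ssplit.isOneSentence': 'true',
                          'annotators': 'tokenize,ssplit,'}
    default_properties.update(properties or {})
    default_properties['annotators'] += tagtype

    # Assumed layout: properties as a JSON string, text as the encoded body.
    params = {'properties': json.dumps(default_properties)}
    return params, sentence.encode('utf-8')


params, body = build_request(['my', 'phone', 'number', 'is', '1111', '1111', '1111'])
print(body)
print(params['properties'])
```

The server only ever sees the joined string, which is why it is free to tokenize it differently from the list that was passed in.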

So the question is: how do we solve the problem and get the tokens back exactly as they were passed in?

If we look at the options for the tokenizer in CoreNLP, we see the tokenize.whitespace option.
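With tokenize.whitespace set to true, the tokenizer separates words only where it encounters whitespace, so a space-joined token list should come back with the same boundaries. The snippet below only imitates that behavior with `str.split()` to show why the round trip works. It is not CoreNLP's tokenizer, and it assumes that no token contains whitespace itself.

```python
tokens = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']

# What NLTK sends to the server: a single space-joined string.
sentence = ' '.join(tokens)

# Whitespace-only tokenization, imitated here with str.split().
round_trip = sentence.split()
print(round_trip == tokens)

# A token that contains a space would not survive the round trip.
broken = ' '.join(['New York', 'City']).split()
print(broken)
```

The second case shows the limit of this approach: tokens with embedded spaces cannot be recovered once the list has been joined.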

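If we make some changes to allow additional properties to be set before calling api_call(), we can force the tokens to be passed to the CoreNLP server joined by whitespace and kept as they are, e.g. with these changes to the code: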
def tag_sents(self, sentences, properties=None):
    """
    Tag multiple sentences.

    Takes multiple sentences as a list where each sentence is a list of
    tokens.

    :param sentences: Input sentences to tag
    :type sentences: list(list(str))
    :rtype: list(list(tuple(str, str)))
    """
    # Converting list(list(str)) -> list(str)
    sentences = (' '.join(words) for words in sentences)
    if properties is None:
        # Default to whitespace tokenization so the caller's tokens are kept as-is.
        properties = {'tokenize.whitespace': 'true'}
    return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]

def tag(self, sentence, properties=None):
    """
    Tag a list of tokens.

    :rtype: list(tuple(str, str))

    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
    >>> parser.tag(tokens)
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
    ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

    >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> tokens = "What is the airspeed of an unladen swallow ?".split()
    >>> parser.tag(tokens)
    [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
    ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
    ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
    """
    return self.tag_sents([sentence], properties)[0]

def raw_tag_sents(self, sentences, properties=None):
    """
    Tag multiple sentences.

    Takes multiple sentences as a list where each sentence is a string.

    :param sentences: Input sentences to tag
    :type sentences: list(str)
    :rtype: list(list(list(tuple(str, str))))
    """
    default_properties = {'ssplit.isOneSentence': 'true',
                          'annotators': 'tokenize,ssplit,' }

    default_properties.update(properties or {})

    # Supports only 'pos' or 'ner' tags.
    assert self.tagtype in ['pos', 'ner']
    default_properties['annotators'] += self.tagtype
    for sentence in sentences:
        tagged_data = self.api_call(sentence, properties=default_properties)
        yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
                for tagged_sentence in tagged_data['sentences']]

After making the changes above:
>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]
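One detail of the patched tag_sents() is worth noting: the whitespace default applies only when properties is None, so passing any explicit dictionary replaces it. The sketch below mirrors that logic in plain Python, outside of NLTK, to show which properties end up in the request. `effective_properties` is a hypothetical helper written only for illustration.

```python
def effective_properties(properties=None, tagtype='ner'):
    """Mirror the property handling of the patched tag_sents()/raw_tag_sents()."""
    # Same default as the patched tag_sents().
    if properties is None:
        properties = {'tokenize.whitespace': 'true'}
    # Same merge as raw_tag_sents().
    merged = {'ssplit.isOneSentence': 'true', 'annotators': 'tokenize,ssplit,'}
    merged.update(properties)
    merged['annotators'] += tagtype
    return merged


# No explicit properties: whitespace tokenization is switched on.
print(effective_properties())

# Explicit properties: the whitespace default is no longer applied.
print(effective_properties({'ssplit.isOneSentence': 'false'}))
```

A caller who passes custom properties and still wants the tokens kept as-is would need to include 'tokenize.whitespace': 'true' in that dictionary.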

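Could you show the full Python code for how you call CoreNLPParser? Otherwise there isn't enough information to explain why this happens.

@alvas Yes, I've updated the question.

Ah, now this is interesting =) Thanks for catching this! It has been raised as an issue.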