Python: how do I create a dictionary for spaCy NLP?


I'm going to use the spaCy NLP engine, and I want to start from a dictionary. I've read the documentation, but I can't figure out how to get started.

I have the following code:

from spacy.en import English  # spaCy 1.x import style
parser = English()

# Test Data
multiSentence = "There is an art, it says, or rather, a knack to flying." \
                 "The knack lies in learning how to throw yourself at the ground and miss." \
                 "In the beginning the Universe was created. This has made a lot of people "\
                 "very angry and been widely regarded as a bad move."
parsedData = parser(multiSentence)
for i, token in enumerate(parsedData):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("----------------------------------------")
    if i > 1:
        break

# Let's look at the sentences
sents = []
for span in parsedData.sents:
    # go from the start to the end of each span, returning each token in the sentence
    # combine each token using join()
    sent = ''.join(parsedData[i].string for i in range(span.start, span.end)).strip()
    sents.append(sent)

print('To show sentence')
for sentence in sents:
    print(sentence)


# Let's look at the part of speech tags of the first sentence
for span in parsedData.sents:
    sent = [parsedData[i] for i in range(span.start, span.end)]
    break

for token in sent:
    print(token.orth_, token.pos_)

# Let's look at the dependencies of this example:
example = "The boy with the spotted dog quickly ran after the firetruck."
parsedEx = parser(example)
# shown as: original token, dependency tag, head word, left dependents, right dependents
for token in parsedEx:
    print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])

# Let's look at the named entities of this example:
example = "Apple's stocks dropped dramatically after the death of Steve Jobs in October."
parsedEx = parser(example)
for token in parsedEx:
    print(token.orth_, token.ent_type_ if token.ent_type_ != "" else "(not an entity)")

print("-------------- entities only ---------------")
# if you just want the entities and nothing else, you can do access the parsed examples "ents" property like this:
ents = list(parsedEx.ents)
for entity in ents:
    print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))

messyData = "lol that is rly funny :) This is gr8 i rate it 8/8!!!"
parsedData = parser(messyData)
for token in parsedData:
    print(token.orth_, token.pos_, token.lemma_)
Where can I change these tokens (token.orth, token.orth_, etc.)?


And can I save these tokens in a dictionary of my own? Thanks for your help.

It's not clear what kind of data structure you need, but let's try to answer a few questions.

Q: Where can I change these tokens (token.orth, token.orth_, ...)?

These tokens should not be changed, because they are annotations created by spacy's English model. (See their definitions.)

For details on what the individual annotations mean, see the spaCy documentation.

Q: But can we change the annotations of these tokens?

Maybe yes, maybe no.

Looking at the code, we see that the `Doc` class is a rather complex Cython object:

cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.
    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.
    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')
    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """
But in general, it is a sequence of `Token` objects whose annotations are views of an underlying array owned by the `Doc` itself.

First, let's see whether some of these annotations are mutable, starting with the POS tags:

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp('This is a foo bar sentence.')

>>> type(doc[0]) # First word. 
<class 'spacy.tokens.token.Token'>

>>> dir(doc[0]) # Properties/functions available for the Token object. 
['__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_type', 'ent_type_', 'has_repvec', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ancestor_of', 'is_ascii', 'is_bracket', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_space', 'is_stop', 'is_title', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'repvec', 'right_edge', 'rights', 'sentiment', 'shape', 'shape_', 'similarity', 'string', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_']

# The POS tag assigned by spacy's model.
>>> doc[0].tag_ 
'DT'

# Let's try to override it.
>>> doc[0].tag_ = 'NN'

# It works!!!
>>> doc[0].tag_
'NN'

# What if we overwrite index of the tag_ rather than the form?
>>> doc[0].tag
474
>>> doc[0].tag = 123
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/token.pyx", line 206, in spacy.tokens.token.Token.tag.__set__ (spacy/tokens/token.cpp:6755)
  File "spacy/morphology.pyx", line 64, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:4540)
KeyError: 123
>>> doc[0].tag = 352
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/token.pyx", line 206, in spacy.tokens.token.Token.tag.__set__ (spacy/tokens/token.cpp:6755)
  File "spacy/morphology.pyx", line 64, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:4540)
KeyError: 352
Now we see that some token annotations, like the integer `.tag` index above, are protected from being overwritten with arbitrary values. That is most likely because changing them would break the way tokens map back to the original offsets in the input string.
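The integer/string pairing above (e.g. `.tag` = 474 for `'DT'`) comes from spaCy interning every string as a stable integer id. A minimal pure-Python stand-in of that idea (the class name `StringStore` is borrowed from spaCy, but this is a sketch, not spaCy's implementation):

```python
# Minimal sketch of string interning, roughly how spaCy's StringStore works:
# each distinct string gets a stable integer id, and annotations store the id.
# An id that was never interned has no string to map back to, which is
# roughly why `doc[0].tag = 123` raised a KeyError above.
class StringStore:
    def __init__(self):
        self._to_id = {}
        self._to_str = {}

    def add(self, s):
        """Intern a string, returning its stable integer id."""
        if s not in self._to_id:
            i = len(self._to_id) + 1
            self._to_id[s] = i
            self._to_str[i] = s
        return self._to_id[s]

    def __getitem__(self, i):
        """Map an id back to its string; unknown ids fail loudly."""
        if i not in self._to_str:
            raise KeyError(i)  # like assigning an unknown tag id
        return self._to_str[i]

store = StringStore()
tag_id = store.add("DT")
print(tag_id, store[tag_id])  # the id round-trips back to 'DT'
```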

Ans: It seems some attributes of the `Token` object can be changed, while others cannot.

Q: So which token attributes can be changed, and which cannot?

An easy way to check is to look for a `__set__` function on the Cython properties in `spacy/tokens/token.pyx`. Properties that define `__set__` are mutable, and those are most likely the token attributes that can be overridden/changed.

For example, we'll see that `.tag_` and `.lemma_` are mutable, but `.pos_` is not:

>>> doc[0].lemma_
'this'
>>> doc[0].lemma_ = 'that'
>>> doc[0].lemma_
'that'

>>> doc[0].tag_ 
'DT'
>>> doc[0].tag_ = 'NN'
>>> doc[0].tag_
'NN'

>>> doc[0].pos_
'NOUN'
>>> doc[0].pos_ = 'VERB'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: attribute 'pos_' of 'spacy.tokens.token.Token' objects is not writable
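If you don't want to read the `.pyx` source, the same "look for `__set__`" idea can be approximated at runtime: re-assign each attribute its own current value and see whether an `AttributeError` is raised. Below is a hedged sketch on a plain Python stand-in class (a property without a setter behaves like a read-only Cython property); the same `writable_attrs` probe should work on a real `Token`:

```python
# Probe which attributes accept assignment by writing back the current
# value and catching AttributeError. Stand-in class, not spaCy itself.
class Token:
    def __init__(self):
        self._tag = "DT"
        self._pos = "NOUN"

    @property
    def tag_(self):
        return self._tag

    @tag_.setter
    def tag_(self, value):  # has a setter -> writable, like spaCy's .tag_
        self._tag = value

    @property
    def pos_(self):
        return self._pos    # no setter -> read-only, like spaCy's .pos_

def writable_attrs(obj, names):
    """Return the subset of `names` that can be assigned on `obj`."""
    out = []
    for name in names:
        try:
            setattr(obj, name, getattr(obj, name))  # no-op write
            out.append(name)
        except AttributeError:
            pass
    return out

print(writable_attrs(Token(), ["tag_", "pos_"]))  # only tag_ is writable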


Could you explain further what you need in your desired dictionary? Also note that, in this version of spaCy, a parsed `Doc` object does not survive a pickle round trip:
>>> import pickle
>>> import spacy

>>> nlp = spacy.load('en')
>>> doc = nlp('This is a foo bar sentence.')

>>> doc
This is a foo bar sentence.

# Pickle the Doc object.
>>> pickle.dump(doc, open('spacy_processed_doc.pkl', 'wb'))

# Now you see me.
>>> doc
This is a foo bar sentence.
# Now you don't
>>> doc = None
>>> doc

# Let's load the saved pickle.
>>> doc = pickle.load(open('spacy_processed_doc.pkl', 'rb'))
>>> doc

>>> type(doc)
<class 'spacy.tokens.doc.Doc'>
>>> doc[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 185, in spacy.tokens.doc.Doc.__getitem__ (spacy/tokens/doc.cpp:5550)
TypeError: 'NoneType' object is not subscriptable
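Given that, one way to get the dictionary the question asks about (and one that serializes safely) is to copy the annotations you care about into plain Python structures before pickling. A minimal sketch; `token_to_dict` only assumes the `.orth_`/`.lemma_`/`.tag_` attributes shown earlier, and a namedtuple stands in here for a spaCy `Token`:

```python
import pickle
from collections import namedtuple

def token_to_dict(token):
    """Copy a token's (read-only) annotations into a mutable plain dict."""
    return {
        "orth": token.orth_,
        "lemma": token.lemma_,
        "tag": token.tag_,
    }

# Stand-in for iterating a parsed Doc; with spaCy you'd pass real tokens.
FakeToken = namedtuple("FakeToken", ["orth_", "lemma_", "tag_"])
doc = [FakeToken("This", "this", "DT"), FakeToken("is", "be", "VBZ")]

records = [token_to_dict(t) for t in doc]
records[0]["tag"] = "NN"              # the copies are freely mutable

blob = pickle.dumps(records)          # plain dicts pickle fine...
restored = pickle.loads(blob)         # ...and round-trip intact
print(restored)
```

Unlike the `Doc`, this list of dicts can be edited, pickled, and reloaded without losing anything.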