How do I fix an AttributeError when using spaCy in Python?


I am using spaCy for German natural language processing, but I ran into this error:

AttributeError: 'str' object has no attribute 'text'

This is the text data I am working with:

tex = ['Wir waren z.B. früher auf\'m Fahrrad unterwegs in München (immer nach 11 Uhr).',
        'Nun fahren wir öfter mit der S-Bahn in München herum. Tja. So ist das eben.',
        'So bleibt mir nichts anderes übrig als zu sagen, vielen Dank für alles.',
        'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.']

My code:

data = [re.sub(r"\"", "", i) for i in tex]
data1 = [re.sub(r"\“", "", i) for i in data]
data2 = [re.sub(r"\„", "", i) for i in data1]

nlp = spacy.load('de')
spacy_doc1 = []
for line in data2:
    spac = nlp(line)
    lem = [tok.lemma_ for tok in spac]
    no_punct = [tok.text for tok in lem if re.match('\w+', tok.text)]
    no_numbers = [tok for tok in no_punct if not re.match('\d+', tok)]

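I process each string separately because I need to map the processed result back to the specific original string.

I also understand that the result stored in lem is no longer in a format that spaCy can process.

So how do I do this correctly?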

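The problem here is that spaCy's token.lemma_ returns a string, and strings have no text attribute (as the error says). After lem = [tok.lemma_ for tok in spac], every item in lem is a plain str, so tok.text fails on the next line.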
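The failure can be reproduced without spaCy, since it only depends on lem holding plain strings. A minimal sketch, where the sample lemmas are made up for illustration:

```python
# Stand-in for `lem`: a list of plain strings, which is what
# [tok.lemma_ for tok in spac] produces. The values are made up.
lem = ["wir", "fahren", "mit", "der", "S-Bahn"]

try:
    # Same attribute access as in the failing list comprehension.
    texts = [tok.text for tok in lem]
except AttributeError as err:
    message = str(err)

print(message)
```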

I suggest writing the no_punct line the same way you already wrote this one:

no_numbers = [tok for tok in no_punct if not re.match('\d+', tok)]

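The only difference from that line in your code is that, if you run into English pronouns, you also have to allow the special string "-PRON-":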

import re
import spacy

# using the web English model for practicality here
nlp = spacy.load('en_core_web_sm')

tex = ['I\'m going to get a cat tomorrow',
        'I don\'t know if I\'ll be able to get him a cat house though!']

data = [re.sub(r"\"", "", i) for i in tex]
data1 = [re.sub(r"\“", "", i) for i in data]
data2 = [re.sub(r"\„", "", i) for i in data1]

spacy_doc1 = []

for line in data2:
    spac = nlp(line)
    lem = [tok.lemma_ for tok in spac]
    # lem already holds strings, so match on tok directly (no .text)
    no_punct = [tok for tok in lem if re.match(r'\w+', tok) or tok in ["-PRON-"]]
    no_numbers = [tok for tok in no_punct if not re.match(r'\d+', tok)]
    print(no_numbers)

# > ['-PRON-', 'be', 'go', 'to', 'get', 'a', 'cat', 'tomorrow']
# > ['-PRON-', 'do', 'not', 'know', 'if', '-PRON-', 'will', 'be', 'able', 'to', 'get', '-PRON-', 'a', 'cat', 'house', 'though']

Please let me know whether this solves your problem, since I may have misunderstood your question.
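An alternative that is not part of the answer above: filter while the items are still Token objects, using spaCy's Token.is_punct and Token.like_num flags, and read lemma_ last. The sketch below is only an illustration. clean_lemmas and FakeToken are hypothetical names, and FakeToken merely mimics the three Token attributes so the snippet runs without a model installed.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the three spaCy Token attributes used
# below; with spaCy installed, real Token objects take its place.
FakeToken = namedtuple("FakeToken", ["lemma_", "is_punct", "like_num"])


def clean_lemmas(doc):
    """Return lemma strings, skipping punctuation and number-like tokens."""
    return [tok.lemma_ for tok in doc if not tok.is_punct and not tok.like_num]


# With spaCy this would be called as clean_lemmas(nlp(line)).
doc = [
    FakeToken("immer", False, False),
    FakeToken("nach", False, False),
    FakeToken("11", False, True),
    FakeToken("Uhr", False, False),
    FakeToken(".", True, False),
]
print(clean_lemmas(doc))
```

Because the filtering happens before lemma_ is read, no regex is needed and the special-casing of placeholder strings goes away.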

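What exactly are you trying to achieve? Can you post some expected output?

Hi @JerryM., I basically want to lemmatize my text and remove numbers and punctuation. In the end I want to use the preprocessed data for LDA.

Hey @Half-Blood Prince, your solution works for me, since I want to preprocess data for LDA. The only catch is that I am working with German rather than English, but that is not a problem. One question: what is the issue with English pronouns?

Nothing major, but spaCy's lemmatizer returns "-PRON-" when it encounters an English pronoun instead of returning the actual pronoun. So I just added or tok in ["-PRON-"] to the code so that the regex does not remove potential English pronouns. But you can also choose to exclude them!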
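To make the effect of that extra condition concrete, here is a small sketch of the same two regex filters applied to a list of lemma strings. filter_lemmas, KEEP and the sample list are illustrative names and data, not part of the original answer.

```python
import re

KEEP = {"-PRON-"}  # placeholder lemmas that must survive the \w+ filter


def filter_lemmas(lemmas, keep=KEEP):
    """Drop strings that start with a non-word character or a digit,
    except for the placeholders listed in `keep`."""
    words = [t for t in lemmas if re.match(r"\w+", t) or t in keep]
    return [t for t in words if not re.match(r"\d+", t)]


sample = ["-PRON-", "be", "go", ",", "11", "cat", "!"]
print(filter_lemmas(sample))
```

"-PRON-" starts with a hyphen, so re.match(r"\w+", ...) fails on it. Passing keep=set() shows what happens without the extra condition: the placeholder is dropped together with the punctuation.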