Python: creating a vocabulary using POS tags
I would like to create a list of semantic entities (nouns, verbs, punctuation, etc.) using part-of-speech tags. I am currently running the following code
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])
def fun(text):
    doc = nlp(text)
    pos = ""
    for token in doc:
        pos += token.pos_ + " "
    return pos
df['S'] = df.Text.apply(fun)
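to build the structure of each sentence.
So, for example, if I have the column Text (see below), this code generates the column S, which contains the POS tag sequence of each sentence: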
   Text                                              S
0  “I will meet quite a few people, it’s well...     PUNCT NOUN VERB VERB DET DET ADJ NOUN PUNCT PR...
1  Says “Cristiano Ronaldo’s family still owns”...   VERB PUNCT PROPN PROPN PART NOUN ADV VERB PUNC...
2  Joe Biden plagiarized Donald Trump in his...      PROPN PROPN VERB PROPN PROPN ADP DET PROP...
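I would like to know whether I can build a vocabulary of nouns, verbs, adjectives, and so on by editing the code above, or whether I need to consider a different approach.
To get all the entities (nouns, verbs, etc.) in the DataFrame, I would select only the unique values, so that there is one list per POS tag.
Example output (it could also be plain lists rather than a DataFrame)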
You can try:
import spacy
import pandas as pd
from pprint import pprint  # import the function, not just the module
nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])
texts = ['"I will meet quite a few people, it\'s well',
         'Says "Cristiano Ronaldo\'s family still owns"',
         'Joe Biden plagiarized Donald Trump in his...']
df = pd.DataFrame({"Text": texts})
d = dict()  # maps each POS tag to the list of token texts seen with it
def func(text):
    doc = nlp(text)
    for tok in doc:
        if tok.pos_ not in d:
            d[tok.pos_] = [tok.text]
        else:
            d[tok.pos_].append(tok.text)
# apply is used only for its side effect of filling d
df.Text.apply(func)
pprint(d)
Note that you do not need to rely on pandas at all:
docs = nlp.pipe(texts)
d = dict()
for doc in docs:
    for tok in doc:
        if tok.pos_ not in d:
            d[tok.pos_] = [tok.text]
        else:
            d[tok.pos_].append(tok.text)
pprint(d)
Both versions collect every token under its POS tag.
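The grouping step above can also be sketched with collections.defaultdict, which removes the explicit "is the key already there" check. This is only a sketch: the helper name group_by_pos and the hand-written (text, pos) pairs are assumptions made for illustration, standing in for what spaCy would produce.

```python
from collections import defaultdict

def group_by_pos(tagged_tokens):
    """Group token texts by POS tag, keeping duplicates and input order."""
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for text, pos in tagged_tokens:
        groups[pos].append(text)
    return dict(groups)

# Hand-written sample pairs (an assumption, not real model output).
# With spaCy the pairs could be produced as:
#   tagged = ((tok.text, tok.pos_) for doc in nlp.pipe(texts) for tok in doc)
sample = [("I", "PRON"), ("will", "VERB"), ("meet", "VERB"),
          ("people", "NOUN"), ("it", "PRON")]

vocab = group_by_pos(sample)
print(vocab)
```

Keeping the grouping in a function that accepts plain pairs also makes it easy to check without loading a language model.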
If you only need lists of unique tokens:
texts = ['"I will will meet quite a few people, it\'s well',
         'Says "Cristiano Ronaldo\'s family still owns"',
         'Joe Biden plagiarized Donald Trump in his...']
docs = nlp.pipe(texts)
d = dict()
for doc in docs:
    for tok in doc:
        if tok.pos_ not in d:
            d[tok.pos_] = [tok.text]
        elif tok.text not in d[tok.pos_]:
            d[tok.pos_].append(tok.text)
pprint(d)
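The membership test on a list rescans that list for every token, which can become slow on a large corpus. A possible alternative, sketched below, uses a dict per POS tag as an insertion-ordered set. The helper name unique_by_pos and the sample pairs are assumptions made for illustration, not part of the original answer.

```python
def unique_by_pos(tagged_tokens):
    """Collect unique token texts per POS tag, in first-seen order."""
    seen = {}  # pos -> dict whose keys act as an insertion-ordered set
    for text, pos in tagged_tokens:
        seen.setdefault(pos, {})[text] = None
    return {pos: list(texts) for pos, texts in seen.items()}

# Hand-written sample pairs (an assumption, not real model output).
# With spaCy the pairs could be produced as:
#   tagged = ((tok.text, tok.pos_) for doc in nlp.pipe(texts) for tok in doc)
sample = [("I", "PRON"), ("will", "VERB"), ("will", "VERB"),
          ("meet", "VERB"), ('"', "PUNCT"), ('"', "PUNCT")]

unique_vocab = unique_by_pos(sample)
print(unique_vocab)
```

A set per tag would also work if the order of the tokens does not matter.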
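Expected output?
Hi Sergey Bushmanov, I updated the question with an example output.
Thank you very much, Sergey. But I get this error: TypeError: 'module' object is not callable (caused by --> 14 pprint(d)). Do you know what it means?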
Put from pprint import pprint at the top.
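That error message suggests the name pprint was bound to the module, most likely through a plain import pprint, rather than to the function of the same name. The short sketch below shows the difference; the alias pprint_module is only an illustrative name.

```python
import pprint as pprint_module   # binds the module object
from pprint import pprint        # binds the function inside that module

# Calling a module raises TypeError: 'module' object is not callable.
print(callable(pprint_module))   # the module itself cannot be called
print(callable(pprint))          # the function can

# Either of these two spellings works:
pprint({"NOUN": ["people", "family"]})
pprint_module.pprint({"NOUN": ["people", "family"]})
```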
Output of the loop that keeps duplicates:
{'ADJ': ['few'],
 'ADP': ['in'],
 'ADV': ['well', 'still'],
 'AUX': ["'s"],
 'DET': ['quite', 'a', 'his'],
 'NOUN': ['people', 'family'],
 'PART': ["'s"],
 'PRON': ['I', 'it'],
 'PROPN': ['Cristiano', 'Ronaldo', 'Joe', 'Biden', 'Donald', 'Trump'],
 'PUNCT': ['"', ',', '"', '"', '...'],
 'VERB': ['will', 'meet', 'Says', 'owns', 'plagiarized']}
Output of the loop that keeps only unique tokens (the first text contains "will will"):
{'ADJ': ['few'],
 'ADP': ['in'],
 'ADV': ['well', 'still'],
 'AUX': ["'s"],
 'DET': ['quite', 'a', 'his'],
 'NOUN': ['people', 'family'],
 'PART': ["'s"],
 'PRON': ['I', 'it'],
 'PROPN': ['Cristiano', 'Ronaldo', 'Joe', 'Biden', 'Donald', 'Trump'],
 'PUNCT': ['"', ',', '...'],
 'VERB': ['will', 'meet', 'Says', 'owns', 'plagiarized']}