Adding a part-of-speech column to a DataFrame in Python (python, nltk)

I have a DataFrame named df2, built from a set of words, that includes frequency and cumulative-total columns:

    word      freq    cum_total     
0   nesoi     1970    1970          
1   cotton    734     2704          
2   exceeding 732     3436          
3   fiber     620     4056          
4   part      618     4674

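I want to use NLTK to add a column to the table above showing the part of speech of each word in the "word" column, so that the output looks like this: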

    word      freq    cum_total  part_of_speech   
0   nesoi     1970    1970       noun
1   cotton    734     2704       noun    
2   exceeding 732     3436       adverb      
3   fiber     620     4056       adjective   
4   part      618     4674       pronoun

Here is my code:

import nltk
df2['part_of_speech']=df2['word'].apply(nltk.pos_tag)

The resulting output looks like this:

    word      freq    cum_total  part_of_speech   
0   nesoi     1970    1970       [(n, JJ), (e, NN), (s, NN), (o, NN), (i, NN)]  
1   cotton    734     2704       [(c, NNS), (o, VBP), (t, JJ), (t, NN), (o, NN)...    
2   exceeding 732     3436       [(e, NN), (x, NNP), (c, VBZ), (e, JJ), (e, IN)...      
3   fiber     620     4056       [(f, NN), (i, NN), (b, VBP), (e, NN), (r, NN)]   
4   part      618     4674       [(p, NN), (a, DT), (r, NN), (t, NN)]

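How can I write code that produces the desired part-of-speech column from the "word" column? Tag equivalents are fine (the 2- or 3-character POS abbreviations).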

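The pos_tag function assumes its input is a document/text made up of several words, that is, a sequence of tokens, and it iterates over whatever it is given. A bare string is therefore tagged one character at a time, which is why every letter got its own tag. The solution here is to wrap the input in a list. You can also pull the POS tag straight out of the nested list/tuple output: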

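import nltk
import pandas as pd
df = pd.DataFrame({'words': ['this', 'apple', 'run', 'pretty']})
# Wrap each word in a list so it is tagged as one token, then keep only the tag
df['pos'] = df['words'].apply(lambda x: nltk.pos_tag([x])[0][1])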

This gives you:

    words pos
0    this  DT
1   apple  NN
2     run  VB
3  pretty  RB
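
The difference between passing a string and passing a one-element list can be seen without NLTK at all. The sketch below uses a hypothetical stand_in_tagger (not part of NLTK) that, like a token-based tagger, simply iterates over its input and labels every item with a placeholder tag "X":

```python
def stand_in_tagger(tokens):
    """Hypothetical stand-in for a token-based tagger (not an NLTK function).

    It iterates over its input and pairs each item with the placeholder tag 'X'.
    """
    return [(token, "X") for token in tokens]


# A bare string is iterated character by character, so each letter is "tagged".
as_string = stand_in_tagger("part")

# Wrapping the word in a list makes the whole word a single token.
as_list = stand_in_tagger(["part"])

print(as_string)
print(as_list)
```

The first call yields one pair per letter, mirroring the unwanted output in the question; the second yields a single pair for the whole word.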

You beat me to it… I was going to suggest:

df['pos'] = pd.Series(nltk.pos_tag(df['words'])).apply(lambda x: x[1])

Yes, that looks fine too. Still, isn't it a bit odd that the function was written so that it cannot tell an array of words from a string?

Great. For Python coding, what does the [0][1] at the end of the expression mean?

The response comes back as a list containing tuples, i.e. [(a, b), (c, d) … (x, y)]. I first index the first element of the list with [0], and then the second element of that tuple with [1]. That gives you the part of speech from the tuple and nothing else. Does this answer your question?
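
The [0][1] indexing can be traced step by step on a plain list of tuples. In this minimal sketch, the literal value of result is only an illustrative assumption about the shape of a one-word tagging result; it is not produced by calling NLTK here:

```python
# Illustrative shape of a tagging result for a single-word input:
# a list holding one (word, tag) tuple. The value is assumed, not computed.
result = [("apple", "NN")]

first_pair = result[0]   # first element of the list: the (word, tag) tuple
tag = first_pair[1]      # second element of that tuple: the tag

# The same lookup written as one expression
same_tag = result[0][1]

# Tuple unpacking is an equivalent, sometimes more readable, alternative
word, unpacked_tag = result[0]

print(tag, same_tag, word, unpacked_tag)
```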