Pandas: how to tokenize words in a DataFrame


This is my data:

No  Text                    
1   You are smart
2   You are beautiful

My expected output:

No  Text                   You    are  smart  beautiful                 
1   You are smart            1      1      1          0
2   You are beautiful        1      1      0          1

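For an nltk solution, use word_tokenize to get a list of words for each row, then MultiLabelBinarizer to turn those lists into indicator columns, and finally join the result back to the original DataFrame: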

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from nltk import word_tokenize  # needs NLTK's tokenizer data, e.g. nltk.download('punkt')

mlb = MultiLabelBinarizer()
# Tokenize each row's text into a list of words
s = df['Text'].apply(word_tokenize)
# One 0/1 column per distinct word, joined back on the original index
df = df.join(pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index))
print(df)
   No               Text  You  are  beautiful  smart
0   1      You are smart    1    1          0      1
1   2  You are beautiful    1    1          1      0
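
To try the approach above without a prepared `df`, here is a minimal self-contained sketch. It rebuilds the sample data from the question and, as an assumption made only for illustration, uses plain whitespace splitting in place of `word_tokenize` so that no NLTK data download is required. The two give the same tokens for this sample, but they differ on text with punctuation.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data reconstructed from the question
df = pd.DataFrame({"No": [1, 2],
                   "Text": ["You are smart", "You are beautiful"]})

# Assumption: whitespace splitting stands in for nltk's word_tokenize
tokens = df["Text"].str.split()

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix; classes_ holds the sorted distinct words
indicators = pd.DataFrame(mlb.fit_transform(tokens),
                          columns=mlb.classes_,
                          index=df.index)

result = df.join(indicators)
print(result)
```

The indicator columns follow the sorted order of `mlb.classes_`, which is why `beautiful` appears before `smart` rather than in the order given in the expected output.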

For a pure pandas solution, use str.get_dummies + join:

df = df.join(df['Text'].str.get_dummies(sep=' '))
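
As a runnable sketch of the pure pandas variant, the snippet below rebuilds the question's sample data (the DataFrame construction is an assumption for illustration) and applies the same one-liner.

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({"No": [1, 2],
                   "Text": ["You are smart", "You are beautiful"]})

# Split each string on single spaces and build one 0/1 column per distinct word
dummies = df["Text"].str.get_dummies(sep=" ")

# Attach the indicator columns to the original rows by index
out = df.join(dummies)
print(out)
```

Note that `str.get_dummies` only splits on the given separator, so punctuation stays attached to the word (`smart.` and `smart` would become separate columns). When the text contains punctuation, a real tokenizer such as `word_tokenize` is the safer choice.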