Pandas: how to tokenize words in a DataFrame
Here is my data:
No Text
1 You are smart
2 You are beautiful
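To reproduce the problem, the sample data can be built as a small DataFrame. This is only a sketch: the column names No and Text come from the table above, and the variable name df is assumed because the answer code refers to it.

```python
import pandas as pd

# Sample data from the question; the name df matches the answer code
df = pd.DataFrame({
    "No": [1, 2],
    "Text": ["You are smart", "You are beautiful"],
})
print(df)
```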
My expected output:
No  Text               You  are  smart  beautiful
1   You are smart      1    1    1      0
2   You are beautiful  1    1    0      1
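For an nltk solution, use word_tokenize to turn each text into a list of words, then MultiLabelBinarizer to build the indicator columns, and finally join them back to the original DataFrame: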
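import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from nltk import word_tokenize  # needs the NLTK Punkt tokenizer data
mlb = MultiLabelBinarizer()
# tokenize each row's text into a list of words
s = df['Text'].apply(word_tokenize)
# one indicator column per distinct word, joined back on the original index
df = df.join(pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index))
print(df)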
   No               Text  You  are  beautiful  smart
0   1      You are smart    1    1          0      1
1   2  You are beautiful    1    1          1      0
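For readers who want to try the MultiLabelBinarizer step without downloading NLTK data, here is a minimal self-contained sketch. It swaps word_tokenize for plain whitespace splitting with str.split, which is an assumption that only holds for simple text like this sample (word_tokenize would also separate punctuation). The names tokens, indicators and result are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "No": [1, 2],
    "Text": ["You are smart", "You are beautiful"],
})

# Assumption: whitespace splitting stands in for nltk.word_tokenize
tokens = df["Text"].str.split()

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix with one column per distinct token;
# mlb.classes_ holds the tokens in sorted order
indicators = pd.DataFrame(
    mlb.fit_transform(tokens),
    columns=mlb.classes_,
    index=df.index,
)

result = df.join(indicators)
print(result)
```

The indicator columns come out in sorted order (You, are, beautiful, smart), which is why the printed result differs in column order from the expected output in the question.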
For a pure pandas solution, use str.get_dummies + join:
df = df.join(df['Text'].str.get_dummies(sep=' '))
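The pure pandas approach can be sketched end to end as follows. str.get_dummies returns its columns in sorted order, so this example also reorders them by first appearance in the text to match the expected output in the question. The reordering step and the names dummies, order and result are additions for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "No": [1, 2],
    "Text": ["You are smart", "You are beautiful"],
})

# One 0/1 indicator column per space-separated word
dummies = df["Text"].str.get_dummies(sep=" ")

# Order the columns by first appearance in the text instead of
# the default sorted order (dict.fromkeys keeps insertion order)
order = list(dict.fromkeys(" ".join(df["Text"]).split()))

result = df.join(dummies[order])
print(result)
```

Note that df.join raises an error if a word in the text has the same name as an existing column (for example a text containing the word Text), unless a suffix is passed via lsuffix or rsuffix.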