Python: how do I solve this pandas problem?
I am preprocessing my dataset with pd.get_dummies, but the result is not what I need. Is pd.get_dummies() the right tool here, or is there another approach I can try?
import pandas as pd

rawdataset = [['apple', 'banana', 'carrot', 'daikon', 'egg'],
              ['apple', 'banana'],
              ['apple', 'banana', 'carrot'],
              ['daikon', 'egg', 'fennel'],
              ['apple', 'banana', 'daikon']]
dataset = pd.DataFrame(data=rawdataset)
print(pd.get_dummies(dataset))
I want this:
   apple  banana  carrot  daikon  egg  fennel
0      1       1       1       1    1       0
1      1       1       0       0    0       0
........
Not this:
   0_apple  0_daikon  1_banana  1_egg  2_carrot  2_daikon  2_fennel
0        1         0         1      0         1         0         0
1        1         0         1      0         0         0         0
....
Here you go:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

rawdataset = [['apple', 'banana', 'carrot', 'daikon', 'egg'],
              ['apple', 'banana'],
              ['apple', 'banana', 'carrot'],
              ['daikon', 'egg', 'fennel'],
              ['apple', 'banana', 'daikon']]

def dummy(doc):
    # Each row is already a list of tokens, so pass it through unchanged.
    return doc

count_vec = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
count_vec.fit(rawdataset)
X = count_vec.transform(rawdataset).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
pd.DataFrame(X, columns=count_vec.get_feature_names_out())
Result:
   apple  banana  carrot  daikon  egg  fennel
0      1       1       1       1    1       0
1      1       1       0       0    0       0
2      1       1       1       0    0       0
3      0       0       0       1    1       1
4      1       1       0       1    0       0
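An added benefit here is that you can also apply the fitted vectorizer to unseen data, whereas pd.get_dummies cannot transform unseen test data in the same way.
Try: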
yields:
   apple  banana  carrot  daikon  egg  fennel
0      0       0       0       0    0       0
This is the correct output.

Different ways to skin a cat

pd.get_dummies and max
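# max(level=0, axis=1) was removed in pandas 2.0; group the duplicated column labels instead.
pd.get_dummies(dataset, prefix="", prefix_sep="", dtype=int).T.groupby(level=0, sort=False).max().T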
   apple  daikon  banana  egg  carrot  fennel
0      1       1       1    1       1       0
1      1       0       1    0       0       0
2      1       0       1    0       1       0
3      0       1       0    1       0       1
4      1       1       1    0       0       0
stack, str.get_dummies, and sum/max:
# sum(level=0) was removed in pandas 2.0; use groupby(level=0).sum() instead.
dataset.stack().str.get_dummies().groupby(level=0).sum()
   apple  banana  carrot  daikon  egg  fennel
0      1       1       1       1    1       0
1      1       1       0       0    0       0
2      1       1       1       0    0       0
3      0       0       0       1    1       1
4      1       1       0       1    0       0
stack and crosstab
u = dataset.stack()
pd.crosstab(u.index.get_level_values(0), u)
col_0  apple  banana  carrot  daikon  egg  fennel
row_0
0          1       1       1       1    1       0
1          1       1       0       0    0       0
2          1       1       1       0    0       0
3          0       0       0       1    1       1
4          1       1       0       1    0       0
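The pandas variants above can be cross-checked against each other. A minimal sketch, assuming pandas 2.x (where the `level=` argument of `sum`/`max` is no longer available, so `groupby(level=0)` is used instead); the names `by_dummies`, `by_crosstab` and `by_columns` are illustrative only and do not come from the answers:

```python
import pandas as pd

rawdataset = [['apple', 'banana', 'carrot', 'daikon', 'egg'],
              ['apple', 'banana'],
              ['apple', 'banana', 'carrot'],
              ['daikon', 'egg', 'fennel'],
              ['apple', 'banana', 'daikon']]
dataset = pd.DataFrame(rawdataset)

# Long format: one entry per (row, position) cell. dropna() removes the
# padding cells of the shorter rows regardless of how stack() treats them.
u = dataset.stack().dropna()

# stack + str.get_dummies: one indicator row per cell, summed per original row.
by_dummies = u.str.get_dummies().groupby(level=0).sum()

# stack + crosstab: count each item per original row number.
by_crosstab = pd.crosstab(u.index.get_level_values(0), u)

# get_dummies + max: merge the duplicated column labels by transposing,
# grouping on the label, and transposing back.
by_columns = (pd.get_dummies(dataset, prefix="", prefix_sep="", dtype=int)
              .T.groupby(level=0).max().T)

print(by_dummies)
```

All three frames hold the same 0/1 indicators. Because `sort=False` is not passed here, every variant returns its columns in alphabetical order.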