DataFrame fails to preprocess the data in binary text classification with Naive Bayes and multiple features
I have a binary classification problem: I want to split my data into two groups, car companies and non-car companies. I crawled websites and extracted the following features (simplified):

domain: the website I crawled
asn: the autonomous system number of the server
robots: whether the website has a robots.txt enabled
email: the email address of the website owner
diff_days: the number of days the website has been online
html_title: the parsed HTML title of the website

I tried a baseline model where X is "html_title" and y is "carcompany", and reached an accuracy of 0.95, which is very good. I chose ComplementNB over MultinomialNB because I know the final data for classification will be imbalanced.
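For reference, the baseline looked roughly like this (a sketch with made-up sample data standing in for the crawled sites; only the text column is used):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB

# Tiny made-up sample standing in for the crawled data
df = pd.DataFrame({
    "html_title": ["audi bmw mercedes", "apple dell acer"] * 10,
    "carcompany": [1, 0] * 10,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["html_title"], df["carcompany"], test_size=0.25, random_state=53)

cv = CountVectorizer()
X_train_vec = cv.fit_transform(X_train)   # fit the vocabulary on training text only
X_test_vec = cv.transform(X_test)         # reuse that vocabulary for the test split

clf = ComplementNB()
clf.fit(X_train_vec, y_train)
score = clf.score(X_test_vec, y_test)
```

On this perfectly separable toy data the score is trivially 1.0; the point is only the shape of the workflow.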
You need to vectorize only the text column (html_title, I believe), not the entire X_train:
cv = CountVectorizer(stop_words=stopwords)
X_train_transformed = cv.fit_transform(X_train['html_title'])
textual_feature = pd.DataFrame(X_train_transformed.toarray(), columns=cv.get_feature_names_out())
Now add the other features that you think will improve the model's predictive power to this dataframe.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB

dummy = {"domain":["a.de","b.de","c.de","d.de","e.de","f.de","g.de","h.de","i.de","j.de","k.de","l.de","m.de","n.de","o.de","p.de","q.de","r.de","s.de","t.de","u.de","v.de","w.de","x.de","y.de","z.de","aa.de","bb.de","cc.de"],
"asn":["123","789","491","238","148","369","123","458","231","549","894","153","654","658","987","369","258","147","852","963","741","652","365","547","785","985","589","632","456"],
"robots":["True","Test","False","True","False","False","False","False","True","False","False","True","False","True","True","Test","False","True","True","True","False","True","True","False","False","True","False","False","False"],
"email":["@a.de","@b.de","@c.de","@d.de","@e.de","@f.de","@g.de","@h.de","@i.de","@j.de","@k.de","@l.de","@m.de","@n.de","@o.de","@p.de","@q.de","@r.de","@s.de","@t.de","@u.de","@v.de","@w.de","@x.de","@y.de","@z.de","@aa.de","@bb.de","@cc.de"],
"diff_days_stand":["0.9","0.8","0.7","0.6","0.5","0.4","0.3","0.2","0.1","0.9","0.8","0.7","0.6","0.5","0.9","0.8","0.7","0.6","0.5","0.4","0.3","0.2","0.1","0.9","0.8","0.7","0.6","0.5","0.1"],
"html_title":["audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes"]}
dummy = pd.DataFrame(dummy)
stopwords = ['a','ab','aber','ach','acht']
list1 = ['domain', 'asn', 'robots', 'email', 'diff_days_stand', 'html_title']
for i in list1:
    dummy[i] = dummy[i].astype(str)
train_t = dummy.loc[0:9,("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
train_f = dummy.loc[10:19,("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
rest = dummy.loc[20:30, ("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
train_t["carcompany"] = 1
train_f["carcompany"] = 0
train_tot = pd.concat([train_f, train_t])
train_tot = train_tot.drop(labels="index", axis=1)
y = train_tot["carcompany"]
X_train, X_test, y_train, y_test = train_test_split(train_tot, y , test_size=0.25, random_state=53)
cv = CountVectorizer(stop_words=stopwords)
X_train_transformed = cv.fit_transform(X_train)
X_test_transformed = cv.transform(X_test)
cb = ComplementNB(alpha=1.0, fit_prior=True, class_prior=None, norm=False)
cb.fit(X_train_transformed, y_train, sample_weight=None)
pred = cb.predict(X_test_transformed)
score = cb.score(X_test_transformed, y_test)
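Putting it all together, scikit-learn's ColumnTransformer can combine the vectorized text with the other columns inside one pipeline (a sketch; the column names mirror the dummy data above, and ComplementNB works here because counts and one-hot values are non-negative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up frame using the same column names as the dummy data above
df = pd.DataFrame({
    "robots": ["True", "False"] * 10,
    "html_title": ["audi bmw mercedes", "apple dell acer"] * 10,
    "carcompany": [1, 0] * 10,
})

pre = ColumnTransformer([
    # CountVectorizer expects a 1-D column, hence the bare string "html_title"
    ("text", CountVectorizer(), "html_title"),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["robots"]),
])

model = Pipeline([("pre", pre), ("nb", ComplementNB())])
model.fit(df[["robots", "html_title"]], df["carcompany"])
pred = model.predict(df[["robots", "html_title"]])
```

The pipeline also keeps the vectorizer fitted only on the training data whenever you call fit, which avoids leaking test vocabulary.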