DataFrame can't be preprocessed for binary text classification with Naive Bayes and multiple features


I have a binary classification problem: I want to split my data into two groups, car companies and non-car companies. I crawled websites and extracted the following features (simplified):

  • domain: the website I crawled
  • asn: the autonomous system number of the server
  • robots: whether the site has robots.txt enabled
  • email: the email address of the site owner
  • diff_days: how many days the site has been online
  • html_title: the parsed HTML title of the site
I tried a baseline model with X = "html_title" and y = "carcompany" and got an accuracy of 0.95, which is great. I chose ComplementNB over MultinomialNB because I know the final data to classify will be imbalanced. Now I want to add more features (columns) to the prediction, even though I know the conditional-independence assumption may be violated.

    However, I can't get the preprocessing of the DataFrame to work. After re-reading up on NB I now have doubts, so my questions are:

  • Can Naive Bayes be used with multiple features (columns)?
  • Can Naive Bayes be used for text classification with mixed feature types (strings, integers, booleans)? What if I convert them all to strings?
  • Is my code wrong? If so, where?

Thanks in advance :)

    The code below imports the packages, creates the data, converts it to strings (not sure whether converting ints and booleans to strings is correct) and prepares the training data. Then the problem occurs: the transformed X_train is a 4x4 sparse matrix, but it should be larger and contain the other features.

    Depending on what I try, I also get the following messages:

    ValueError: Found input variables with inconsistent numbers of samples: [7, 15]

    NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

    AttributeError: 'numpy.ndarray' object has no attribute 'lower'


    You need to vectorize the text data column (html_title, I believe), not the whole X_train:

    cv = CountVectorizer(stop_words=stopwords)
    X_train_transformed = cv.fit_transform(X_train['html_title'])

    textual_feature = pd.DataFrame(X_train_transformed.todense(), columns=cv.get_feature_names())
    
    Now add to this dataframe the other features that you think will improve the model's predictive power.

    # import packages
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import ComplementNB

    # create data
    dummy = {"domain":["a.de","b.de","c.de","d.de","e.de","f.de","g.de","h.de","i.de","j.de","k.de","l.de","m.de","n.de","o.de","p.de","q.de","r.de","s.de","t.de","u.de","v.de","w.de","x.de","y.de","z.de","aa.de","bb.de","cc.de"],
    "asn":["123","789","491","238","148","369","123","458","231","549","894","153","654","658","987","369","258","147","852","963","741","652","365","547","785","985","589","632","456"],
    "robots":["True","Test","False","True","False","False","False","False","True","False","False","True","False","True","True","Test","False","True","True","True","False","True","True","False","False","True","False","False","False"],
    "email":["@a.de","@b.de","@c.de","@d.de","@e.de","@f.de","@g.de","@h.de","@i.de","@j.de","@k.de","@l.de","@m.de","@n.de","@o.de","@p.de","@q.de","@r.de","@s.de","@t.de","@u.de","@v.de","@w.de","@x.de","@y.de","@z.de","@aa.de","@bb.de","@cc.de"],
    "diff_days_stand":["0.9","0.8","0.7","0.6","0.5","0.4","0.3","0.2","0.1","0.9","0.8","0.7","0.6","0.5","0.9","0.8","0.7","0.6","0.5","0.4","0.3","0.2","0.1","0.9","0.8","0.7","0.6","0.5","0.1"],
    "html_title":["audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "audi bmw mercedes", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes", "apple dell acer", "audi bmw mercedes"]}
    dummy = pd.DataFrame(dummy)
    stopwords = ['a','ab','aber','ach','acht']
    
    # convert data to string (not sure whether converting ints and booleans to strings is correct)
    list1 = ['domain', 'asn', 'robots', 'email', 'diff_days_stand', 'html_title']
    for i in list1:
        dummy[i] = dummy[i].astype(str)
    
    # prepare training data
    train_t = dummy.loc[0:9,("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
    train_f = dummy.loc[10:19,("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
    rest    = dummy.loc[20:30, ("domain", "asn", "robots", "email", "diff_days_stand", "html_title")].copy().reset_index()
    
    train_t["carcompany"] = 1
    train_f["carcompany"] = 0
    train_tot = pd.concat([train_f, train_t])  # DataFrame.append was removed in pandas 2.0
    train_tot = train_tot.drop(labels="index", axis=1)
    
    y = train_tot["carcompany"]
    X_train, X_test, y_train, y_test = train_test_split(train_tot, y , test_size=0.25, random_state=53)
    
    # this is where the problem occurs: the whole DataFrame is passed in
    cv = CountVectorizer(stop_words=stopwords)
    X_train_transformed = cv.fit_transform(X_train)
    X_test_transformed = cv.transform(X_test)
    
    cb = ComplementNB(alpha=1.0, fit_prior=True, class_prior=None, norm=False)
    cb.fit(X_train_transformed, y_train, sample_weight=None)
    
    pred = cb.predict(X_test_transformed)
    score = cb.score(X_test_transformed, y_test)
    
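For completeness, the whole flow can also be written as a single scikit-learn Pipeline with a ColumnTransformer, which applies a different preprocessing step per column and avoids concatenating matrices by hand. A sketch with made-up data (column names assumed from the question):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    "html_title": ["audi bmw mercedes", "apple dell acer",
                   "audi bmw", "dell acer"],
    "robots": ["True", "False", "True", "False"],
    "asn": ["123", "789", "123", "789"],
})
y_train = [1, 0, 1, 0]

preprocess = ColumnTransformer([
    # a vectorizer needs a 1-D input, so it gets a single column *name* (a string)
    ("text", CountVectorizer(), "html_title"),
    # encoders take 2-D input, so they get a *list* of column names
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["robots", "asn"]),
])

model = Pipeline([("pre", preprocess), ("nb", ComplementNB())])
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
```

Because the fitted vectorizer and encoder live inside the pipeline, `model.predict(X_test)` reuses them automatically, which sidesteps the NotFittedError and shape mismatches from the question.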