How to choose the chi-squared threshold for feature selection in Python


On this topic, I found the following code:

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    THRESHOLD_CHI = 5  # or whatever you like. You may try with
    # for threshold_chi in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] if you prefer
    # and measure the F1 scores

    # df is assumed to be a pandas DataFrame with 'text' and 'labels' columns
    X = df['text']
    y = df['labels']

    cv = CountVectorizer()
    cv_sparse_matrix = cv.fit_transform(X)

    # chi2 accepts sparse input directly, so no todense() call is needed
    chi2_stat, pval = chi2(cv_sparse_matrix, y)

    # Keep only the columns (features) whose chi2 score exceeds the threshold
    which_ones_to_keep = np.flatnonzero(chi2_stat > THRESHOLD_CHI)
    X_selected = cv_sparse_matrix[:, which_ones_to_keep]

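This code computes the chi-squared test and is supposed to keep the best features, i.e. those whose score exceeds the chosen threshold.

My question is: how do I choose the threshold for the chi-squared scores?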

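The chi-squared statistic has no fixed range, so it is hard to settle on a threshold in advance. Usually you rank the features by p-value instead. The logic is that a lower p-value is better, because it is stronger evidence that the feature and the target variable are dependent (we want to discard the independent features, i.e. those that are not predictive of the target). In that case you have to decide how many features to keep, which is a hyperparameter that you can tune by hand or, better, with a grid search.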

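Note that you do not have to do the selection by hand: sklearn already provides a transformer, `SelectKBest`, that keeps the k best features according to the chi-squared test. You can use it as follows: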

    from sklearn.feature_selection import SelectKBest, chi2

    # X must be a non-negative feature matrix (e.g. the CountVectorizer output), not the raw text
    X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
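To tune k by grid search rather than by hand, one option is to put the selector in a `Pipeline` and let `GridSearchCV` try several values. This is a minimal sketch: the toy corpus, the `LogisticRegression` classifier, the candidate values for k, and the F1 scoring are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration
docs = [
    "buy cheap pills now",
    "cheap watches buy now",
    "buy cheap tickets online now",
    "win cheap prizes buy now",
    "project meeting on monday",
    "monday meeting agenda attached",
    "notes from project meeting",
    "project status meeting monday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Selection happens inside the pipeline, so chi2 is recomputed on each
# training fold and the held-out fold never influences which features are kept
pipeline = Pipeline([
    ("vectorize", CountVectorizer()),
    ("select", SelectKBest(chi2)),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Candidate values of k; each must not exceed the vocabulary size of a training fold
param_grid = {"select__k": [2, 4, 6]}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(docs, labels)

print(search.best_params_, search.best_score_)
```

On a real data set you would use more folds and a wider range of k. The point is only that k is searched like any other hyperparameter.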

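However, if for whatever reason you want to rely only on the raw chi2 values, you can compute the minimum and maximum score across the features, split that interval into n steps, and test each candidate threshold with a grid search.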
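A rough sketch of that threshold sweep is below. The toy data, the single train/validation split, the `LogisticRegression` classifier, and `n_steps = 5` are assumptions made for illustration. The comparison uses `>=` rather than `>` so that the highest threshold still keeps at least one feature.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy training and validation data, invented purely for illustration
train_docs = [
    "buy cheap pills now",
    "cheap watches buy now",
    "win cheap prizes now",
    "project meeting on monday",
    "monday meeting agenda attached",
    "notes from project meeting",
]
train_labels = [1, 1, 1, 0, 0, 0]
val_docs = [
    "buy cheap tickets now",
    "project meeting notes",
    "cheap prizes now",
    "monday agenda attached",
]
val_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_val = vectorizer.transform(val_docs)

# Scores are computed on the training data only
chi2_stat, _ = chi2(X_train, train_labels)

# n_steps evenly spaced candidate thresholds between the min and max score
n_steps = 5
thresholds = np.linspace(chi2_stat.min(), chi2_stat.max(), n_steps)

results = []
for threshold in thresholds:
    keep = np.flatnonzero(chi2_stat >= threshold)  # column indices to keep
    clf = LogisticRegression(max_iter=1000).fit(X_train[:, keep], train_labels)
    score = f1_score(val_labels, clf.predict(X_val[:, keep]))
    results.append((threshold, keep.size, score))

best_threshold, n_kept, best_f1 = max(results, key=lambda r: r[2])
print(f"best threshold={best_threshold:.3f}, features kept={n_kept}, F1={best_f1:.3f}")
```

With real data it is safer to replace the single validation split with cross-validation, recomputing the chi2 scores inside each fold.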