Machine learning: feature selection based on statistical models


machine-learning statistics


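Problem statement:

I am working on a problem where I have to predict whether a customer will opt for a loan. I have converted all the available data types (object, int) to integers, and my data now looks as shown below.

The highlighted column is my target column, where
0 means Yes
1 means No
There are 47 independent columns in this dataset.
I want to run feature selection on these columns against my target column.
I started with a Z-test: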

```python
import numpy as np
import scipy.stats as st


def feature_selection_pvalue(df, col_name, samp_size=1000):
    # Note: both lists are re-created on every call, so each call reports one column only.
    relation_columns = []
    no_relation_columns = []
    H0 = 'There is no relation between target column and independent column'
    H1 = 'There is a relation between target column and independent column'

    # Draw a random sample of the column and compare its mean with the full-column mean.
    sample_data = df[col_name].sample(samp_size)
    samp_mean = sample_data.mean()
    pop_mean = df[col_name].mean()
    pop_std = df[col_name].std()
    print(pop_mean)
    print(pop_std)
    print(samp_mean)

    n = samp_size
    # z statistic of the sample mean against the column mean
    z = (samp_mean - pop_mean) / np.sqrt(pop_std * pop_std / n)
    print(z)
    # Two-sided p-value, so use |z|
    pval = 2 * (1 - st.norm.cdf(abs(z)))
    print('p value is === ' + str(pval))

    if pval < .05:
        # A small p-value rejects H0
        print('Alternate hypothesis is accepted --> ' + H1)
        relation_columns.append(col_name)
    else:
        print('Null hypothesis is not rejected for col ----> ' + H0 + ' ' + col_name)
        no_relation_columns.append(col_name)
    print('length of list === ' + str(len(relation_columns)))
    return relation_columns, no_relation_columns
```
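My questions are:

Is the z-test above a reliable way to do feature selection, given that the results are different on every run?

What would be a better way to select features in this case? If possible, please provide an example.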
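On the first question, a minimal sketch may help show why the results change from run to run. It uses only the standard library, and both the `age`-like column and the helper `z_test_pvalue` are hypothetical. The statistic compares the mean of a random sample of a column with the mean of that same column and never reads the target column, so the p-value appears to reflect sampling noise rather than any relation to the target.

```python
import random
from statistics import NormalDist, mean, stdev


def z_test_pvalue(values, pop_mean, pop_std, samp_size, rng):
    """Two-sided z-test of a random sample mean against the full-column mean."""
    sample = rng.sample(values, samp_size)  # sampling without replacement
    z = (mean(sample) - pop_mean) / (pop_std / samp_size ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(0)
# Hypothetical integer column standing in for one of the 47 features.
column = [rng.randint(18, 70) for _ in range(20000)]
pop_mean = mean(column)
pop_std = stdev(column)

# Same column, same test, five different random samples.
pvalues = [z_test_pvalue(column, pop_mean, pop_std, 1000, rng) for _ in range(5)]
print(pvalues)
```

Each run of the test draws a new sample, so the p-value moves even though the column and the target are unchanged. Under this reading, a column can land in either list purely by chance.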
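Are you able to use scikit-learn? It provides many examples and options for selecting your features.
Take the first one, the variance threshold (`VarianceThreshold`): it keeps only the columns whose values show some variance and drops columns that contain nothing but the same value.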

For reference, the question runs its function over every column like this:

```python
for items in df.columns:
    relation, no_relation = feature_selection_pvalue(df, items, 5000)
```

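A `VarianceThreshold` example:

```python
from sklearn.feature_selection import VarianceThreshold

X = df[['age', 'balance', ...]]  # select your columns
# For 0/1 columns the variance is p * (1 - p), so this threshold drops
# columns that hold the same value in more than 80% of the rows.
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_red = sel.fit_transform(X)
```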
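`VarianceThreshold` looks only at the features and never at the target column. Since the question asks for selection against the target, one of scikit-learn's other options may fit better: univariate selection with `SelectKBest`. The sketch below is an illustration on made-up data, not the original answer's method. The toy frame, its column names, and the choice of `mutual_info_classif` with `k=1` are all assumptions.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical integer-encoded data: one column tied to the target, two pure noise.
rng = np.random.default_rng(0)
n = 2000
target = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1  # 10% of rows disagree with the target
toy = pd.DataFrame({
    "informative": np.where(flip, 1 - target, target),
    "noise_a": rng.integers(0, 5, size=n),
    "noise_b": rng.integers(0, 3, size=n),
    "target": target,
})

X = toy.drop(columns="target")
y = toy["target"]

# Treat every feature as discrete, since the columns are integer codes.
score = partial(mutual_info_classif, discrete_features=True, random_state=0)
selector = SelectKBest(score_func=score, k=1).fit(X, y)

selected = list(X.columns[selector.get_support()])
scores = dict(zip(X.columns, selector.scores_))
print(selected)
print(scores)
```

Unlike the sampled z-test, the scores here are computed from all rows and from each feature together with the target, so repeated runs on the same data give the same ranking. On the real dataset, `k` would need to be chosen for the 47 columns, for example by cross-validating the downstream model.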