Are all features correctly selected and used in a Python classifier?

python machine-learning scikit-learn

I would like to know whether all features are used correctly when I use a classifier, for example:

random_forest_bow = Pipeline([
        ('rf_tfidf', Feat_Selection.countV),
        ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3))
        ])

# fit and evaluate the same pipeline object that was defined above
random_forest_bow.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_rf_bow = random_forest_bow.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_bow == DataPrep.test_news['Label'])

I am also considering other features in the model. I defined X and y as follows:

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

countV = CountVectorizer()
# fit the vectorizer on the training text only (df_train, not df.train)
train_count = countV.fit_transform(df_train['Text'].values)

My dataset looks like this:

Text                             is_it_capital?     is_it_upper?      contains_num?   Label
an example of text                      0                  0               0            0
ANOTHER example of text                 1                  1               0            1
What's happening?Let's talk at 5        1                  0               1            1

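I would also like to use is_it_capital?, is_it_upper?, and contains_num? as features, but since they have binary values (1 or 0 after encoding), I should apply BoW only to Text to extract the extra features. Maybe my question is obvious, but I am new to ML and not familiar with classifiers and encoding, so I would be grateful for any support and comments. Thanks.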

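You can certainly use your "extra" features such as is_it_capital?, is_it_upper?, and contains_num?. It seems you are struggling with how exactly to combine the two seemingly disparate feature sets. You can use something like ColumnTransformer, or a similar approach, to apply a different encoding strategy to each set of features. There is no reason you could not use your extra features together with whatever a text feature-extraction method (for example, your BoW approach) produces.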

import pandas as pd

df = pd.DataFrame({'text': ['this is some text', 'this is some MORE text', 'hi hi some text 123', 'bananas oranges'], 'is_it_upper': [0, 1, 0, 0], 'contains_num': [0, 0, 1, 0]})

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

# vectorize the 'text' column; pass the remaining columns through unchanged
transformer = ColumnTransformer([('text', CountVectorizer(), 'text')], remainder='passthrough')
X = transformer.fit_transform(df)

print(X)
[[0 0 0 1 0 0 1 1 1 0 0]
 [0 0 0 1 1 0 1 1 1 1 0]
 [1 0 2 0 0 0 1 1 0 0 1]
 [0 1 0 0 0 1 0 0 0 0 0]]

# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print(transformer.get_feature_names())
['text__123', 'text__bananas', 'text__hi', 'text__is', 'text__more', 'text__oranges', 'text__some', 'text__text', 'text__this', 'is_it_upper', 'contains_num']

More on your specific example:

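import pandas as pd
from sklearn.model_selection import train_test_split

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

# Need to use DenseTransformer to properly concatenate results
# from CountVectorizer and other transformer steps
from sklearn.base import BaseEstimator, TransformerMixin
class DenseTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray; todense() would return the legacy np.matrix type
        return X.toarray()

from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
])

# the column name must match the DataFrame ('Text')
transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# the transformed matrices are arrays, so wrap them in DataFrames that share the labels' index
df_train = pd.concat([pd.DataFrame(X_train, index=y_train.index), y_train], axis=1)
df_test = pd.concat([pd.DataFrame(X_test, index=y_test.index), y_test], axis=1)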

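What I have found useful is to do the transformations in a fully controlled way. For each group of columns I apply a specific transformation, then combine the transformations at the end. Here is an example: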

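from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# boolean features
boolean_features = ['is_it_capital?', 'is_it_upper?', 'contains_num?']
boolean_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
    ]
)

# text feature (a single column name, so CountVectorizer receives a 1-D series)
text_features = 'Text'
text_transformer = Pipeline(
    steps=[('vectorizer', CountVectorizer())]
)

# merge all pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('bool', boolean_transformer, boolean_features),
        ('text', text_transformer, text_features),
    ]
)

pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', RandomForestClassifier(n_estimators=300, n_jobs=3))
    ]
)

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42, stratify=y)

# we can train our model
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

# what is nice is that other utilities such as GridSearchCV become easy to use
params = {'model__n_estimators': [100, 200, 300], 'model__criterion': ['gini', 'entropy']}
clf = GridSearchCV(
    pipeline, cv=5, n_jobs=-1, param_grid=params, scoring='roc_auc'
)
clf.fit(X_train, y_train)

# predict on completely unseen data
clf.predict(X_test)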
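
The keys in the parameter grid above follow scikit-learn's `<step name>__<parameter>` convention (two underscores), where the step name is the one given to the Pipeline. The following is a minimal sketch, assuming a hypothetical two-step pipeline whose step names mirror the answer, that lists the valid keys through get_params():

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical pipeline for illustration only; the step names mirror the answer.
demo_pipeline = Pipeline(
    steps=[
        ('vectorizer', CountVectorizer()),
        ('model', RandomForestClassifier(n_estimators=10)),
    ]
)

# Every tunable parameter of a step is exposed as '<step>__<param>'.
model_keys = sorted(k for k in demo_pipeline.get_params() if k.startswith('model__'))
print(model_keys)
```

Any key printed here can be used in the param_grid passed to GridSearchCV; a misspelled key (for example a single underscore) raises an error at fit time.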

Update: if there are columns that need no transformation but must still be included, add remainder='passthrough':

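# Assuming: the code above, without the boolean transformer
# ...
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
    ], remainder='passthrough'
)
# ...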
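
To see what remainder='passthrough' changes, here is a small sketch on invented toy data (the rows are made up for illustration; the column names follow the question). By default ColumnTransformer uses remainder='drop', so columns that are not listed are silently discarded:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy data invented for illustration; column names follow the question.
toy = pd.DataFrame({
    'Text': ['an example of text', 'ANOTHER example of text'],
    'is_it_upper?': [0, 1],
    'contains_num?': [0, 0],
})

# Default remainder='drop': only the vectorized text columns survive.
dropped = ColumnTransformer([('text', CountVectorizer(), 'Text')])
# remainder='passthrough': the untouched columns are appended after the text columns.
kept = ColumnTransformer([('text', CountVectorizer(), 'Text')], remainder='passthrough')

n_dropped = dropped.fit_transform(toy).shape[1]
n_kept = kept.fit_transform(toy).shape[1]
print(n_dropped, n_kept)
```

The second count should exceed the first by exactly the number of passed-through columns, which is an easy check that the binary features actually reach the classifier.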

See the scikit-learn documentation and usage examples.

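Thank you for your answer, black site. Yes, I am struggling with this, because it is not completely clear to me how to include those features. From your example, it seems that I should apply the ColumnTransformer before splitting into train/test. Is that right?

Yes, you can define the ColumnTransformer up front, but fit it on the training data only. Use the fit or fit_transform method to apply the ColumnTransformer to the training dataset (df_train). Then use only the transform method to apply the transformer to the test dataset (df_test), since the fit/fit_transform call has already learned what the training dataset needs to look like. This is very similar to what you did with CountVectorizer in your example. The result of the ColumnTransformer process gives you the input matrices for training and testing (X_train, X_test). I have updated my answer above. Thanks.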
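
The fit/transform split described in the comments can be sketched as follows. This is only an illustration on invented toy frames (the rows are hypothetical), not the asker's data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy frames invented for illustration.
train = pd.DataFrame({'Text': ['some text here', 'more text'], 'contains_num?': [0, 0]})
test = pd.DataFrame({'Text': ['unseen words and text 5'], 'contains_num?': [1]})

transformer = ColumnTransformer(
    [('text', CountVectorizer(), 'Text')], remainder='passthrough'
)

# Learn the vocabulary from the training data only.
X_train = transformer.fit_transform(train)
# Reuse that vocabulary; words not seen during fit are ignored.
X_test = transformer.transform(test)

print(X_train.shape, X_test.shape)
```

Because the vocabulary is fixed at fit time, both matrices have the same number of columns, which is what the classifier requires. Calling fit_transform on the test data instead would build a different vocabulary and leak information from the test set.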