Too many coef_ values for LogisticRegression in a scikit-learn pipeline (tags: scikit-learn, logistic-regression, sklearn-pandas, coefficients)


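I am using sklearn-pandas inside a scikit-learn pipeline. To evaluate how much each feature contributes in a feature-union pipeline, I would like to look at the coefficients of the estimator (logistic regression). In the following code example, the three text columns a, b and c are vectorized and selected for X_train: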
import pandas as pd
import numpy as np
import pickle
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
np.random.seed(1)

data = pd.read_csv('https://pastebin.com/raw/WZHwqLWr')
#data.columns

X = data.copy()
y = data.result
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

mapper = DataFrameMapper([
        ('a', CountVectorizer()),
        ('b', CountVectorizer()),
        ('c', CountVectorizer())
])

pipeline = Pipeline([
        ('featurize', mapper),
        ('clf', LogisticRegression(random_state=1))
        ])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(abs(pipeline.named_steps['clf'].coef_))
#array([[0.3567311 , 0.3567311 , 0.46215153, 0.10542043, 0.3567311 ,
#        0.46215153, 0.46215153, 0.3567311 , 0.3567311 , 0.3567311 ,
#        0.3567311 , 0.46215153, 0.46215153, 0.3567311 , 0.46215153,
#        0.3567311 , 0.3567311 , 0.3567311 , 0.3567311 , 0.46215153,
#        0.46215153, 0.46215153, 0.3567311 , 0.3567311 ]])

print(len(pipeline.named_steps['clf'].coef_[0]))
#24

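Unlike a conventional analysis with several features, which usually returns as many coefficients as there are features, the DataFrameMapper setup returns a much larger coefficient matrix.
a) How should the 24 coefficients be interpreted?
b) What is the best way to get one coefficient value for each feature ("a", "b", "c")?
Desired output: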
a: coef_score (float)
b: coef_score (float)
c: coef_score (float)

Thanks, everyone!
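
Once you recover the fitted DataFrameMapper from the pipeline, you can access its contents through its features attribute. That lets you iterate over the CountVectorizer objects that were used to turn the strings into one-hot-style (token count) variables. Each CountVectorizer has a vocabulary_ attribute that tells you exactly which column each string maps to.
So you can pull each CountVectorizer out of the DataFrameMapper in order and, also in order, extract the strings that represent each column of the input matrix. This gives you a sequence that labels the coefficients exactly.
Based on your example, this snippet should do what you need, as described in detail above (let me know if you run into any errors and I will correct it based on your feedback):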
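The snippet this answer refers to did not survive in the text, so what follows is a minimal sketch of the idea, not the author's original code. It assumes that each fitted CountVectorizer exposes vocabulary_ as a dict mapping each term to its column index within that vectorizer's output. The helper name labels_from_vocabularies and the hard-coded vocabularies are hypothetical stand-ins so the example runs without the fitted pipeline.

```python
def labels_from_vocabularies(column_vocabularies):
    """Build one label per output column, in the order the columns are stacked.

    column_vocabularies: list of (column_name, vocabulary) pairs, where each
    vocabulary maps term -> column index, like CountVectorizer.vocabulary_.
    """
    labels = []
    for column_name, vocabulary in column_vocabularies:
        # Sort terms by their column index so labels follow the matrix layout.
        for term, _ in sorted(vocabulary.items(), key=lambda item: item[1]):
            labels.append(f"{column_name}_{term}")
    return labels


# Hypothetical vocabularies standing in for the fitted vectorizers.
example_vocabularies = [
    ("a", {"here": 3, "we": 5, "go": 2, "another": 0, "example": 1, "is": 4}),
    ("b", {"hello": 1, "are": 0}),
]

labels = labels_from_vocabularies(example_vocabularies)
print(labels)
```

With the fitted pipeline, the pairs would presumably come from something like [(col, vect.vocabulary_) for col, vect in mapper.features], assuming the vectorizers stored there are the fitted ones. Zipping the resulting labels with pipeline.named_steps['clf'].coef_[0] would then attach a name to each coefficient.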
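
Although your initial DataFrame really does contain only three feature columns, a, b and c, the sklearn-pandas DataFrameMapper() class applies sklearn's CountVectorizer() to the text of each of the columns a, b and c. This creates 24 features in total, which are then passed to your LogisticRegression() classifier. That is why you get an unlabeled list of 24 values when you access the classifier's .coef_ attribute.
However, matching these 24 coef_ scores to their original column (a, b or c) and then computing the average coefficient score per column is straightforward. Here is how: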
The original DataFrame looks like this:
             a                   b                c   result
2   here we go   hello here we are   this is a test        0
73  here we go   hello here we are   this is a test        0
...
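To see where the extra features come from, here is a rough pure-Python approximation of what CountVectorizer does to one row with its default settings: lowercase the text and keep tokens of two or more word characters. The tokenize helper is hypothetical and only imitates the default token pattern; it is not the library's implementation.

```python
import re

# Approximation of CountVectorizer's default analyzer: lowercase the text and
# keep tokens of two or more word characters (so the single letter "a" is dropped).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


# One row of the data shown above.
row = {"a": "here we go", "b": "hello here we are", "c": "this is a test"}

for column, text in row.items():
    # Each distinct token becomes its own feature column for that source column.
    print(column, sorted(set(tokenize(text))))
```

Each vectorizer's vocabulary is the union of such tokens over all training rows of its column, which is how three text columns end up as 6 + 9 + 9 = 24 features here.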

If we run the following line, we can see the list of all 24 features that DataFrameMapper/CountVectorizer() produces in the mapper object:
pipeline.named_steps['featurize'].transformed_names_

['a_another',
 'a_example',
 'a_go',
 'a_here',
 'a_is',
 'a_we',
 'b_are',
 'b_column',
 'b_content',
 'b_every',
 'b_has',
 'b_hello',
 'b_here',
 'b_text',
 'b_we',
 'c_can',
 'c_deal',
 'c_feature',
 'c_how',
 'c_is',
 'c_test',
 'c_this',
 'c_union',
 'c_with']

len(pipeline.named_steps['featurize'].transformed_names_)

24

Now let's compute the average coef score for the three groups of features that come from columns a/b/c:
col_names = list(data.drop(['result'], axis=1).columns.values)
vect_feats = pipeline.named_steps['featurize'].transformed_names_
clf_coef_scores = abs(pipeline.named_steps['clf'].coef_)

def get_avg_coef_scores(col_names, vect_feats, clf_coef_scores):
    scores = {}
    start_pos = 0
    for n in col_names:
        num_vect_feats = len([i for i in vect_feats if i[0] == n])
        end_pos = start_pos + num_vect_feats
        scores[n + '_avg_coef_score'] = np.mean(clf_coef_scores[0][start_pos:end_pos])
        start_pos = end_pos
    return scores

If we call the function we just wrote, we get the following output:
get_avg_coef_scores(col_names, vect_feats, clf_coef_scores)

{'a_avg_coef_score': 0.3499861323284858,
 'b_avg_coef_score': 0.40358462487685853,
 'c_avg_coef_score': 0.3918712435073411}
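One caveat: the function above matches features to columns by comparing only the first character of each name (i[0] == n), which works here because the columns are called a, b and c. A possible variant that groups by the full column name is sketched below. The function name avg_abs_coef_by_column and the sample names and values are hypothetical, and it assumes that a feature is named either "<column>_<term>" or, for a column passed through untransformed, just "<column>".

```python
from collections import defaultdict


def avg_abs_coef_by_column(col_names, feature_names, coefs):
    """Average |coefficient| per source column.

    A feature belongs to a column if its name equals the column name
    (untransformed column) or starts with "<column>_" (vectorized column).
    Longer column names are tried first so "ab" is not mistaken for "a".
    """
    ordered_cols = sorted(col_names, key=len, reverse=True)
    grouped = defaultdict(list)
    for name, coef in zip(feature_names, coefs):
        for col in ordered_cols:
            if name == col or name.startswith(col + "_"):
                grouped[col].append(abs(coef))
                break
    return {col: sum(vals) / len(vals) for col, vals in grouped.items()}


# Hypothetical feature names and coefficients, for illustration only.
feature_names = ["text_go", "text_here", "title_hello", "length"]
coefs = [0.4, -0.2, 0.3, 0.1]

scores = avg_abs_coef_by_column(["text", "title", "length"], feature_names, coefs)
print(scores)
```

With the pipeline from the question, the call would be avg_abs_coef_by_column(col_names, vect_feats, pipeline.named_steps['clf'].coef_[0]).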

If we want to check which of the 24 coefficient scores belongs to each created feature, we can use the following dictionary:
{key:clf_coef_scores[0][i] for i, key in enumerate(vect_feats)}

{'a_another': 0.3567310993987888,
 'a_example': 0.3567310993987888,
 'a_go': 0.4621515317244458,
 'a_here': 0.10542043232565701,
 'a_is': 0.3567310993987888,
 'a_we': 0.4621515317244458,
 'b_are': 0.4621515317244458,
 'b_column': 0.3567310993987888,
 'b_content': 0.3567310993987888,
 'b_every': 0.3567310993987888,
 'b_has': 0.3567310993987888,
 'b_hello': 0.4621515317244458,
 'b_here': 0.4621515317244458,
 'b_text': 0.3567310993987888,
 'b_we': 0.4621515317244458,
 'c_can': 0.3567310993987888,
 'c_deal': 0.3567310993987888,
 'c_feature': 0.3567310993987888,
 'c_how': 0.3567310993987888,
 'c_is': 0.4621515317244458,
 'c_test': 0.4621515317244458,
 'c_this': 0.4621515317244458,
 'c_union': 0.3567310993987888,
 'c_with': 0.3567310993987888}

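Comments:

Thank you very much. When I apply the code, it seems to give scores for the words, but not for the feature columns (a, b, c)? Still, that is an interesting side question if it is possible: how can I pick out specific words from the example above?

After CountVectorizer encodes the levels as individual variables, the feature columns no longer have coefficients. The newly created features do, and they are binary variables, each representing one word. You can still collect all the binary variables derived from one of the original features, but there is no meaningful way to express them as a single coefficient. Even if you compute the mean of the word coefficients associated with a feature, or just sum them, or take the difference between the highest and lowest value, that is not useful.

Is there also a way to use a list instead of a dict? In the end I need to normalize all the scores so that they are relative. Also, since the columns are scaled differently, is there a way to correct for that so the coefficients are comparable?

To normalize all the scores, you have to put sklearn.preprocessing.StandardScaler between 'featurize' and 'clf'. That way you statistically standardize the different word-count variables and make their coefficients comparable. Another option is to have sklearn.feature_extraction.text.CountVectorizer use the parameter binary=True, in which case you do not count occurrences at all and just use a pure binary indicator for each word.

I just saw that you added start_pos = end_pos to the function. That makes sense and is indeed essential for the loop.

Thanks, yes, it makes sure we use the right coefficient values to compute the average coefficients for columns a, b and c.

Hi @James Dellinger, one last question: this does not work when the DataFrameMapper has features that are not count-vectorized, right? I mean, for example: 'mapper = DataFrameMapper([('a', CountVectorizer()), ('b', CountVectorizer()), ('c', None)])'. It looks like 'clf_coef_scores' outputs only one value per feature column.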
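On the question of using a list instead of a dict and making the scores relative, one simple option is sketched below. It only rescales the scores so they sum to 1 and sorts them; as noted in the comments, it does not make a per-column aggregate more meaningful, and it does not replace standardizing the inputs. The helper name relative_scores is hypothetical.

```python
def relative_scores(named_scores):
    """Return (name, share) pairs sorted by share, where shares sum to 1."""
    total = sum(named_scores.values())
    shares = [(name, score / total) for name, score in named_scores.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Per-column averages copied from the answer's output, rounded to four decimals.
avg_scores = {"a": 0.3500, "b": 0.4036, "c": 0.3919}

ranked = relative_scores(avg_scores)
for name, share in ranked:
    print(f"{name}: {share:.3f}")
```

The same helper works on the per-word dictionary from the answer if a ranking of individual words is what is needed.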