NLP in Python: getting word names from SelectKBest after vectorization

python nlp

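I can't seem to find an answer to my exact question. Can anyone help?

A simplified description of my DataFrame ("df"): it has two columns. One is a block of text ("Notes"), and the other is a binary variable indicating whether the resolution time was above average ("y").

I built a bag-of-words representation of the text: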

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    matrix = vectorizer.fit_transform(df["Notes"])

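My matrix is 6290 x 4650. Getting the word names (i.e. the feature names) from the vectorizer is no problem.

Next, I want to know which of these 4650 words correlate most strongly with an above-average resolution time, and to reduce the matrix I might use in a predictive model. I ran a chi-squared test to find the 20 most important words: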

    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    selector = SelectKBest(chi2, k=20)
    selector.fit(matrix, y)
    top_words = selector.get_support().nonzero()

    # Pick only the most informative columns in the data.
    chi_matrix = matrix[:, top_words[0]]

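Now I'm stuck. How do I get the words from this reduced matrix ("chi_matrix")? What are my feature names? I tried this:

    chi_matrix.feature_names[selector.get_support(indices=True)].tolist()

or

    chi_matrix.feature_names[features.get_support()]

This gives me an error saying that feature_names cannot be found. What am I missing?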

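I recently ran into a similar problem, but I wasn't restricted to the 20 most relevant words. Instead, I could select the words whose chi score was above a set threshold. I'll give you the approach I used for that second task. The reason this is preferable to simply taking the top n words by chi score is that those 20 words may have very low scores and therefore contribute almost nothing to the classification task.

Here is how I did it for a binary classification task: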

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    # Keep terms whose chi2 score exceeds this value (or whatever you like).
    # You may also loop over several values, e.g.
    # for threshold_chi in range(1, 11), and measure the F1 scores.
    THRESHOLD_CHI = 5

    X = df['text']
    y = df['labels']

    cv = CountVectorizer()
    cv_sparse_matrix = cv.fit_transform(X)

    # chi2 accepts the sparse matrix directly, so no dense copy is needed
    # (recent scikit-learn versions reject the np.matrix returned by .todense()).
    chi2_stat, pval = chi2(cv_sparse_matrix, y)

    # Boolean row vector: True where a term's score is above the threshold.
    chi2_reshaped = chi2_stat.reshape(1, -1)
    which_ones_to_keep = chi2_reshaped > THRESHOLD_CHI
    # Repeat that row once per feature: shape (n_features, n_features).
    which_ones_to_keep = np.repeat(which_ones_to_keep, repeats=which_ones_to_keep.shape[1], axis=0)
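The result is a matrix with ones for the terms whose chi score is above the threshold and zeros for the terms whose chi score is below it. This matrix can then be combined via `np.dot` with either a CV matrix or a tf-idf matrix, and the result passed to a classifier's `fit` method.

If you do this, the columns of the `which_ones_to_keep` matrix correspond to the columns of the `CountVectorizer` object, so you can determine which terms are relevant to a given label by comparing the non-zero columns of `which_ones_to_keep` with the indices of the vectorizer's feature names. Or you can forget about that and pass it straight to a classifier.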
After figuring out what I actually wanted to do (thanks, Daniel) and doing some more research, I found a couple of other ways to accomplish my goal.

Way 1:

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(lowercase=True, stop_words='english')
    X = vectorizer.fit_transform(df["Notes"])

    from sklearn.feature_selection import chi2
    chi2score = chi2(X, df['AboveAverage'])[0]

    # get_feature_names() was removed in scikit-learn 1.2;
    # use get_feature_names_out() on 1.0 and later.
    wscores = zip(vectorizer.get_feature_names_out(), chi2score)
    # Sort (word, score) pairs by ascending score and keep the last 20.
    wchi2 = sorted(wscores, key=lambda x: x[1])
    topchi2 = zip(*wchi2[-20:])
    show = list(topchi2)
    show

Way 2: this is the one I used, because it was the easiest for me to understand and it produces a nice output listing each word with its chi2 score and p-value. It comes from another thread here.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    vectorizer = CountVectorizer(lowercase=True, stop_words='english')
    X = vectorizer.fit_transform(df["Notes"])

    y = df['AboveAverage']

    # Select 10 features with highest chi-squared statistics
    chi2_selector = SelectKBest(chi2, k=10)
    chi2_selector.fit(X, y)

    # Look at scores returned from the selector for each feature
    # (get_feature_names_out() replaces the removed get_feature_names())
    chi2_scores = pd.DataFrame(list(zip(vectorizer.get_feature_names_out(), chi2_selector.scores_, chi2_selector.pvalues_)),
                               columns=['ftr', 'score', 'pval'])
    chi2_scores

Thanks for your help, Daniel. Selecting terms by chi score, rather than just taking an arbitrary number of terms, makes sense. The code you provided works great, and I now have a 4650x4650 "which_ones_to_keep" matrix. I need to spend some time working through the steps to determine which terms were kept. This kind of work is new to my company, so I feel I need to be able to explain more about how the model was created.

You're welcome. The columns of the output matrix map one-to-one to the columns of the CountVectorizer, so the latter's feature names are unique identifiers for the former: same indices and everything. Also, the result is provided as a matrix rather than just a vector because I needed it in matrix form for np.dot with the tf-idf or CV matrix. If you don't have a classifier to train and only want to extract the best predictors for a given label, you can use the vector.