Python: How can I hand-engineer the features of TfidfVectorizer in scikit-learn?


python scikit-learn nlp


I am trying to cluster documents by keyword. I am using the following code to generate the tf-idf matrix:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=.8, max_features=1000,
                             min_df=0.07, stop_words='english',
                             use_idf=True, tokenizer=tokenize_and_stem, 
                             ngram_range=(1,2))

tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print(tfidf_matrix.shape)

returns

(567, 209)

which means there are 567 documents, each represented as a mixture of the 209 feature terms detected by the scikit-learn TF-IDF vectorizer.

Now, I used

terms = tfidf_vectorizer.get_feature_names()

to get the list of terms. Running

print(len(terms))

gives 209.

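Many of these words are unnecessary for the task, and they add noise to the clustering. I went through the list manually and extracted the meaningful feature names, producing a new terms list. Now, running

print(len(terms))

gives 67.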

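However, running

tfidf_vectorizer.fit_transform(documents)

still gives a shape of

(567, 209)

which means that fit_transform(documents) still uses the noisy list of 209 terms rather than the hand-picked list of 67 terms.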

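How can I make

tfidf_vectorizer.fit_transform(documents)

run with the list of 67 hand-picked terms? I suspect this may require me to add at least one function to the scikit-learn package on my machine, right?

Any help is much appreciated. Thanks.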

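I did not solve the problem at the level I asked for in the question. However, I came up with a hacky solution that works for now.

I can use my hand-crafted term set by doing the following:

1) From

terms = tfidf_vectorizer.get_feature_names()

print out terms.

2) Make a list called not_needed_words_list and manually fill it with the unwanted terms from step 1.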

3) At the top of my script, I import the NLTK stop words:

import nltk
stopwords = nltk.corpus.stopwords.words('english')

and add my list of unneeded terms to them:

# Append every hand-picked unwanted term to the stop-word list.
stopwords.extend(not_needed_words_list)
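The remaining step, which the snippet above does not show, is to pass the extended list to the vectorizer through stop_words. A minimal sketch of that step, under some assumptions: the corpus is a toy one, base_stopwords is a short literal list standing in for the NLTK list (so the example runs without downloading the corpus), and the contents of not_needed_words_list are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; stands in for the real `documents`.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat played outside",
]

# Stand-in for nltk.corpus.stopwords.words('english') (assumption).
base_stopwords = ["the", "a", "and", "on"]
# Hand-picked unwanted terms (illustrative values).
not_needed_words_list = ["sat", "played"]

stopwords = base_stopwords + not_needed_words_list

# The extended list replaces stop_words='english'.
vectorizer = TfidfVectorizer(stop_words=stopwords)
matrix = vectorizer.fit_transform(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(matrix.shape)
```

Two caveats seem worth keeping in mind. Stop words are matched against single tokens before n-grams are built, so a bigram placed in the list is not removed this way. And with a stemming tokenizer such as tokenize_and_stem, the entries probably need to be in their stemmed form to match.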

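There are two ways:

If you have already identified a list of stop words (what you call "unnecessary for the task"), just pass them to the stop_words parameter of TfidfVectorizer so that they are ignored when the bag of words is built.
Note that once stop_words is set to a custom list, the predefined English stop words are no longer used. If you want to combine the predefined English list with additional stop words, just concatenate the two lists: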

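# ENGLISH_STOP_WORDS is exposed by sklearn.feature_extraction.text;
# the old sklearn.feature_extraction.stop_words module has been removed.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# ENGLISH_STOP_WORDS is a frozenset, so convert it before concatenating.
stop_words = list(ENGLISH_STOP_WORDS) + ['your', 'additional', 'stopwords']
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words) # add your other params here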
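To make the point about a custom list replacing the built-in one concrete, here is a small sketch comparing the two configurations. The corpus and the extra stop word are made up for illustration; only ENGLISH_STOP_WORDS and TfidfVectorizer come from scikit-learn.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Toy corpus (illustrative).
docs = [
    "the banana and the giraffe",
    "the giraffe in the garden",
]

# Hypothetical hand-picked stop word.
extra = ["garden"]

# Custom list only: the built-in English stop words are NOT applied.
custom_only = TfidfVectorizer(stop_words=extra).fit(docs)

# Built-in list plus the custom additions.
combined = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS) + extra).fit(docs)

print(sorted(custom_only.vocabulary_))
print(sorted(combined.vocabulary_))
```

With the custom list alone, common words such as "the" stay in the vocabulary; with the combined list, only the content words remain.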

If you have a fixed vocabulary and only want those words to be counted (i.e. your terms list), just set the vocabulary parameter of TfidfVectorizer:
tfidf_vectorizer = TfidfVectorizer(vocabulary=terms) # add your other params here
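A short sketch of the fixed-vocabulary approach, assuming a toy corpus and a made-up terms list. It shows that the matrix gets exactly one column per supplied term, in the order given. One assumption to flag: because the list contains a bigram, ngram_range is set to cover bigrams; otherwise that column would stay empty.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative).
docs = [
    "machine learning clusters documents by keyword",
    "keyword extraction helps machine learning",
    "the weather is nice today",
]

# Hand-picked terms (illustrative); they must match what the analyzer
# produces, e.g. stemmed forms if a stemming tokenizer is used.
terms = ["machine learning", "keyword", "weather"]

vectorizer = TfidfVectorizer(vocabulary=terms, ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# One column per hand-picked term, in the order supplied.
print(matrix.shape)
print(vectorizer.vocabulary_)
```

As far as I understand, when a vocabulary is supplied the pruning options (max_df, min_df, max_features) no longer affect which terms are kept, since the term set is already fixed. Also note that in scikit-learn 1.0 and later, get_feature_names() has been replaced by get_feature_names_out().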


Alternatively, you can pass your selected features to TfidfVectorizer through its "vocabulary" parameter.