Python 3.x: How to use tf-idf values in k-means clustering

Tags: python-3.x, nlp, k-means, tf-idf, tfidfvectorizer

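I am using the scikit-learn library to run K-means clustering on TF-IDF features. I know that K-means builds clusters from distances, and that a distance is computed between points given as (x-axis value, y-axis value), but tf-idf is a single number. My question is: how does K-means clustering turn a tf-idf value into an (x, y) value?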

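tf-idf is not a single value (i.e. a scalar). For each document it returns a vector, and each value in that vector corresponds to one word in the vocabulary.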

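from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

sent1 = "the quick brown fox jumps over the lazy brown dog"
sent2 = "mr brown jumps over the lazy fox"

corpus = [sent1, sent2]
# `input` only accepts 'content', 'file' or 'filename', so the corpus is
# passed to fit_transform() instead and the default ('content') is used.
vectorizer = TfidfVectorizer()

# X is a sparse matrix: one row per document, one column per vocabulary word.
X = vectorizer.fit_transform(corpus)
print(X.todense())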
[out]:

matrix([[0.50077266, 0.35190925, 0.25038633, 0.25038633, 0.25038633,
         0.        , 0.25038633, 0.35190925, 0.50077266],
        [0.35409974, 0.        , 0.35409974, 0.35409974, 0.35409974,
         0.49767483, 0.35409974, 0.        , 0.35409974]])
It returns a 2-D matrix in which the rows represent the sentences and the columns represent the vocabulary words.

>>> vectorizer.vocabulary_
{'the': 8,
 'quick': 7,
 'brown': 0,
 'fox': 2,
 'jumps': 3,
 'over': 6,
 'lazy': 4,
 'dog': 1,
 'mr': 5}
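
To make the row/column layout concrete, here is a small sketch that pairs every non-zero weight with the word its column stands for. It rebuilds the same two-sentence corpus so it runs on its own; the names `words` and `weights_per_doc` are only illustrative, and the weights are rounded purely for display.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the quick brown fox jumps over the lazy brown dog",
    "mr brown jumps over the lazy fox",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).toarray()

# Order the words by their column index so that words[i] labels column i of X.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One {word: weight} dict per document, keeping only the non-zero entries.
weights_per_doc = [
    {word: round(float(w), 4) for word, w in zip(words, row) if w > 0}
    for row in X
]

for doc, weights in zip(corpus, weights_per_doc):
    print(doc)
    print(weights)
```

Each document is therefore a point with as many coordinates as there are vocabulary words (nine here), not a point with just an x and a y.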
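So when K-means computes the distance or similarity between two documents, it compares two rows of this matrix. For example, assuming the similarity is simply the dot product of the two rows: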

import numpy as np
# Flatten each sparse row to a 1-D array so the dot product is a plain scalar.
vector1 = X[0].toarray().ravel()
vector2 = X[1].toarray().ravel()
float(np.dot(vector1, vector2))
[out]:

0.7092938737640962
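
The dot product is only a stand-in here: scikit-learn's KMeans works with Euclidean distance. The two are closely related in this case, because TfidfVectorizer L2-normalizes every row by default (`norm='l2'`), so for two rows the squared Euclidean distance equals 2 - 2 * (dot product). A minimal sketch that checks this on the same two sentences (the variable names are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the quick brown fox jumps over the lazy brown dog",
    "mr brown jumps over the lazy fox",
]
# Default settings: every row of X has unit L2 norm.
X = TfidfVectorizer().fit_transform(corpus).toarray()
v1, v2 = X[0], X[1]

dot = float(np.dot(v1, v2))
squared_dist = float(np.sum((v1 - v2) ** 2))

print(dot)            # similarity between the two rows
print(squared_dist)   # squared Euclidean distance between the two rows
print(2 - 2 * dot)    # matches squared_dist up to floating-point error
```

A larger dot product therefore means a smaller Euclidean distance, which is why the dot-product picture above is a fair intuition for what K-means does with these vectors.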
Chris Potts has a good tutorial on how vector space models such as TF-IDF are built.
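
To tie this back to the original question, the tf-idf matrix can be handed to KMeans as it is, with no conversion to (x, y) pairs. The sketch below is an illustration, not part of the original answer: the two extra sentences in the corpus are made up, and the parameter values (`n_clusters=2`, `n_init=10`, `random_state=0`) are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# The last two sentences are hypothetical additions so there is something to cluster.
corpus = [
    "the quick brown fox jumps over the lazy brown dog",
    "mr brown jumps over the lazy fox",
    "k-means assigns every point to its nearest centroid",
    "each centroid is the mean of the points assigned to it",
]

# Sparse matrix: one row per document, one column per vocabulary word.
X = TfidfVectorizer().fit_transform(corpus)

# KMeans accepts the sparse matrix directly; each row is one point.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                         # one cluster id per document
print(kmeans.cluster_centers_.shape)  # (n_clusters, vocabulary size)
```

Each centroid is itself a vector with one coordinate per vocabulary word, so the clustering happens in that high-dimensional space rather than in a 2-D plane.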