Creating a TF-IDF matrix in Python 3.6


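I have 100 documents (each document is a simple list of the words in that document). Now I want to create a TF-IDF matrix so that I can build a small ranked word search. I tried using TfidfVectorizer but got lost in the syntax. Any help would be appreciated. Regards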

Edit: I converted each list to a string and appended it to a parent list:

vectorizer = TfidfVectorizer(vocabulary=word_set)
matrix = vectorizer.fit_transform(doc_strings)
print(matrix)

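Here word_set is the set of possible distinct words, and doc_strings is a list that contains each document as a single string. However, when I print the matrix, I get output like this: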

  (0, 839)  0.299458532286
  (0, 710)  0.420878518454
  (0, 666)  0.210439259227
  (0, 646)  0.149729266143
  (0, 550)  0.210439259227
  (0, 549)  0.210439259227
  (0, 508)  0.210439259227
  (0, 492)  0.149729266143
  (0, 479)  0.149729266143
  (0, 425)  0.149729266143
  (0, 401)  0.210439259227
  (0, 332)  0.210439259227
  (0, 310)  0.210439259227
  (0, 253)  0.149729266143
  (0, 216)  0.210439259227
  (0, 176)  0.149729266143
  (0, 122)  0.149729266143
  (0, 119)  0.210439259227
  (0, 111)  0.149729266143
  (0, 46)   0.210439259227
  (0, 26)   0.210439259227
  (0, 11)   0.149729266143
  (0, 0)    0.210439259227
  (1, 843)  0.0144007295367
  (1, 842)  0.0288014590734
  (1, 25)   0.0144007295367
  (1, 24)   0.0144007295367
  (1, 23)   0.0432021886101
  (1, 22)   0.0144007295367
  (1, 21)   0.0288014590734
  (1, 20)   0.0288014590734
  (1, 19)   0.0288014590734
  (1, 18)   0.0432021886101
  (1, 17)   0.0288014590734
  (1, 16)   0.0144007295367
  (1, 15)   0.0144007295367
  (1, 14)   0.0432021886101
  (1, 13)   0.0288014590734
  (1, 12)   0.0144007295367
  (1, 11)   0.0102462376715
  (1, 10)   0.0144007295367
  (1, 9)    0.0288014590734
  (1, 8)    0.0288014590734
  (1, 7)    0.0144007295367
  (1, 6)    0.0144007295367
  (1, 5)    0.0144007295367
  (1, 4)    0.0144007295367
  (1, 3)    0.0144007295367
  (1, 2)    0.0288014590734
  (1, 1)    0.0144007295367

Is this correct? If so, how do I look up the rank of a given word in a specific document?

Your code works fine. Let me give an example with a couple of sentences, where each sentence plays the role of one document. Hope this helps.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend", 
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)
print(matrix)

As we know, fit_transform returns a tf-idf-weighted document-term matrix. The print() statement outputs the following:

  (0, 2)    0.379303492809
  (0, 6)    0.379303492809
  (0, 7)    0.379303492809
  (0, 8)    0.533097824526
  (0, 9)    0.533097824526
  (1, 3)    0.342619853089
  (1, 5)    0.342619853089
  (1, 4)    0.342619853089
  (1, 0)    0.342619853089
  (1, 11)   0.342619853089
  (1, 10)   0.342619853089
  (1, 1)    0.342619853089
  (1, 2)    0.243776847332
  (1, 6)    0.243776847332
  (1, 7)    0.243776847332

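So how do we interpret this matrix? Each row shows a tuple (x, y) and a value. The tuple holds the document number (here, the sentence number) and the feature number, and the value is the tf-idf weight of that feature in that document.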

To understand this better, let's print the list of features (here, the features are words) together with their indices:

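# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead.
for i, feature in enumerate(vectorizer.get_feature_names_out()):
    print(i, feature)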

It outputs:

0 can
1 don
2 friend
3 from
4 get
5 help
6 my
7 stackoverflow
8 to
9 welcome
10 worry
11 you

So the sentence "welcome to stackoverflow my friend" is converted to the following:

(0, 2)  0.379303492809
(0, 6)  0.379303492809
(0, 7)  0.379303492809
(0, 8)  0.533097824526
(0, 9)  0.533097824526

For example, the first two rows can be read as follows:

0 = sentence no.
2 = word index (index of the word `friend`)
0.379303492809 = tf-idf weight

0 = sentence no.
6 = word index (index of the word `my`)
0.379303492809 = tf-idf weight
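The same lookup can be done by word rather than by index. The following is a minimal sketch, assuming the same two-sentence corpus and default TfidfVectorizer settings as above; the helper name tfidf_weight is hypothetical and only for illustration. It uses the fitted vectorizer's vocabulary_ mapping (word to column index) to read one cell of the sparse matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)


def tfidf_weight(word, doc_index):
    """Hypothetical helper: tf-idf weight of `word` in one document.

    Returns 0.0 if the word is not in the vocabulary or not in the document.
    """
    # vocabulary_ maps each word to its column index in the matrix.
    col = vectorizer.vocabulary_.get(word)
    if col is None:
        return 0.0
    return float(matrix[doc_index, col])


print(tfidf_weight("friend", 0))   # weight stored at (0, 2)
print(tfidf_weight("welcome", 0))  # weight stored at (0, 9)
print(tfidf_weight("welcome", 1))  # word is absent from the second sentence
```

Words that are missing from a document have no stored entry in the sparse matrix, so the lookup simply yields zero for them.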

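From the tf-idf values you can see that the words welcome and to should be ranked higher than the other words in the first sentence (sentence 0). They occur only in that sentence, so their inverse document frequency is higher than that of friend, my and stackoverflow, which appear in both sentences.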

You can extend this example to look up the rank of a given word in a particular sentence or document, as needed.
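One possible way to do that extension is sketched below. It assumes the same corpus and default vectorizer as above. The function names ranked_words and word_rank are hypothetical, and the ranking rule (rank = 1 + number of words in the document with a strictly larger weight, so tied words share a rank) is just one reasonable choice, not something the question prescribes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# Words ordered by column index, so words[j] is the word for column j.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def ranked_words(doc_index):
    """Hypothetical helper: (word, weight) pairs for one document, highest weight first."""
    row = matrix[doc_index].toarray().ravel()
    present = [(words[j], float(row[j])) for j in np.flatnonzero(row)]
    # Sort by descending weight; break ties alphabetically.
    return sorted(present, key=lambda pair: (-pair[1], pair[0]))


def word_rank(word, doc_index):
    """Hypothetical helper: 1-based rank of `word` in a document, or None if absent."""
    weights = dict(ranked_words(doc_index))
    if word not in weights:
        return None
    # Tied words share a rank: count only strictly larger weights.
    return 1 + sum(1 for w in weights.values() if w > weights[word])


for word, weight in ranked_words(0):
    print(word, round(weight, 4))
print(word_rank("welcome", 0))
print(word_rank("friend", 0))
print(word_rank("help", 0))
```

For 100 documents, converting a single row to a dense array as done here is cheap. For a much larger vocabulary it may be preferable to work on the row's sparse entries directly.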

Thank you for the explanation.