Python: How to compute cosine similarity from already-computed TFIDF scores

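I need to compute the cosine similarity between documents using TFIDF scores that have already been computed.

Normally I would use something like a TFIDF vectorizer, which builds a document/term matrix and computes the TFIDF scores as it goes. I cannot apply that here, because it would recompute the TFIDF scores. That would be incorrect, because the documents have already gone through a lot of preprocessing, including bag-of-words and IDF filtering (I won't explain why; it's too long).

Illustrative input CSV file: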

Doc, Term,    TFIDF score
1,   apples,  0.3
1,   bananas, 0.7
2,   apples,  0.1
2,   pears,   0.9
3,   apples,  0.6
3,   bananas, 0.2
3,   pears,   0.2
I need to produce the matrix that TFIDFVectorizer would normally generate, for example:

  | apples | bananas | pears
1 | 0.3    | 0.7     | 0
2 | 0.1    | 0       | 0.9
3 | 0.6    | 0.2     | 0.2 
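... so that I can compute the cosine similarity between documents.

I am using Python 2.7, but suggestions for other solutions or tools are welcome. I cannot easily switch to Python 3.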

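Edit:

This is not about transposing a numpy array. It is about mapping the TFIDF scores onto a document/term matrix, using the already-tokenized terms, and filling the missing values with 0.

Here is an inefficient hack, which I will leave here in case it helps someone else. Other suggestions are welcome.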

import csv

from sklearn.metrics.pairwise import cosine_similarity

# INPUT_FILE_PATH and TOTAL_NUMBER_OF_BOOKS are assumed to be defined elsewhere.

def calculate_cosine_distance():
    unique_terms = get_unique_terms_as_list()

    # One row per book, one column per term, initialised to 0
    tfidf_matrix = [[0.0] * len(unique_terms) for _ in range(TOTAL_NUMBER_OF_BOOKS)]

    with open(INPUT_FILE_PATH, mode='r') as infile:
        reader = csv.reader(infile.read().splitlines(), quoting=csv.QUOTE_NONE)

        # Ignore header row
        next(reader)

        for rows in reader:
            book = int(rows[0]) - 1  # To make it a zero-indexed array
            term_index = unique_terms.index(rows[1].strip())
            # Store the score as a number, not as the string read from the CSV
            tfidf_matrix[book][term_index] = float(rows[2])

    # Similarity between the first book and every book (including itself)
    print(cosine_similarity(tfidf_matrix[0:1], tfidf_matrix))

def get_unique_terms_as_list():
    unique_terms = set()
    with open(INPUT_FILE_PATH, mode='r') as infile:
        reader = csv.reader(infile.read().splitlines(), quoting=csv.QUOTE_NONE)
        # Skip header
        next(reader)
        for rows in reader:
            # Strip the padding around the term so lookups match
            unique_terms.add(rows[1].strip())

    return list(unique_terms)

I suggest using scipy.sparse:

from scipy.sparse import csr_matrix, coo_matrix
from sklearn.metrics.pairwise import cosine_similarity

input="""Doc, Term,    TFIDF score
1,   apples,  0.3
1,   bananas, 0.7
2,   apples,  0.1
2,   pears,   0.9
3,   apples,  0.6
3,   bananas, 0.2
3,   pears,   0.2"""

voc = {}

# sparse matrix representation: the coefficient
# with coordinates (rows[i], cols[i]) contains value data[i]
rows, cols, data = [], [], []

for line in input.split("\n")[1:]: # dismiss header

    doc, term, tfidf = line.replace(" ", "").split(",")

    rows.append(int(doc))

    # map each vocabulary item to an int
    if term not in voc:
        voc[term] = len(voc)

    cols.append(voc[term])
    data.append(float(tfidf))

doc_term_matrix = coo_matrix((data, (rows, cols)))

# compressed sparse row matrix (type of sparse matrix with fast row slicing)
sparse_row_matrix = doc_term_matrix.tocsr()

print("Sparse matrix")
print(sparse_row_matrix.toarray()) # convert to array

# compute similarity between each pair of documents
similarities = cosine_similarity(sparse_row_matrix)

print("Similarity matrix")
print(similarities)
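
Output (row 0 and column 0 are all zeros because the document IDs start at 1, so the matrix gets an unused row at index 0):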

Sparse matrix
[[0.  0.  0. ]
 [0.3 0.7 0. ]
 [0.1 0.  0.9]
 [0.6 0.2 0.2]]
Similarity matrix
[[0.         0.         0.         0.        ]
 [0.         1.         0.04350111 0.63344607]
 [0.         0.04350111 1.         0.39955629]
 [0.         0.63344607 0.39955629 1.        ]]
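
To avoid the empty row 0, the document IDs can be mapped to consecutive row numbers in the same way the terms are mapped to columns. The following is a minimal sketch of that variation, not part of the original answer; the names `raw`, `doc_index` and `term_index` are illustrative, and it assumes each (Doc, Term) pair appears at most once, since `coo_matrix` sums duplicate entries when converted.

```python
import csv
import io

from scipy.sparse import coo_matrix
from sklearn.metrics.pairwise import cosine_similarity

raw = u"""Doc, Term,    TFIDF score
1,   apples,  0.3
1,   bananas, 0.7
2,   apples,  0.1
2,   pears,   0.9
3,   apples,  0.6
3,   bananas, 0.2
3,   pears,   0.2"""

doc_index = {}   # document ID -> row number
term_index = {}  # term -> column number
rows, cols, data = [], [], []

# skipinitialspace drops the padding after each comma
reader = csv.reader(io.StringIO(raw), skipinitialspace=True)
next(reader)  # skip the header row

for doc, term, score in reader:
    # setdefault assigns the next free index the first time a key is seen
    rows.append(doc_index.setdefault(doc, len(doc_index)))
    cols.append(term_index.setdefault(term, len(term_index)))
    data.append(float(score))

# An explicit shape keeps the matrix exactly (documents x terms)
matrix = coo_matrix(
    (data, (rows, cols)),
    shape=(len(doc_index), len(term_index)),
).tocsr()

similarities = cosine_similarity(matrix)
print(matrix.toarray())
print(similarities)
```

With this mapping the similarity matrix is 3 x 3, and `doc_index` records which row belongs to which document ID, which also works when the IDs are not consecutive integers.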

If you can use pandas to read the whole CSV file into a DataFrame first, it becomes easier:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('sample.csv', index_col=None, skipinitialspace=True)

# Converting the text Term to column index
le = LabelEncoder()
df['column']=le.fit_transform(df['Term'])

# Converting the Doc to row index
df['row']=df['Doc'] - 1

# Rows will be equal to max index of document
num_rows = max(df['row'])+1

# Columns will be equal to number of distinct terms
num_cols = len(le.classes_)

# Initialize the array with all zeroes
tfidf_arr = np.zeros((num_rows, num_cols))

# Iterate the dataframe and set the appropriate values in tfidf_arr
for index, row in df.iterrows():
    tfidf_arr[row['row'],row['column']]=row['TFIDF score']

See the comments in the code, and ask if anything is unclear.

You can take a look at numpy.ndarray.transpose
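
The loop above stops once the array is filled. As a hedged sketch of how the same idea might be finished without an explicit loop, the long table can be pivoted into the document/term matrix and the cosine similarity computed with plain numpy. This is an illustration rather than the answer's own method: it reads the sample data from a string instead of `sample.csv`, and it assumes a reasonably recent pandas, no duplicate (Doc, Term) pairs, and no document whose scores are all zero (which would cause a division by zero).

```python
import io

import numpy as np
import pandas as pd

raw = u"""Doc, Term,    TFIDF score
1,   apples,  0.3
1,   bananas, 0.7
2,   apples,  0.1
2,   pears,   0.9
3,   apples,  0.6
3,   bananas, 0.2
3,   pears,   0.2"""

df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# One row per document, one column per term; absent pairs become 0
doc_term = df.pivot(index="Doc", columns="Term", values="TFIDF score").fillna(0.0)

# Cosine similarity: scale each row to unit length, then take dot products
values = doc_term.to_numpy()
norms = np.linalg.norm(values, axis=1, keepdims=True)
unit_rows = values / norms
similarities = pd.DataFrame(
    unit_rows.dot(unit_rows.T),
    index=doc_term.index,
    columns=doc_term.index,
)

print(doc_term)
print(similarities)
```

Keeping the result in a DataFrame means rows and columns stay labelled by the original document IDs, so `similarities.loc[1, 3]` is the similarity between documents 1 and 3.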