Python: how to compute cosine similarity from already-computed TF-IDF scores
I need to calculate the cosine similarity between documents using TF-IDF scores that have already been computed.

Normally I would use (e.g.) TfidfVectorizer to create a document/term matrix and calculate the TF-IDF scores. I can't apply that here, because it would recompute the TF-IDF scores. That would be incorrect, since the documents have already gone through substantial preprocessing, including bag-of-words and IDF filtering (I won't explain why - too long).

Illustrative input CSV file:
Doc, Term, TFIDF score
1, apples, 0.3
1, bananas, 0.7
2, apples, 0.1
2, pears, 0.9
3, apples, 0.6
3, bananas, 0.2
3, pears, 0.2
I need to produce the matrix that TfidfVectorizer would normally generate, e.g.:
| apples | bananas | pears
1 | 0.3 | 0.7 | 0
2 | 0.1 | 0 | 0.9
3 | 0.6 | 0.2 | 0.2
...so that I can calculate the cosine similarity between documents.
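For reference, once a matrix like the one above is available as a dense numpy array, the pairwise similarities follow directly from scikit-learn's cosine_similarity. A minimal sketch with the illustrative values typed in by hand (the array name doc_term is made up here):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# illustrative document/term matrix from above, typed in by hand
doc_term = np.array([[0.3, 0.7, 0.0],   # doc 1: apples, bananas, pears
                     [0.1, 0.0, 0.9],   # doc 2
                     [0.6, 0.2, 0.2]])  # doc 3

print(cosine_similarity(doc_term))
# e.g. sim(doc 1, doc 3) = 0.32 / (sqrt(0.58) * sqrt(0.44)) ≈ 0.633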
I'm using Python 2.7, but suggestions for other solutions or tools are welcome. I can't easily switch to Python 3.
Edit:
This is not about transposing a numpy array. It is about mapping TF-IDF scores onto a document/term matrix, using the tokenized terms and filling missing values with 0. Below is an inefficient hack that I'll leave here in case it helps someone else. Other suggestions are welcome.
import csv
from sklearn.metrics.pairwise import cosine_similarity

# INPUT_FILE_PATH and TOTAL_NUMBER_OF_BOOKS are constants defined elsewhere

def calculate_cosine_distance():
    unique_terms = get_unique_terms_as_list()
    tfidf_matrix = [[0 for i in range(len(unique_terms))] for j in range(TOTAL_NUMBER_OF_BOOKS)]
    with open(INPUT_FILE_PATH, mode='r') as infile:
        reader = csv.reader(infile.read().splitlines(), quoting=csv.QUOTE_NONE)
        # Ignore header row
        next(reader)
        for rows in reader:
            book = int(rows[0]) - 1  # to make it a zero-indexed array
            term_index = unique_terms.index(rows[1])
            tfidf_matrix[book][term_index] = float(rows[2])  # store as float, not string
    # Similarity between book 0 and every book (including itself)
    print cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

def get_unique_terms_as_list():
    unique_terms = set()
    with open(INPUT_FILE_PATH, mode='rU') as infile:
        reader = csv.reader(infile.read().splitlines(), quoting=csv.QUOTE_NONE)
        # Skip header
        next(reader)
        for rows in reader:
            unique_terms.add(rows[1])
    return list(unique_terms)
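A small optional speed-up for the hack above (not part of the original post): unique_terms.index() is a linear scan on every CSV row, so for a large vocabulary it is faster to build a term-to-column dict once and look terms up in constant time. A minimal sketch:

# hypothetical helper, not in the original code: map each term to a column index once
def build_term_index(unique_terms):
    return {term: i for i, term in enumerate(unique_terms)}

term_to_index = build_term_index(['apples', 'bananas', 'pears'])
print(term_to_index['pears'])  # 2, i.e. the column to write the TF-IDF score into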
I would suggest using scipy.sparse:
from scipy.sparse import csr_matrix, coo_matrix
from sklearn.metrics.pairwise import cosine_similarity

input = """Doc, Term, TFIDF score
1, apples, 0.3
1, bananas, 0.7
2, apples, 0.1
2, pears, 0.9
3, apples, 0.6
3, bananas, 0.2
3, pears, 0.2"""

voc = {}
# sparse matrix representation: the coefficient
# with coordinates (rows[i], cols[i]) contains value data[i]
rows, cols, data = [], [], []
for line in input.split("\n")[1:]:  # dismiss header
    doc, term, tfidf = line.replace(" ", "").split(",")
    # doc ids start at 1, so row 0 of the resulting matrix stays all-zero
    rows.append(int(doc))
    # map each vocabulary item to an int
    if term not in voc:
        voc[term] = len(voc)
    cols.append(voc[term])
    data.append(float(tfidf))

doc_term_matrix = coo_matrix((data, (rows, cols)))
# compressed sparse row matrix (type of sparse matrix with fast row slicing)
sparse_row_matrix = doc_term_matrix.tocsr()
print("Sparse matrix")
print(sparse_row_matrix.toarray())  # convert to array

# compute similarity between each pair of documents
similarities = cosine_similarity(sparse_row_matrix)
print("Similarity matrix")
print(similarities)
Output:
Sparse matrix
[[0. 0. 0. ]
[0.3 0.7 0. ]
[0.1 0. 0.9]
[0.6 0.2 0.2]]
Similarity matrix
[[0. 0. 0. 0. ]
[0. 1. 0.04350111 0.63344607]
[0. 0.04350111 1. 0.39955629]
[0. 0.63344607 0.39955629 1. ]]
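A short usage note, continuing from the snippet above (not part of the original answer): because the 1-based doc ids were used as row indices directly, row and column 0 are just an unused placeholder, and similarities[i, j] is the cosine similarity between doc i and doc j.

# continuing from the snippet above: index by the original 1-based doc ids
print(similarities[1, 3])  # doc 1 vs doc 3 -> ~0.6334
print(similarities[1, 2])  # doc 1 vs doc 2 -> ~0.0435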
It becomes easier if you can first read the whole CSV file into a dataframe with pandas:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('sample.csv', index_col=None, skipinitialspace=True)

# Converting the text Term to column index
le = LabelEncoder()
df['column'] = le.fit_transform(df['Term'])
# Converting the Doc to row index
df['row'] = df['Doc'] - 1

# Rows will be equal to max index of document
num_rows = max(df['row']) + 1
# Columns will be equal to number of distinct terms
num_cols = len(le.classes_)

# Initialize the array with all zeroes
tfidf_arr = np.zeros((num_rows, num_cols))

# Iterate the dataframe and set the appropriate values in tfidf_arr
for index, row in df.iterrows():
    tfidf_arr[row['row'], row['column']] = row['TFIDF score']
See the comments in the code, and ask if anything is unclear. You can also have a look at numpy.ndarray.transpose.
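To finish this pandas route (the final step is not shown above), the dense tfidf_arr can be passed straight to scikit-learn's cosine_similarity, just like the sparse matrix in the earlier answer. A minimal sketch, continuing from the code above:

from sklearn.metrics.pairwise import cosine_similarity

# tfidf_arr is the dense document/term array built above
similarities = cosine_similarity(tfidf_arr)
print(similarities)  # one row/column per document in the CSV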