TF-IDF for a search query in MATLAB
I have implemented a machine learning algorithm called MMR (Maximal Marginal Relevance). Basically I have a query and a set of documents, and the algorithm computes a relevance score for each document with respect to whatever query I give it. Right now I am using the 20 Newsgroups dataset in TF-IDF format, found here: (), in a variable called fea. I am a bit confused: I am not sure whether my query is in TF-IDF format, because in my code both the query and the documents are supposed to be in TF-IDF format.
function [result,index] = mmr3(query,lambda,docs)
    load fea1
    fea1 = fea1';                        % after the transpose, columns are documents
    queries = zeros(1,26214);
    queries(query) = 1/(size(query,2));  % normalize and set values at appropriate places
    query = queries';
    A = fea1(:,docs);
    % indexes of documents, 18846 different documents
    filenames = docs;
    selected = A(:,1);        % select first (most relevant) document; this assumes the first
                              % document listed is also the most relevant to the query
    selectedNames = docs(1);  % name of selected document
    filenames(1) = [];        % remove the selected document from the candidate names
    rest = A(:,2:end);        % other documents go to variable rest
    for i = 1:5               % pick the next five documents by MMR
        MMRmax = -10;
        for k = 1:size(rest,2)          % loop through not yet selected documents
            max1 = 0;
            for j = 1:size(selected,2)  % loop through selected documents
                s = sim1(selected(:,j),rest(:,k));
                if s > max1   % look for the most similar pair of selected and not yet selected documents
                    max1 = s; % remember highest cosine similarity
                end
            end
            MMR = lambda*(sim1(query,rest(:,k))-(1-lambda)*max1); % calculate MMR
            if MMR > MMRmax   % find max MMR
                MMRmax = MMR;
                result(i) = MMRmax;
                selected2 = k;
            end
        end
        selected(:,i+1) = rest(:,selected2);        % select document with highest MMR
        selectedNames(i+1) = filenames(selected2);  % name of selected document
        rest(:,selected2) = [];                     % delete that document from rest
        filenames(selected2) = [];
    end
    index = selectedNames;
    %selectedNames
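For comparison, this is the MMR criterion as it is usually stated (the Carbonell and Goldstein formulation). The symbols are those of the standard formulation, not of the code above: R is the candidate set, S the set of already selected documents, Q the query, and Sim_1, Sim_2 the similarity measures (cosine similarity in both cases here).

```latex
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S}
  \Bigl[ \lambda \, \mathrm{Sim}_1(D_i, Q)
         - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Bigr]
```

In this form lambda weights only the query-relevance term, whereas the MMR line in mmr3 multiplies the whole difference by lambda. The two expressions are not equivalent, so it may be worth checking which one is intended.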
I used cosine similarity between the query and the documents:
function [sim2] = sim1(A,B)
    sim2 = (A'*B)/(norm(A)*norm(B));  % cosine similarity of two column vectors
    if(isnan(sim2))                   % a zero vector gives 0/0, treat it as no similarity
        sim2 = 0;
    end
Here are the input and output:
[result,index]=mmr3([1,2,3],0.2,[1:20])
result= 0.0012 -0.0018 -0.0040 -0.0043 -0.0080
index= 1 10 17 5 20 8
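Any suggestions would be much appreciated.

What exactly is your question?

My dataset is an 18846*26214 matrix, i.e. DocumentID*WordID. I want my query in TF-IDF format. However, I am not sure I am doing it correctly, because the query should be a vector of length 26214, and the algorithm should go through all documents in my dataset "fea1" and then compute MMR.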
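One possible way to get a query vector that is closer to TF-IDF than the uniform 1/n weights is to estimate IDF from the corpus matrix itself. The following is only a sketch under several assumptions: a nonzero entry in the document-by-term matrix means the term occurs in that document, IDF is taken as log(N/df) (the dataset may have been built with a different variant), and the toy matrix X and the names termIds, df, idf, tf and q are hypothetical stand-ins, with X playing the role of fea1 before the transpose.

```matlab
% Toy stand-in for the 18846-by-26214 matrix (documents x terms).
% Replace X with the real fea1 as loaded, before transposing it.
X = sparse([0.5 0   0.2 0;
            0   0.3 0.1 0;
            0.4 0   0   0.7]);
[nDocs, nTerms] = size(X);

termIds = [1 2 3];                 % word IDs in the query, as in mmr3([1,2,3],...)

% Document frequency: in how many documents each term has a nonzero weight.
df  = full(sum(X > 0, 1));
% Assumed IDF variant log(N/df); max(df,1) avoids division by zero.
idf = log(nDocs ./ max(df, 1));

% Term frequency of the query: count how often each word ID occurs in it.
tf = accumarray(termIds(:), 1, [nTerms 1]);

% TF-IDF query as a column vector of length nTerms.
q = tf .* idf(:);
if norm(q) > 0
    q = q / norm(q);               % unit length, convenient for cosine similarity
end
```

With a vector like q, the line queries(query)=1/(size(query,2)) in mmr3 could be replaced so that rare query terms count for more than common ones. Since sim1 already divides by the norms, the final normalization is optional.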