TF-IDF for a search query in MATLAB

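I have implemented a machine learning algorithm called MMR (Maximal Marginal Relevance). Basically, I have a query and a set of documents, and the algorithm computes how relevant each document is to whatever query I supply.

Right now I am using the 20 Newsgroups dataset in tf-idf format, found here: (), in a matrix called fea. I'm a bit confused: I'm not sure whether my query is in tf-idf format, because in my code both the query and the documents are supposed to be in tf-idf format.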

function [result,index] = mmr3(query,lambda,docs)
% query  - word IDs of the query terms
% lambda - weight between relevance and diversity
% docs   - IDs of the candidate documents

load fea1

fea1 = fea1';              % terms x documents (26214 x 18846)

queries = zeros(1,26214);

queries(query) = 1/size(query,2);  % equal weight on every query term (no idf weighting)
query = queries';
A = fea1(:,docs);          % columns of the candidate documents (18846 documents in total)
filenames = docs;

selected = A(:,1);         % select the first document; this assumes the first document listed
                           % is also the most relevant to the query

selectedNames = docs(1);   % ID of the selected document
filenames(1) = [];         % remove it by position, not by document ID
rest = A(:,2:end);         % the other documents go to rest

result = zeros(1,5);
for i = 1:5                          % pick five more documents by MMR
    MMRmax = -10;
    for k = 1:size(rest,2)           % loop through not-yet-selected documents
        maxSim = 0;
        for j = 1:size(selected,2)   % loop through selected documents (j, so the outer i is not overwritten)
            s = sim1(selected(:,j),rest(:,k));
            if s > maxSim            % remember the highest cosine similarity to a selected document
                maxSim = s;
            end
        end
        % Note: the usual MMR definition is lambda*sim(query,d) - (1-lambda)*maxSim;
        % here lambda multiplies both terms.
        MMR = lambda*(sim1(query,rest(:,k)) - (1-lambda)*maxSim);
        if MMR > MMRmax              % keep the candidate with the highest MMR
            MMRmax = MMR;
            result(i) = MMRmax;
            selected2 = k;
        end
    end

    selected(:,i+1) = rest(:,selected2);        % select the document with the highest MMR
    selectedNames(i+1) = filenames(selected2);  % ID of the selected document
    rest(:,selected2) = [];                     % delete that document from rest
    filenames(selected2) = [];
end
index = selectedNames;
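
For reference, MMR as it is usually defined (Carbonell and Goldstein, 1998) picks the next document as follows, where Q is the query, R the candidate set, S the documents already selected, and Sim1 and Sim2 the similarity functions (both are cosine similarity in the code above):

$$
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S} \left[ \lambda\, \mathrm{Sim}_1(D_i, Q) - (1-\lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \right]
$$

In this form lambda weights only the relevance term. The line that computes MMR in the code applies lambda to both terms, so the effective diversity penalty there is lambda*(1-lambda) rather than (1-lambda).
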
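I used cosine similarity between the query and the documents:

function [sim2] = sim1(A,B)
% Cosine similarity between column vectors A and B

sim2 = (A'*B)/(norm(A)*norm(B));
if isnan(sim2)      % a zero vector gives 0/0; treat that as no similarity
    sim2 = 0;
end

Here are the input and output: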

[result,index]=mmr3([1,2,3],0.2,[1:20])

result= 0.0012   -0.0018    -0.0040    -0.0043    -0.0080

index= 1    10    17    5     20   8

Any suggestions would be greatly appreciated.

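What exactly is your question?

My dataset is a matrix of size 18846 x 26214, that is, DocumentID x WordID. I want the query in tf-idf format. However, I'm not sure I'm doing this correctly, because the query should be a vector of length 26214 that is compared against all the documents in my dataset "fea1" before the MMR is computed.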
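
One possible way to get a tf-idf weighted query of length 26214 is sketched below. It is only a sketch under assumptions: buildQuery is a hypothetical helper (not part of the code above), fea is taken to be the untransposed documents x terms matrix, document frequencies are estimated from its nonzero entries, and idf = log(N/df) is an assumed variant that may differ from the weighting used to build the dataset.

```matlab
% Sketch only: buildQuery is a hypothetical helper, not part of the original code.
% termIds - word IDs in the query (a repeated ID counts as a higher term frequency)
% fea     - documents x terms tf-idf matrix (assumed 18846 x 26214 here)
function q = buildQuery(termIds, fea)
    [nDocs, nTerms] = size(fea);

    % Term frequency of each word in the query
    tf = accumarray(termIds(:), 1, [nTerms, 1]);

    % Document frequency: number of documents with a nonzero weight for each term
    df = full(sum(fea ~= 0, 1))';

    % Assumed idf variant; terms that occur in no document get df = 1 to avoid division by zero
    idf = log(nDocs ./ max(df, 1));

    q = tf .* idf;

    % L2-normalise so the query is on a comparable scale to unit-length documents
    n = norm(q);
    if n > 0
        q = q / n;
    end
end
```

For example, `q = buildQuery([1,2,3], fea)` could stand in for the `queries`/`query` lines in mmr3. Since sim1 divides by the norms of both vectors, the 1/size(query,2) scaling in the original code does not change the cosine similarity, so that query behaves like a binary term vector rather than a tf-idf one.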