How to vectorize a search function in Matlab?


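Here is a Matlab coding problem (a slightly different version uses intersect instead of setdiff):

A rating matrix A has 3 columns: the 1st column is the user ID, which may be duplicated; the 2nd column is the item ID, which may be duplicated; the 3rd column is the rating the user gave the item, ranging from 1 to 5.

Now I have a subset of user IDs, smallUserIDList, and a subset of item IDs, smallItemIDList. For each user in smallUserIDList I want to find the rows of A rated by that user, collect the items the user rated, and do some calculation on them, such as a setdiff with smallItemIDList followed by counting the result, as the following code does: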

userStat = zeros(length(smallUserIDList), 1);
for i = 1:length(smallUserIDList)
    A2= A(A(:,1) == smallUserIDList(i), :);
    itemIDList_each = unique(A2(:,2));

    setDiff = setdiff(itemIDList_each , smallItemIDList);
    userStat(i) = length(setDiff);
end
userStat
The profile viewer shows that the loop above is inefficient. How can this code be vectorized so that it no longer relies on the for loop?

For example:

Input:

Output:


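I think you are trying to remove a fixed set of items from the ratings of a subset of users and count the number of remaining ratings.

Does the following work?

% (user, item) pairs rated by the users in smallUserIDList
Asub = A(ismember(A(:,1), smallUserIDList),1:2);
% Every (user, item) pair that should be removed
Bremove = allcomb(smallUserIDList, smallItemIDList);
% Remaining pairs; setdiff with 'rows' also drops duplicate rows
Akeep = setdiff(Asub, Bremove, 'rows');
% Group by user; the resulting table carries a GroupCount per user
T = varfun(@sum, array2table(Akeep), 'InputVariables', 'Akeep2', 'GroupingVariables', 'Akeep1');
% userStat = T.GroupCount;

You need the allcomb function from the MATLAB Central File Exchange. It gives the Cartesian product of two vectors and is easy to implement yourself. Note that T only contains users that still have at least one item left, sorted by user ID, so userStat is not aligned with smallUserIDList.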
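
For readers without the File Exchange download, the two-vector Cartesian product can be sketched with the built-in ndgrid. This is only a stand-in for allcomb under the assumption that row order does not matter (setdiff with 'rows' does not care); the sample ID lists are hypothetical and not taken from the question.

```matlab
% Hypothetical ID lists, for illustration only
smallUserIDList = [1; 3; 7];
smallItemIDList = [10; 13];

% ndgrid replicates each vector along its own dimension
[u, it] = ndgrid(smallUserIDList, smallItemIDList);

% One row per (user, item) combination; users vary fastest here
Bremove = [u(:) it(:)];
```

The result has numel(smallUserIDList) * numel(smallItemIDList) rows, which is why this route can become memory-hungry for long lists.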

This could be one vectorized approach -

%// Take care of equality between first column of A and smallUserIDList to 
%// find the matching row and column indices.
%// NOTE: This corresponds to "A(:,1) == smallUserIDList(i)" from OP.
[R,C] = find(bsxfun(@eq,A(:,1),smallUserIDList.')); %//'

%// Take care of non-equality between second column of A and smallItemIDList. 
%// NOTE: This corresponds to SETDIFF in the original loopy code from OP.
mask1 = ~ismember(A(R,2),smallItemIDList);

AR2 = A(R,2); %// Elements from 2nd col of A that has matches from first step

%// Get only those elements from C and AR2 that has ONES in mask1
C1 = C(mask1);
AR2 = AR2(mask1);

%// Initialized output array
userStat = zeros(numel(smallUserIDList),1);

if ~isempty(C1)%//There is at least one element in C, so do further processing
    
    %// Find the count of duplicate elements for each ID in C1 indexed into AR2.
    %// NOTE: This corresponds to "unique(A2(:,2))" from OP.
    dup_counts = accumarray(C1,AR2,[],@(x) numel(x)-numel(unique(x)));
    
    %// Get the count of matches for each ID in C in the mask1.
    %// NOTE: This corresponds to:
    %//       "length(setdiff(itemIDList_each , smallItemIDList))" from OP.
    accums = accumarray(C,mask1);
    
    %// Store the counts in output array and also subtract the dup counts
    userStat(1:numel(accums)) = accums;
    userStat(1:numel(dup_counts)) = userStat(1:numel(dup_counts)) - dup_counts;
end
Benchmarking

The code listed below compares the runtime of the proposed approach against the original loopy code -

%// Size parameters and random inputs with them
A_nrows    = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
A = [randi(max_userID,A_nrows,1) randi(max_itemID,A_nrows,1) randi(5,A_nrows,1)];
smallUserIDList = randi(max_userID,IDlist_len,1);
smallItemIDList = randi(max_itemID,IDlist_len,1);

disp('---------------------------- With Original Approach')
tic
%//   Original posted code
toc

disp('---------------------------- With Proposed Approach')
tic
%//   Proposed approach code
toc
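The runtimes obtained with three sets of data were as follows -

Case 1:

Case 2:

Case 3:

Conclusion: the speedup of the proposed approach over the original loopy code appears to be huge.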
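
The bsxfun comparison above builds a full logical matrix of size rows(A) by numel(smallUserIDList). A more compact variant of the same idea can be sketched with unique, ismember and accumarray. It is not the answerer's code: it assumes smallUserIDList contains no repeated IDs (ismember reports only the first matching position), and the sample data is hypothetical.

```matlab
% Hypothetical sample data (not from the original question)
A = [1 10 5; 1 11 3; 1 10 4; 2 10 2; 3 12 1; 3 13 5; 3 14 2];
smallUserIDList = [1; 3; 7];
smallItemIDList = [10; 13];

% Drop duplicate (user, item) pairs once; replaces the per-user unique()
UI = unique(A(:,1:2), 'rows');

% Keep pairs whose item is NOT in smallItemIDList (the setdiff step)
UI = UI(~ismember(UI(:,2), smallItemIDList), :);

% Position of each remaining pair's user inside smallUserIDList
[tf, loc] = ismember(UI(:,1), smallUserIDList);

% Count remaining items per listed user; users without any get 0
userStat = accumarray(loc(tf), 1, [numel(smallUserIDList) 1]);
% userStat is [1; 2; 0] for the data above
```

On this data the original loop gives the same counts: user 1 keeps item 11, user 3 keeps items 12 and 14, and user 7 has no ratings at all.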

Vanilla MATLAB:

As far as I can tell, your code is equivalent to:

%// Create matrix such that: user_item_rating(user,item)==rating
user_item_rating = sparse(A(:,1),A(:,2),A(:,3));

%// Keep all BUT the items in smallItemIDList
user_item_rating(:,smallItemIDList) = [];

%// Keep only those users in `smallUserIDList` and use order of this list
user_item_rating = user_item_rating(smallUserIDList,:);

%// Count the number of ratings
userStat = sum(user_item_rating~=0, 2);
This works if there is at most one rating per (user, item) combination, and it should be rather efficient.
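
The sparse version indexes the matrix with the ID lists, so an ID in smallUserIDList or smallItemIDList that is larger than any ID present in A would be out of bounds. A hedged variant, assuming all IDs are positive integers, sizes the matrix explicitly and wraps the result in full, as suggested in the comments. The sample data and the names nUsers, nItems and M are hypothetical.

```matlab
% Hypothetical sample data (not from the original question)
A = [1 10 5; 1 11 3; 1 10 4; 2 10 2; 3 12 1; 3 13 5; 3 14 2];
smallUserIDList = [1; 3; 7];
smallItemIDList = [10; 13];

% Size the matrix so that every listed ID is a valid index
nUsers = max([A(:,1); smallUserIDList(:)]);
nItems = max([A(:,2); smallItemIDList(:)]);

% Store 1 per rating; sparse sums duplicates, so entries stay nonzero
M = sparse(A(:,1), A(:,2), 1, nUsers, nItems);

% Remove the item columns listed in smallItemIDList
M(:, smallItemIDList) = [];

% Count remaining items per listed user and return a full column vector
userStat = full(sum(M(smallUserIDList,:) ~= 0, 2));
% userStat is [1; 2; 0] for the data above
```

Because only nonzero entries are counted, repeated (user, item) ratings are counted once here, which matches the unique call in the original loop.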

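Clean approach without reinventing the wheel:

Check out grpstats from the Statistics Toolbox! An implementation could look similar to this: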

%// Create ratings table
ratings = array2table(A, 'VariableNames', {'user','item','rating'});

%// Remove items we don't care about (smallItemIDList)
ratings = ratings(~ismember(ratings.item, smallItemIDList),:);

%// Keep only users we care about (smallUserIDList) 
ratings = ratings(ismember(ratings.user, smallUserIDList),:);

%// Compute the statistics grouped by 'user'. 
userStat = grpstats(ratings, 'user');

It would be good if you added sample data and the expected output so that people can compare their answers.

I wonder whether it would help to put the loop computation inside a function. That way the optimizer would recognize that you only care about userStat and would not copy the other variables into the workspace.

Is it possible to have two entries with the same userID and the same itemID but different ratings? If not, just build a sparse matrix.

@kkuilla Hi! Good idea, I have added sample data and output to make the question more explicit.

I like using tables, but generating all combinations of smallUserIDList and smallItemIDList is a bit of overkill. By the way: even after correcting the superfluous bracket and the missing comma, the code doesn't work, because the matrices in the setdiff line don't have the same number of columns.

@alexandre iolov, Hi! Thanks for your answer! I tried your code and changed Asub = A(ismember(A(:,1), smallUserIDList),:); to Asub = A(ismember(A(:,1), smallUserIDList),1:2);, Var2 to Akeep2 and Var1 to Akeep1, and then it works! With my sample data the result is userStat=12, which is slightly different from my expected output. Still, I can learn a lot of new methods from your code, thanks a lot!

@knedlsepp - thanks for the correction. Both matrices should have two columns, but I admittedly didn't try to run the code, since I didn't have an example of A, smallUserIDList and smallItemIDList and didn't bother to invent plausible ones. How would you avoid the Cartesian product?

@alexandreiolov: Have a look at my answer on tables. The ismember step could be applied in your answer as well. I think the final ratings variable should match your array2table(Akeep).

Hi! I have to say your code is really fast! Using it, the data processing time dropped by at least 20 seconds! Unbelievable. In fact, setdiff is just one branch of my code; I have another function that uses intersect rather than setdiff. Could you help with the case where setdiff is replaced by intersect? Thanks a lot!

@archenoo That's a great speedup! Hmm, I'm not sure how that differs from the code in this question. How about posting it as a new question?

I posted another question related to this one, hoping for an answer, thanks!

Looks efficient, and it is indeed fast! One note: wrap userStat with full to get a numeric array as the output of the sparse approach.

@knedlsepp Hi! Sorry for the late reply! At first I tried your first code, but it didn't work because I had accidentally typed a wrong variable name! Now that I have found the error, I can thank you for your answer, as it really improved my running speed and cut another 20 seconds! Really amazing! Would you like to take a look at my other two questions? The first is the link added to the question, and the second is how to modify your answer if setIntersect = intersect(itemIDList_each, smallItemIDList); userStat(i) = length(setIntersect); is changed to userStat(i) = length(itemIDList_each);. Thanks again!