How to vectorize a searching function in Matlab?
Here is a Matlab coding problem (a slightly different version uses intersect rather than setdiff): I have a rating matrix A with 3 columns. The 1st column holds user IDs, which may be duplicated; the 2nd column holds item IDs, which may be duplicated; the 3rd column holds the rating from user to item, ranging from 1 to 5. I also have a subset of user IDs, smallUserIDList, and a subset of item IDs, smallItemIDList. I want to find the rows of A rated by the users in smallUserIDList, collect the items each of those users rated, and do some calculation on them, such as taking the setdiff with smallItemIDList and counting the result, as the following code does:
userStat = zeros(length(smallUserIDList), 1);
for i = 1:length(smallUserIDList)
    A2 = A(A(:,1) == smallUserIDList(i), :);
    itemIDList_each = unique(A2(:,2));
    setDiff = setdiff(itemIDList_each, smallItemIDList);
    userStat(i) = length(setDiff);
end
userStat
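Finally, the profile viewer shows that the loop above is inefficient. The question is how to improve this code through vectorization rather than relying on the for loop.
For example:
Input:
Output: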
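To make the computation concrete, the loop can be traced on a tiny data set. The values below are hypothetical stand-ins chosen for illustration, not the sample data from the question.

```matlab
% Hypothetical data: columns are [userID, itemID, rating]
A = [1 10 5; 1 40 3; 1 40 4; 3 20 4; 3 50 2; 3 60 1; 2 70 1];
smallUserIDList = [1; 3];
smallItemIDList = [10; 20; 30];

userStat = zeros(numel(smallUserIDList), 1);
for i = 1:numel(smallUserIDList)
    % Rows of A rated by the i-th user in the list
    rowsOfUser = A(:,1) == smallUserIDList(i);
    % Distinct items that user rated
    itemIDList_each = unique(A(rowsOfUser, 2));
    % Count the distinct items that are NOT in smallItemIDList
    userStat(i) = numel(setdiff(itemIDList_each, smallItemIDList));
end
% User 1 rated items {10, 40}; only 40 is outside the list  -> 1
% User 3 rated items {20, 50, 60}; 50 and 60 are outside    -> 2
disp(userStat)
```

Note that user 1 rated item 40 twice, but unique collapses the repeat, so it is counted once.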
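I think you are trying to remove a fixed set of ratings for a subset of users and count the number of remaining ratings. Does the following work?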
Asub = A(ismember(A(:,1), smallUserIDList),1:2);
Bremove = allcomb(smallUserIDList, smallItemIDList);
Akeep = setdiff(Asub, Bremove, 'rows');
T = varfun(@sum, array2table(Akeep), 'InputVariables', 'Akeep2', 'GroupingVariables', 'Akeep1');
% userStat = T.GroupCount;
You need the allcomb function from the MATLAB Central File Exchange; it gives the Cartesian product of two vectors and is easy to implement yourself.
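For readers without the File Exchange download, the Cartesian product of two vectors can be sketched with the built-in ndgrid. This is only a stand-in for allcomb: the row order may differ from what allcomb returns, which does not matter for setdiff(...,'rows'). The sample values are hypothetical. The second half shows that, on this data, the same rows can be obtained from two ismember masks without building the product at all.

```matlab
% Hypothetical data: columns of A are [userID, itemID, rating]
A = [1 10 5; 1 40 3; 3 20 4; 3 50 2; 2 10 1];
smallUserIDList = [1; 3];
smallItemIDList = [10; 20; 30];

% Cartesian product via ndgrid: one row per (user, item) combination
[U, I] = ndgrid(smallUserIDList, smallItemIDList);
Bremove = [U(:), I(:)];

% Route 1: remove every (user, item) combination from the users' rows
Asub = A(ismember(A(:,1), smallUserIDList), 1:2);
Akeep_prod = setdiff(Asub, Bremove, 'rows');

% Route 2: two masks, no Cartesian product needed
keep = ismember(A(:,1), smallUserIDList) & ~ismember(A(:,2), smallItemIDList);
Akeep_mask = unique(A(keep, 1:2), 'rows');

disp(isequal(Akeep_prod, Akeep_mask))
```

The mask route avoids allocating numel(smallUserIDList) * numel(smallItemIDList) rows, which matters when both lists are long.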
This could be one vectorized approach:
%// Take care of equality between first column of A and smallUserIDList to
%// find the matching row and column indices.
%// NOTE: This corresponds to "A(:,1) == smallUserIDList(i)" from OP.
[R,C] = find(bsxfun(@eq,A(:,1),smallUserIDList.')); %//'
%// Take care of non-equality between second column of A and smallItemIDList.
%// NOTE: This corresponds to SETDIFF in the original loopy code from OP.
mask1 = ~ismember(A(R,2),smallItemIDList);
AR2 = A(R,2); %// Elements from 2nd col of A that has matches from first step
%// Get only those elements from C and AR2 that has ONES in mask1
C1 = C(mask1);
AR2 = AR2(mask1);
%// Initialized output array
userStat = zeros(numel(smallUserIDList),1);
if ~isempty(C1) %// There is at least one element in C1, so do further processing
    %// Find the count of duplicate elements for each ID in C1 indexed into AR2.
    %// NOTE: This corresponds to "unique(A2(:,2))" from OP.
    dup_counts = accumarray(C1,AR2,[],@(x) numel(x)-numel(unique(x)));
    %// Get the count of matches for each ID in C in the mask1.
    %// NOTE: This corresponds to:
    %// "length(setdiff(itemIDList_each , smallItemIDList))" from OP.
    accums = accumarray(C,mask1);
    %// Store the counts in output array and also subtract the dup counts
    userStat(1:numel(accums)) = accums;
    userStat(1:numel(dup_counts)) = userStat(1:numel(dup_counts)) - dup_counts;
end
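The bsxfun comparison above builds a logical matrix of size numel(A(:,1)) by numel(smallUserIDList), which can get large. As a hedged alternative sketch (not the answerer's code), the same counts can be obtained from unique rows, accumarray and ismember. The sample data is hypothetical.

```matlab
% Hypothetical data: columns are [userID, itemID, rating]
A = [1 10 5; 1 40 3; 1 40 4; 3 20 4; 3 50 2; 3 60 1; 2 70 1];
smallUserIDList = [3; 1; 9];   % user 9 has no ratings in A
smallItemIDList = [10; 20];

% Drop rows whose item is in smallItemIDList, then keep each (user, item) once
pairs = unique(A(~ismember(A(:,2), smallItemIDList), 1:2), 'rows');

% Count the remaining distinct items per distinct user ID
[users, ~, grp] = unique(pairs(:,1));
counts = accumarray(grp, 1);

% Map the counts back onto smallUserIDList, preserving its order and repeats
[found, loc] = ismember(smallUserIDList, users);
userStat = zeros(numel(smallUserIDList), 1);
userStat(found) = counts(loc(found));
disp(userStat)
```

Users that never appear in A, or whose items were all removed, keep the initial count of zero.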
Benchmarking
The code listed below compares the runtime of the proposed approach against the original loopy code:
%// Size parameters and random inputs with them
A_nrows = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
A = [randi(max_userID,A_nrows,1) randi(max_itemID,A_nrows,1) randi(5,A_nrows,1)];
smallUserIDList = randi(max_userID,IDlist_len,1);
smallItemIDList = randi(max_itemID,IDlist_len,1);
disp('---------------------------- With Original Approach')
tic
%// Original posted code
toc
disp('---------------------------- With Proposed Approach')
tic
%// Proposed approach code
toc
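The benchmark skeleton leaves the two code bodies as placeholders. A filled-in sketch might look like the following; it also checks that both approaches return the same result before the timings are compared. The sizes and the names statLoop, statVec, nUsersInList and nItemsInList are assumptions made for this sketch, and no timings are claimed here.

```matlab
% Hypothetical, small sizes so that the loop finishes quickly
A_nrows = 2000; nUsersInList = 300; nItemsInList = 30;
max_userID = 100; max_itemID = 100;
A = [randi(max_userID,A_nrows,1) randi(max_itemID,A_nrows,1) randi(5,A_nrows,1)];
smallUserIDList = randi(max_userID,nUsersInList,1);
smallItemIDList = randi(max_itemID,nItemsInList,1);

% Original loop from the question
tic
statLoop = zeros(nUsersInList,1);
for i = 1:nUsersInList
    A2 = A(A(:,1) == smallUserIDList(i), :);
    statLoop(i) = numel(setdiff(unique(A2(:,2)), smallItemIDList));
end
tLoop = toc;

% Proposed vectorized approach
tic
[R,C] = find(bsxfun(@eq, A(:,1), smallUserIDList.'));
AR2 = A(R,2);
mask1 = ~ismember(AR2, smallItemIDList);
C1 = C(mask1);
AR2 = AR2(mask1);
statVec = zeros(nUsersInList,1);
if ~isempty(C1)
    dup_counts = accumarray(C1, AR2, [], @(x) numel(x)-numel(unique(x)));
    accums = accumarray(C, double(mask1));
    statVec(1:numel(accums)) = accums;
    statVec(1:numel(dup_counts)) = statVec(1:numel(dup_counts)) - dup_counts;
end
tVec = toc;

% Timings only mean something if the results agree
assert(isequal(statLoop, statVec))
fprintf('loop: %.4f s, vectorized: %.4f s\n', tLoop, tVec);
```

A single tic/toc pair is noisy. If each approach is wrapped in its own function, MATLAB's timeit, which takes a function handle, is generally a steadier way to measure.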
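The runtimes obtained with three sets of data sizes were:
Case 1:
Case 2:
Case 3:
Conclusion: the speedup of the proposed approach over the original loopy code appears to be huge.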
Vanilla MATLAB:
As far as I can tell, your code is equivalent to:
%// Create matrix such that: user_item_rating(user,item)==rating
user_item_rating = sparse(A(:,1),A(:,2),A(:,3));
%// Keep all BUT the items in smallItemIDList
user_item_rating(:,smallItemIDList) = [];
%// Keep only those users in `smallUserIDList` and use order of this list
user_item_rating = user_item_rating(smallUserIDList,:);
%// Count the number of ratings
userStat = sum(user_item_rating~=0, 2);
This works if there is at most one rating per (user, item) combination, and it should be fairly efficient.
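Two practical details of the sparse approach are worth spelling out. Indexing fails if an ID in either list exceeds the matrix size, and the result comes back as a sparse vector. The sketch below sizes the matrix explicitly and wraps the result in full. It is a variant under assumptions, not the answerer's exact code: it stores a 1 per rated pair instead of the rating, so repeated (user, item) rows are also handled. The sample data is hypothetical.

```matlab
% Hypothetical data: columns are [userID, itemID, rating]
A = [1 10 5; 1 40 3; 1 40 4; 3 20 4; 3 50 2; 3 60 1; 2 70 1];
smallUserIDList = [3; 1; 9];      % user 9 never appears in A
smallItemIDList = [10; 20; 99];   % item 99 never appears in A

% Size the matrix so that every ID in A and in both lists is a valid index
nUsers = max([A(:,1); smallUserIDList]);
nItems = max([A(:,2); smallItemIDList]);

% rated(user, item) == 1 if the user rated the item at least once
rated = double(sparse(A(:,1), A(:,2), 1, nUsers, nItems) > 0);

% Remove the item columns we do not want to count
rated(:, unique(smallItemIDList)) = [];

% Select users in list order, count per row, and convert to a full vector
userStat = full(sum(rated(smallUserIDList, :), 2));
disp(userStat)
```

Thresholding with > 0 collapses the duplicate rating of item 40 by user 1 into a single entry.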
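A clean approach without reinventing the wheel:
Check out grpstats from the Statistics Toolbox!
An implementation could look something like this: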
%// Create ratings table
ratings = array2table(A, 'VariableNames', {'user','item','rating'});
%// Remove items we don't care about (smallItemIDList)
ratings = ratings(~ismember(ratings.item, smallItemIDList),:);
%// Keep only users we care about (smallUserIDList)
ratings = ratings(ismember(ratings.user, smallUserIDList),:);
%// Compute the statistics grouped by 'user'.
userStat = grpstats(ratings, 'user');
It would be good if you added sample data and the expected output, so that people can compare their answers.
I wonder whether it would help to put the loop computation inside a function. That way the optimizer would recognize that you only care about userStat and would not copy the other variables into the workspace.
Is it possible to have two entries with the same userID and the same itemID but different ratings? If not, just build a sparse matrix.
@kkuilla Hi! Good idea, I have added sample data and output to make the question clearer.
I like the use of tables, but generating all combinations of smallUserIDList and smallItemIDList is a bit overkill. By the way: even after correcting the superfluous parenthesis and the missing comma, the code doesn't work, because the matrices in the setdiff line don't have the same number of columns.
@alexandre iolov, Hi! Thanks for your answer!! I tried your code and changed Asub = A(ismember(A(:,1), smallUserIDList),:); to Asub = A(ismember(A(:,1), smallUserIDList),1:2);, Var2 to Akeep2 and Var1 to Akeep1, and then it works!! With my sample data the result is userStat=12, which is slightly different from my expected output. Still, I learned a lot of new methods from your code, thanks a lot!
@knedlsepp - thanks for the correction. Both matrices should have two columns, but I definitely did not try running the code, since I had no examples of A, smallUserIDList and smallItemIDList and didn't bother to invent plausible ones. How would you avoid the Cartesian product?
@alexandreiolov: Have a look at my answer concerning tables. The ismember step could also be applied in your answer. I think the final ratings variable should match your array2table(Akeep).
Hi! I have to say your code is so fast!! Using it, my data processing time dropped by at least 20 seconds!! Incredible! In fact, setdiff is only one branch of my code; I have another function that uses intersect instead of setdiff. Could you help if setdiff is replaced by intersect? Thanks a lot!
@archenoo That's a great speedup indeed!! Well, I'm not sure how that would differ from the code in this question. How about posting it as a new question?
I posted another question related to this one, hoping for an answer, thanks!!
Looks like it works, and it is fast indeed! Note: wrap userStat in full to get a numeric array as the output of the sparse approach.
@knedlsepp Hi! Sorry for the late reply!! At first I tried your first code, but it didn't work because I had accidentally typed a wrong variable name!! Now that I've found the mistake, I can thank you for your answer, since it really improved my running speed and cut another 20 seconds!! Amazing!! Would you take a look at my other two questions? The first is added as a link in the question. The second is how to modify your answer if setIntersect = intersect(itemIDList_each, smallItemIDList); userStat(i) = length(setIntersect); is changed to userStat(i) = length(itemIDList_each);. Thanks again!!