How to vectorize a search function in MATLAB?

matlab optimization matrix

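Here is a MATLAB coding problem (there is also a slightly different version of it that uses intersect instead of setdiff):

I have a rating matrix A with 3 columns: column 1 holds user IDs (which may be duplicated), column 2 holds item IDs (which may be duplicated), and column 3 holds the rating the user gave the item, ranging from 1 to 5.

I also have a subset of user IDs, smallUserIDList, and a subset of item IDs, smallItemIDList. I want to find the rows of A that were rated by the users in smallUserIDList, collect the items each of those users rated, and do some calculation on them, such as a setdiff with smallItemIDList followed by counting the result, as in the following code: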

userStat = zeros(length(smallUserIDList), 1);
for i = 1:length(smallUserIDList)
    A2= A(A(:,1) == smallUserIDList(i), :);
    itemIDList_each = unique(A2(:,2));

    setDiff = setdiff(itemIDList_each , smallItemIDList);
    userStat(i) = length(setDiff);
end
userStat

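The profile viewer shows that the loop above is inefficient. The question is how to improve this piece of code through vectorization rather than relying on the for loop.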

For example:

Input:

Output:

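I think you are trying to remove a fixed set of ratings for a subset of users and count the number of remaining ratings.

Does the following work:

% Keep the (user, item) pairs that belong to the users of interest
Asub = A(ismember(A(:,1), smallUserIDList),1:2);
% All (user, item) pairs to discard; allcomb comes from the File Exchange
Bremove = allcomb(smallUserIDList, smallItemIDList);
% With 'rows', setdiff also collapses duplicate (user, item) pairs
Akeep = setdiff(Asub, Bremove, 'rows');
T = varfun(@sum, array2table(Akeep), 'InputVariables', 'Akeep2', 'GroupingVariables', 'Akeep1');
% T.GroupCount has one entry per user that still has items left, ordered by
% user ID rather than by smallUserIDList
% userStat = T.GroupCount;

You will need the allcomb function from the MATLAB Central File Exchange, which gives the Cartesian product of two vectors and is easy to implement anyway.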
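
The Cartesian product can be avoided by filtering on the item column with ismember and counting with accumarray. The following is a minimal sketch of that idea, not the answer's own code; the toy values of A, smallUserIDList and smallItemIDList are made up for illustration, and the loop from the question is included only as a reference to check against.

```matlab
% Hypothetical toy data (not taken from the question)
A = [1 10 5; 1 11 3; 1 10 4; 2 10 2; 2 12 1; 3 13 5];
smallUserIDList = [1; 3; 4];
smallItemIDList = [10; 12];

% Reference result: the loop from the question
userStatLoop = zeros(numel(smallUserIDList), 1);
for i = 1:numel(smallUserIDList)
    items = unique(A(A(:,1) == smallUserIDList(i), 2));
    userStatLoop(i) = numel(setdiff(items, smallItemIDList));
end

% Vectorized sketch
UI = unique(A(:,1:2), 'rows');                    % distinct (user, item) pairs
UI = UI(~ismember(UI(:,2), smallItemIDList), :);  % drop the excluded items
[uUsers, ~, ic] = unique(smallUserIDList);        % tolerate repeated user IDs
[tf, loc] = ismember(UI(:,1), uUsers);            % map pairs to requested users
counts = accumarray(loc(tf), 1, [numel(uUsers) 1]);
userStat = counts(ic);                            % back to smallUserIDList order
```

Unlike T.GroupCount, userStat here has one entry per element of smallUserIDList, including users with no remaining items.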

This could be one vectorized approach -

%// Take care of equality between first column of A and smallUserIDList to 
%// find the matching row and column indices.
%// NOTE: This corresponds to "A(:,1) == smallUserIDList(i)" from OP.
[R,C] = find(bsxfun(@eq,A(:,1),smallUserIDList.'));

%// Take care of non-equality between second column of A and smallItemIDList. 
%// NOTE: This corresponds to SETDIFF in the original loopy code from OP.
mask1 = ~ismember(A(R,2),smallItemIDList);

AR2 = A(R,2); %// Elements from 2nd col of A that has matches from first step

%// Get only those elements from C and AR2 that has ONES in mask1
C1 = C(mask1);
AR2 = AR2(mask1);

%// Initialized output array
userStat = zeros(numel(smallUserIDList),1);

if ~isempty(C1)%//There is at least one element in C, so do further processing
    
    %// Find the count of duplicate elements for each ID in C1 indexed into AR2.
    %// NOTE: This corresponds to "unique(A2(:,2))" from OP.
    dup_counts = accumarray(C1,AR2,[],@(x) numel(x)-numel(unique(x)));
    
    %// Get the count of matches for each ID in C in the mask1.
    %// NOTE: This corresponds to:
    %//       "length(setdiff(itemIDList_each , smallItemIDList))" from OP.
    accums = accumarray(C,mask1);
    
    %// Store the counts in output array and also subtract the dup counts
    userStat(1:numel(accums)) = accums;
    userStat(1:numel(dup_counts)) = userStat(1:numel(dup_counts)) - dup_counts;
end

Benchmarking

The code listed next compares the runtime of the proposed approach against the original loop code -

%// Size parameters and random inputs with them
A_nrows    = 5000;
IDlist_len = 5000;
max_userID = 1000;
max_itemID = 1000;
A = [randi(max_userID,A_nrows,1) randi(max_itemID,A_nrows,1) randi(5,A_nrows,1)];
smallUserIDList = randi(max_userID,IDlist_len,1);
smallItemIDList = randi(max_itemID,IDlist_len,1);

disp('---------------------------- With Original Approach')
tic
%//   Original posted code
toc

disp('---------------------------- With Proposed Approach')
tic
%//   Proposed approach code
toc

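The runtimes obtained with three sets of data sizes were -

Case 1:

Case 2:

Case 3:

Conclusion: the speedup of the proposed approach over the original loop code appears to be huge.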

Vanilla MATLAB: as far as I can tell, your code is equivalent to:

%// Create matrix such that: user_item_rating(user,item)==rating
user_item_rating = sparse(A(:,1),A(:,2),A(:,3));

%// Keep all BUT the items in smallItemIDList
user_item_rating(:,smallItemIDList) = [];

%// Keep only those users in `smallUserIDList` and use order of this list
user_item_rating = user_item_rating(smallUserIDList,:);

%// Count the number of ratings
userStat = sum(user_item_rating~=0, 2);

This works if there is at most one rating per user-item combination, and it should be fairly efficient.
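
As a hedged variant of the same idea, the sketch below builds a logical user-by-item matrix with explicit dimensions, so that IDs appearing only in smallUserIDList or smallItemIDList do not cause out-of-range indexing, and wraps the result in full to return an ordinary numeric column. It assumes all IDs are positive integers; the toy data are made up for illustration.

```matlab
% Hypothetical toy data (not taken from the question)
A = [1 10 5; 1 11 3; 1 10 4; 2 10 2; 2 12 1; 3 13 5];
smallUserIDList = [1; 3; 4];
smallItemIDList = [10; 12];

% Size the matrix so every requested ID is a valid index
nUsers = max([A(:,1); smallUserIDList(:)]);
nItems = max([A(:,2); smallItemIDList(:)]);

% rated(u,i) is true if user u rated item i at least once
% (sparse sums repeated (u,i) entries, hence the comparison with 0)
rated = sparse(A(:,1), A(:,2), 1, nUsers, nItems) > 0;

% Ignore the items in smallItemIDList
rated(:, smallItemIDList) = false;

% Count the remaining items per requested user, in smallUserIDList order
userStat = full(sum(rated(smallUserIDList, :), 2));
```

Setting the excluded columns to false instead of deleting them keeps the column indices equal to the item IDs, which can be convenient if the matrix is reused afterwards.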

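A clean approach without reinventing the wheel: check out grpstats from the Statistics Toolbox! The implementation could look similar to this: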

%// Create ratings table
ratings = array2table(A, 'VariableNames', {'user','item','rating'});

%// Remove items we don't care about (smallItemIDList)
ratings = ratings(~ismember(ratings.item, smallItemIDList),:);

%// Keep only users we care about (smallUserIDList) 
ratings = ratings(ismember(ratings.user, smallUserIDList),:);

%// Compute the statistics grouped by 'user'. 
userStat = grpstats(ratings, 'user');

It would be good if you added sample data and the expected output so that people can compare their answers.

I wonder whether it would help to put the computation in a loop inside a function - that way the optimizer will recognize that you only care about userStat and will not copy the other variables into the workspace.

Is it possible to have two entries with the same userID and the same itemID but different ratings? If not, just build a sparse matrix.

@kkuilla Hi! Good idea, I have added sample data and output to make the question clearer.

I like the use of tables, but generating all combinations of smallUserIDList and smallItemIDList is a bit overkill. By the way: even after correcting the superfluous parenthesis and the missing comma, the code doesn't work, because the matrices in the setdiff line do not have the same number of columns.

@alexandre iolov, hi! Thanks for your answer!! I tried your code and changed Asub = A(ismember(A(:,1), smallUserIDList),:); to Asub = A(ismember(A(:,1), smallUserIDList),1:2);, Var2 to Akeep2 and Var1 to Akeep1, and then it works!! With my sample data the result is userStat=12, which is slightly different from my expected output. Still, I learned a lot of new methods from your code, thank you very much.

@knedlsepp - thanks for the correction. Both matrices should have two columns, but I certainly did not try running the code, since I did not have examples of A, smallUserIDList and smallItemIDList and did not bother to invent plausible ones. How would you avoid the Cartesian product?

@alexandreiolov: take a look at my answer using tables. The ismember steps could also be applied in your answer. I think the final ratings variable should match your array2table(Akeep).

Hi! I have to say your code is so fast!! Using it, my data processing time dropped by at least 20 seconds!! Unbelievable. Actually, setdiff is just one branch of my code; I have another function that uses intersect instead of setdiff. Could you help with the case where setdiff is replaced by intersect? Thanks a lot!

@archenoo That's a great speedup!! Hmm, I'm not sure how that differs from the code in this question. How about posting it as a new question?

I posted another question related to this one and hope to get an answer, thanks!!

Looks like it works, and it's really fast! Note: wrap userStat with full to get a numeric array as the output of the sparse approach.

@knedlsepp Hi! Sorry for the late reply!! At first I tried your first code, but it did not work because I had accidentally typed a wrong variable name!! Now that I have found the mistake, I can thank you for your answer, as it really improved my running speed, cutting another 20 seconds!! Amazing!! Would you like to look at my other two questions? The first one is linked in the question, and the second is how to modify your answer if setIntersect = intersect(itemIDList_each, smallItemIDList); userStat(i) = length(setIntersect); is changed to userStat(i) = length(itemIDList_each);. Thanks again!!