
Arrays: Find all indices of all repeated elements in an array

arrays performance matlab


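Given an array of integers, find all indices of all the elements that are repeated in it.

For example, consider A = [4, 12, 9, 8, 9, 12, 7, 1]. Since 12 and 9 have duplicates, all of their indices should be returned, i.e. d = [2, 3, 5, 6]. The length of A is less than 200, and its integer elements lie between 1 and 5000.

I am currently using the following function. However, it is too slow for my needs. Is there any possible performance improvement?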

function d = fincDuplicates(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d = [];
    for i = 1:numel(an)
        d = [d, find(A==an(i))];
    end
end

Here is one solution (adapted from another source):
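The same approach appears in the benchmark further down as findDuplicates_histcountMethod. Reassembled for the example array from the question, it would look roughly like the sketch below; the statements are taken from that benchmark function, and only the comments are added here.

```matlab
A = [4, 12, 9, 8, 9, 12, 7, 1];

% idxc maps every element of A to the position of its value in unique(A)
[~, idxu, idxc] = unique(A);

% count how often each unique value occurs; idxcount is the bin of each element
[count, ~, idxcount] = histcounts(idxc, numel(idxu));

% keep the elements whose value occurs more than once
idxkeep = count(idxcount) > 1;
idx_A = 1:length(A);
idx_dup = idx_A(idxkeep);
```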

It gives the following:

>> idx_dup = idx_A(idxkeep)

idx_dup =

     2     3     5     6

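I am not sure whether it is more efficient than your current solution. You may need to test it with your real data.

Edits:

1: Fixed the code for the edge case highlighted in the comments, and updated the benchmark.

2: Added the "expansion" solution to the benchmark (the maximum number of elements N had to be reduced to 20000).

3: Added the accumarray method (the winner for high N) and the sparse method to the benchmark.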

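Here is another way to get the result, without using the functions unique or hist. It relies on the function sort instead.

In expanded form (if you want to see the results of the intermediate steps):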
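A step-by-step sketch of these intermediate stages, using the example array from the question, might read as follows. The variable names isRepeat and lastOfRun are illustrative choices made for this sketch; the logic mirrors the compact version.

```matlab
A = [4, 12, 9, 8, 9, 12, 7, 1];

% Sort the values and remember where each one came from
[B, I] = sort(A);              % B = [1 4 7 8 9 9 12 12], I = [8 1 7 4 3 5 2 6]

% A zero difference means B(k) equals B(k+1)
isRepeat = diff(B) == 0;
dx = find(isRepeat);           % positions in B that are followed by an equal value

if ~isempty(dx)
    % Last position of each run of consecutive dx values; its successor
    % is the final member of that group of equal values
    lastOfRun = dx([diff(dx) ~= 1, true]);
    d = I([dx, lastOfRun + 1]);  % map back to indices in A
else
    d = [];
end
```

For this input, dx is [5 7], lastOfRun + 1 is [6 8], and d comes out as [3 2 5 6].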

You can condense it to:

[B,I] = sort(A) ;
dx = find(diff(B)==0) ;
if ~isempty(dx)
    d = I([dx dx([diff(dx)~=1,true])+1]) ;
else
    d = [] ;
end

This gives:

d =
     3     2     5     6

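Personally, I would also sort the returned indices, but if that is not necessary and you are worried about performance, you can simply accept the unsorted result.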
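If sorted indices are wanted, one extra call to sort on the (short) result is enough. A minimal sketch, starting from the unsorted output shown above:

```matlab
d = [3 2 5 6];        % unsorted result of the sort-based method
dSorted = sort(d);    % ascending indices into A
```

This costs an additional sort over the duplicate indices only, not over the whole array.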

Here is another benchmark (testing from 10 to 20000 elements):

Run on MATLAB R2016a.

And its code:

function ExecTimes = benchmark_findDuplicates

nOrder = (1:9).' * 10.^(1:3) ; nOrder = [nOrder(:) ; 10000 ; 20000 ] ;
npt = numel(nOrder) ;

ExecTimes = zeros(npt,6) ;

for k = 1:npt
    % Sample data
    N = nOrder(k) ;
    A = randi(5000,[1,N]) ;

    % Benchmark
    f1 = @() findDuplicates_histMethod(A) ;
    f2 = @() findDuplicates_histcountMethod(A) ;
    f3 = @() findDuplicates_sortMethod(A) ;
    f4 = @() findDuplicates_expansionMethod(A) ;
    f5 = @() findDuplicates_accumarrayMethod(A) ;
    f6 = @() findDuplicates_sparseMethod(A) ;
    ExecTimes(k,1) = timeit( f1 ) ;
    ExecTimes(k,2) = timeit( f2 ) ;
    ExecTimes(k,3) = timeit( f3 ) ;
    ExecTimes(k,4) = timeit( f4 ) ;
    ExecTimes(k,5) = timeit( f5 ) ;
    ExecTimes(k,6) = timeit( f6 ) ;

    clear A
    disp(N)
end

function d = findDuplicates_histMethod(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

function d = findDuplicates_histcountMethod(A)
    [~,idxu,idxc] = unique(A);
    [count, ~, idxcount] = histcounts(idxc,numel(idxu));
    idxkeep = count(idxcount)>1;
    idx_A = 1:length(A);
    d = idx_A(idxkeep);
end

function d = findDuplicates_sortMethod(A)
    [B,I] = sort(A) ;
    dx = find(diff(B)==0) ;
    if ~isempty(dx)
        d = I([dx dx([diff(dx)~=1,true])+1]) ;
    else
        d=[];
    end
end

function d = findDuplicates_expansionMethod(A)
    Ae = ones(numel(A),1) * A ;
    d = find(sum(Ae==Ae.')>1) ;
end

function d = findDuplicates_accumarrayMethod(A)
    d = find(ismember(A, find(accumarray(A(:), 1)>1))) ;
end

function d = findDuplicates_sparseMethod(A)
    d = find(ismember(A, find(sparse(A, 1, 1)>1)));
end

end

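For completeness, here are the results of the other answers compared with yours, together with my attempt at speeding yours up (I was working on it before better people came to the rescue). For the sizes in your question: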

for ii=1:100
    a=randi(5000,1,200);
    t1(ii)=timeit(@()yours(a));

    a=randi(5000,1,200);
    t2(ii)=timeit(@()faster(a));

    a=randi(5000,1,200);
    t3(ii)=timeit(@()hoki(a));

    a=randi(5000,1,200);
    t4(ii)=timeit(@()am304(a));
end
% Speed-up of each variant relative to yours(), averaged over the 100 runs
disp(['Faster: x', num2str(mean(t1)/mean(t2))])
disp(['hoki: x', num2str(mean(t1)/mean(t3))])
disp(['am304: x', num2str(mean(t1)/mean(t4))])

function d = yours(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

function d = faster(A)
    [co] = histcounts(A,max(A));
    an = co>1;
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

function res=am304(A)
[~,idxu,idxc] = unique(A);
[count, ~, idxcount] = histcounts(idxc,numel(idxu));
idxkeep = count(idxcount)>1;
idx_A = 1:length(A);
res = idx_A(idxkeep);
end

function res=hoki(A)
[B,I] = sort(A) ;
dx = find(diff(B)==0) ;
res = I([dx dx+1]) ;
end

The results are:

Faster: x0.0054505
hoki: x7.4142
am304: x1.0881

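My "faster" version fails miserably in this case.

I knew Hoki's answer would be the fastest for large arrays, but it is also much faster for small ones: depending on the size and range of A, it is 2 to 30 times faster than am304's answer (which I have now added to the comparison).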
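Because the ratios depend on the size and range of A, it can be worth timing candidates on an array shaped like your own data. A minimal sketch, assuming the question's dimensions (200 elements, values 1 to 5000) and timing two of the one-line methods from the benchmark on the same input; the handle names fAccum and fSparse are illustrative:

```matlab
% One shared input so both candidates see identical data
a = randi(5000, 1, 200);

% Wrap each candidate in a zero-argument handle, as timeit requires
fAccum  = @() find(ismember(a, find(accumarray(a(:), 1) > 1)));
fSparse = @() find(ismember(a, find(sparse(a, 1, 1) > 1)));

tAccum  = timeit(fAccum);    % typical run time in seconds
tSparse = timeit(fSparse);

fprintf('accumarray: %.3g s, sparse: %.3g s\n', tAccum, tSparse);
```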

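I am a bit late to the party, but this question needs an accumarray-based solution :-)

d = find(ismember(A, find(accumarray(A(:), 1)>1)));

This exploits the fact that A contains small positive integers, so they can be interpreted as indices. It works as follows:

                       accumarray(A(:), 1)        % count of occurrences of each value
                  find(                   >1)     % values occurring more than once
d = find(ismember(A,                         ));  % their positions in A

Alternatively, sparse can be used instead of accumarray:

d = find(ismember(A, find(sparse(A, 1, 1)>1)));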
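To make the three stages concrete, here they are written out on the example array from the question. This is only an unrolled sketch of the one-liner; the names counts and dupValues are introduced here for illustration.

```matlab
A = [4, 12, 9, 8, 9, 12, 7, 1];

% counts(v) is the number of times the value v occurs in A
counts = accumarray(A(:), 1);        % 12-by-1 vector, since max(A) is 12

% values that occur more than once
dupValues = find(counts > 1);        % [9; 12]

% positions in A holding one of those values
d = find(ismember(A, dupValues));    % [2 3 5 6]

% same result with sparse doing the counting
dSparse = find(ismember(A, find(sparse(A, 1, 1) > 1)));
```

Note that the count vector has max(A) entries, which is harmless here because the values are bounded by 5000.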

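The profiler shows that this function uses 60% of the total time (sorry, I cannot share my full code). The functions unique and hist are the main culprits. I call hist here to find the frequency of each unique element in the array, and then look up the indices of the elements whose frequency is greater than 1.

This is a very short but very memory-hungry solution, for length(A) < 200:
d = find(sum(A==A.')>1)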
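On releases without implicit expansion, the same pairwise comparison can be written with bsxfun. A minimal sketch, assuming A is a row vector as in the question; eqMatrix is an illustrative name:

```matlab
A = [4, 12, 9, 8, 9, 12, 7, 1];

% N-by-N logical matrix: eqMatrix(i,j) is true when A(i) equals A(j)
eqMatrix = bsxfun(@eq, A, A.');

% Each column sum counts how many elements equal A(j), itself included
d = find(sum(eqMatrix, 1) > 1);
```

The comparison matrix has numel(A)^2 entries, which is why this approach only suits short arrays.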
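@obchardon Your solution is the best so far. When length(A) [...]

@obchardon I keep getting "Error using ==, Matrix dimensions must agree." if I try your method. What am I missing? Does it rely on implicit expansion?

@Hoki It does involve implicit expansion (MATLAB R2016b and later / Octave 3.6.0 and later).

Unfortunately, that solution now uses about 70% of the total time.

@impopularGuy Oh well, it was worth a try. Is the total time the same? If the total time dropped substantially, 70% of a shorter time may still be faster than 60% of a longer one.

My implementation took about 15 seconds, yours took about 19 seconds.

@impopularGuy That is odd, my measurements show exactly the opposite, as you can see in my answer. It must depend on the array size.

However, that measurement cannot be trusted. It would be better if we generated a new random array for each individual function call.

@impopularGuy I have not posted any specific timings, so you can copy and paste the 3 lines yourself ;) By the way, the results are still the same.

@impopularGuy If you generate a new random array for each individual function call, you are not comparing like with like. To compare the performance of functions, they need to operate on the same data, otherwise the comparison is meaningless.

@am304 It is repeated 100 times now, so hopefully the differences average out.

@am304 Of course they need to operate on the same data. I probably did not explain myself properly.

There is a big flaw here. For A=[1,2,4,1,2,5,1,2,6] the answer is d=[1,4,7,2,5,8], but it returns d=[1,4,2,5,4,7,5,8]. Funny that we started benchmarking without checking all the test cases.

Ouch, those are the right indices... but repeated as well. Damn. Running unique on the output would certainly hurt performance. I cannot try it right now; feel free to edit if you want.

@impopularGuy Corrected. With the unique fix, performance for very large numbers of elements was not as good as the histcount method, but by reusing the same trick to detect duplicates both in the original array and in the array of duplicate indices, I managed to bring the execution time below that of the other solutions.

Could you add my accumarray-based solution to the benchmark?

@LuisMendo Done. As usual, you provide the top solution for high N ;-)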
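The edge case from this exchange makes a handy regression check before any timing is done. The sketch below runs the first version of the sort-based method and the corrected one on that array, and compares the corrected output with the accumarray one-liner; the names dOld, dNew and dRef are illustrative.

```matlab
A = [1, 2, 4, 1, 2, 5, 1, 2, 6];   % values 1 and 2 each occur three times

[B, I] = sort(A);
dx = find(diff(B) == 0);

% First version: indices in the middle of a run are emitted twice
dOld = I([dx dx+1]);

% Corrected version: only the end of each run contributes the extra index
if ~isempty(dx)
    dNew = I([dx dx([diff(dx)~=1, true])+1]);
else
    dNew = [];
end

% Independent reference using the accumarray approach
dRef = find(ismember(A, find(accumarray(A(:), 1) > 1)));

fprintf('old: %d indices, new: %d indices, matches reference: %d\n', ...
        numel(dOld), numel(dNew), isequal(sort(dNew), dRef));
```

Here dOld has 8 entries while dNew has the expected 6, and the sorted dNew agrees with the reference.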