Julia: Detect and remove duplicate rows from an array?


What is the best way to detect and remove duplicate rows from an array in Julia?

x = Integer.(round.(10 .* rand(1000,4)))

# In R I would apply the duplicated function.
x = x[duplicated(x),:]

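This is what you are looking for: (This does not answer the detection part of the question.)

For the detection part, a dirty patch would be to edit:

to:

Alternatively, you can define your own unique2 with the above changes included: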

using Base.Cartesian
import Base.Prehashed

@generated function unique2(A::AbstractArray{T,N}, dim::Int) where {T,N}
......
end

julia> y, idx = unique2(x, 1)

julia> y
960×4 Array{Int64,2}:
  8   3   1   5
  8   3   1   6
  1   1   0   1
  8  10   1  10
  9   1   8   7
  ⋮ 

julia> setdiff(1:1000, idx)
40-element Array{Int64,1}:
  99
 120
 132
 140
 216
 227
  ⋮

The benchmarks on my machine are:

x = rand(1:10,1000,4) # 48 dups
@btime unique2($x, 1); 
124.342 μs (31 allocations: 145.97 KiB)
@btime duplicated($x);
407.809 μs (9325 allocations: 394.78 KiB) 

x = rand(1:4,1000,4) # 751 dups
@btime unique2($x, 1);
66.062 μs (25 allocations: 50.30 KiB)
@btime duplicated($x);
222.337 μs (4851 allocations: 237.88 KiB)

The results show that the convoluted metaprogramming and hash-table approach in Base benefits greatly from lower memory allocation.

You could also go with:

duplicated(x) = foldl(
  (d,y)->(x[y,:] in d[1] ? (d[1],push!(d[2],y)) : (push!(d[1],x[y,:]),d[2])), 
  (Set(), Vector{Int}()), 
  1:size(x,1))[2]

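This collects a set of rows seen so far and outputs the indices of rows that have already been seen. That is basically the minimum effort needed to get the result, so it should be fast.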

julia> x = rand(1:2,5,2)
5×2 Array{Int64,2}:
 2  1
 1  2
 1  2
 1  1
 1  1

julia> duplicated(x)
2-element Array{Int64,1}:
 3
 5

julia> x[duplicated(x),:]
2×2 Array{Int64,2}:
 1  2
 1  1
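The foldl one-liner above can also be written as an explicit loop that returns a Boolean mask, which is closer to what R's duplicated returns. This is only a sketch: duplicated_mask is a hypothetical name that does not appear in the original answers, and it assumes the input is an ordinary matrix whose rows can be hashed.

```julia
# Hypothetical helper (not from the original answers): mark every row that
# repeats an earlier row, similar in spirit to R's duplicated().
function duplicated_mask(x::AbstractMatrix{T}) where {T}
    seen = Set{Vector{T}}()        # rows encountered so far
    mask = falses(size(x, 1))      # mask[i] == true means row i was seen before
    for i in axes(x, 1)
        row = x[i, :]              # copy of row i
        if row in seen
            mask[i] = true
        else
            push!(seen, row)
        end
    end
    return mask
end

x = [2 1; 1 2; 1 2; 1 1; 1 1]
mask = duplicated_mask(x)
dup_idx = findall(mask)            # indices of the repeated rows
deduped = x[.!mask, :]             # first occurrence of each row, order preserved
```

With a mask, x[mask, :] shows the repeated rows and x[.!mask, :] drops them, so detection and removal come from the same pass over the data.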

In Julia v1.4 and later, you need to type unique(a, dims=1), where a is your N×2 array.

julia> a=[2 2 ; 2 2; 1 2; 3 1]
4×2 Array{Int64,2}:
 2  2
 2  2
 1  2
 3  1

julia> unique(a,dims=1)
3×2 Array{Int64,2}:
 2  2
 1  2
 3  1
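unique(a, dims=1) removes the duplicates but does not say which rows were dropped. One possible way to recover that, sketched here with only Base functions, is the unique(f, itr) method applied to the row indices. The names keep and dups are illustrative and not part of the original answer.

```julia
a = [2 2; 2 2; 1 2; 3 1]

# unique(f, itr) keeps one element of itr per distinct value of f, so this
# returns the index of the first occurrence of each distinct row.
keep = unique(i -> a[i, :], axes(a, 1))

# Every index that was not kept belongs to a duplicate row.
dups = setdiff(axes(a, 1), keep)

# Indexing with keep gives the same rows as unique(a, dims=1).
deduped = a[keep, :]
```

Note that a[i, :] allocates a copy of each row, so this is likely slower than unique(a, dims=1) when only the deduplicated array is needed.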

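You don't need the . before the *; x = Integer.(round.(10*rand(1000,4))) works just the same.

When I switch to your method, I get about a 10% slowdown:

@time for i in 1:10000; Integer.(round.(10 .* rand(1000,4))); end
0.991773 seconds (285.27 k allocations: 622.844 MiB, 5.71% gc time)

versus

@time for i in 1:10000; Integer.(round.(10*rand(1000,4))); end
1.073937 seconds (305.08 k allocations: 928.775 MiB, 8.10% gc time)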

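This is interesting! :-) @fsmart, did you actually define unique2 in full, following @Gnimuc's answer? Could you add it? I would like to benchmark it.

You may also want to use x = rand(1:10, 1000, 4), since the computation you are doing is slower, is not uniform, and may produce a different range than intended (0, 1, 2, …, 10).

I like this solution a lot; foldl was new to me. One small comment: DataFrames has a nonunique function that does this. In fact, unique for DataFrame rows is implemented using that nonunique function. So if this is for public code, nonunique may be a better name than duplicated.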
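The remark about the test data being non-uniform can be checked with a short experiment. This is a sketch with an arbitrarily chosen sample size n and a hypothetical helper named counts. Rounding 10 .* rand() sends only [0, 0.5) to 0 and only [9.5, 10) to 10, so the two endpoints should show up about half as often as the values 1 through 9, whereas rand(0:10, n) draws all eleven values with equal probability.

```julia
n = 100_000                                  # sample size, chosen arbitrarily

rounded = Integer.(round.(10 .* rand(n)))    # the construction used in the question
uniform = rand(0:10, n)                      # direct uniform draw over 0..10

# Hypothetical helper: how often each of the values 0..10 occurs in v.
counts(v) = [count(y -> y == k, v) for k in 0:10]

rounded_counts = counts(rounded)             # first and last entries tend to be smaller
uniform_counts = counts(uniform)             # all entries are of similar size
```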