Python: Efficiently delete rows containing elements duplicated across different rows

python arrays numpy

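Given a 2D array, a row at index i may contain one or more numbers that also appear in another row at index j. I need to remove both rows i and j from the array. Within any single row, the numbers are always unique. My solution is NumPy-based and avoids explicit loops over rows (it still uses a list comprehension over the duplicated values). Here is the only solution I came up with: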

import numpy as np

def filter_array(arr):
    # Flatten to 1D (a view, not a copy, for contiguous arrays)
    arr_1d = arr.ravel()
    # Count only the numbers that actually occur (faster than a histogram)
    u_elem, c = np.unique(arr_1d, return_counts=True)
    # Values that occur more than once anywhere in the array
    duplicates = u_elem[c > 1]
    # Flat indices of every occurrence of a duplicated value
    dup_idx = np.concatenate([np.where(arr_1d == d)[0] for d in duplicates])
    # Flat index -> row index; use the real column count instead of a hard-coded 9
    dup_rows = np.unique(dup_idx // arr.shape[1])
    # Remove those rows from the array
    b = np.delete(arr, dup_rows, axis=0)
    return b

Here is an (oversimplified) example of an input array:

a = np.array([
    [1, 3, 23, 40, 33],
    [2, 8, 5, 35, 7],
    [9, 32, 4, 6, 3],
    [72, 85, 32, 48, 53],
    [3, 98, 101, 589, 208],
    [343, 3223, 4043, 65, 78]
])

The filtered array gives the expected result, although I have not checked in detail whether this works in all possible cases:

[[   2    8    5   35    7]
 [ 343 3223 4043   65   78]]

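My typical arrays have about 10^5 to 10^6 rows and a fixed 9 columns. %timeit reports about 270 ms to filter each such array, and I have a hundred million of them. I am trying to speed this up on a single CPU before considering other approaches (e.g. GPUs).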

The data may already be in a pandas DataFrame.

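We can achieve a significant speed-up by using np.isin after finding the unique values and their counts, and then using the result to index the array: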

u, c = np.unique(a, return_counts=True)
a[np.isin(a, u[c == 1]).all(1)]

array([[   2,    8,    5,   35,    7],
       [ 343, 3223, 4043,   65,   78]])
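To make the one-liner easier to follow, here is a minimal sketch that splits it into named steps on the example array from the question. The intermediate names (`singles`, `is_single`, `keep`, `filtered`) are illustrative and not part of the original answer.

```python
import numpy as np

a = np.array([
    [1, 3, 23, 40, 33],
    [2, 8, 5, 35, 7],
    [9, 32, 4, 6, 3],
    [72, 85, 32, 48, 53],
    [3, 98, 101, 589, 208],
    [343, 3223, 4043, 65, 78],
])

# Unique values over the whole (flattened) array and how often each occurs.
u, c = np.unique(a, return_counts=True)

# Values that occur exactly once anywhere in the array.
singles = u[c == 1]

# Element-wise membership test; the result has the same shape as `a`.
is_single = np.isin(a, singles)

# A row survives only if every one of its elements occurs exactly once.
keep = is_single.all(axis=1)

# Boolean row mask -> filtered array.
filtered = a[keep]
print(keep)      # [False  True False False False  True]
print(filtered)
```

Rows 0, 2 and 4 share the value 3, and rows 2 and 3 share 32, so only rows 1 and 5 remain.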

Timings:

def filter_array(arr):
    arr_1d = arr.ravel()
    u_elem, c = np.unique(arr_1d, return_counts=True)
    duplicates = u_elem[c > 1]
    dup_idx = np.concatenate([np.where(arr_1d == d)[0] for d in duplicates])
    dup_rows = np.unique(dup_idx // arr.shape[1])  # column count, not a hard-coded 9
    b = np.delete(arr, dup_rows, axis=0)
    return b

def yatu(arr):
    u, c = np.unique(arr, return_counts=True)
    return arr[np.isin(arr, u[c == 1]).all(1)]

a_large = np.random.randint(0, 50_000, (10_000, 5))

%timeit filter_array(a_large)
# 433 ms ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit yatu(a_large)
# 7.81 ms ± 443 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
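As a possible further variation (not part of the original answer), the inverse indices returned by `np.unique` can map each element straight to its global count, which avoids the separate `np.isin` pass. The sketch below uses a hypothetical name, `filter_by_inverse`, and checks that it agrees with `yatu` on random data; whether it is actually faster is an open question to time on your own arrays.

```python
import numpy as np

def filter_by_inverse(arr):
    """Hypothetical variant: keep rows whose elements all occur once in arr."""
    u, inv, c = np.unique(arr, return_inverse=True, return_counts=True)
    # c[inv] gives, for every element, how often its value occurs overall.
    # The reshape copes with NumPy versions that return a flat inverse
    # as well as those that return it in the input's shape.
    counts = c[inv].reshape(arr.shape)
    return arr[(counts == 1).all(axis=1)]

def yatu(arr):
    # Reference implementation from the answer above.
    u, c = np.unique(arr, return_counts=True)
    return arr[np.isin(arr, u[c == 1]).all(1)]

# Seeded random data, same shape and value range as the benchmark array.
rng = np.random.default_rng(0)
a_test = rng.integers(0, 50_000, size=(10_000, 5))

# Both functions should select exactly the same rows.
same = np.array_equal(filter_by_inverse(a_test), yatu(a_test))
print(same)  # True
```

Note that random test data can repeat a value within a single row, unlike the real data described in the question; both functions then drop that row as well.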

Brilliant! I get the same speed-up on my actual data!