Python: Finding overlapping sets of columns/rows

python numpy


Background: this question takes an earlier question one step further.

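Suppose I have a 2D array whose columns are partitioned into several sets. For simplicity, we can assume that the array contains int values, as follows: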

np.random.randint(3,size=(2,10))   

# Column indices:
#       0  1  2  3  4  5  6  7  8  9                     
array([[0, 2, 2, 2, 1, 1, 0, 1, 1, 2],
       [1, 1, 0, 1, 1, 0, 2, 1, 1, 0]])

As an example of a partition of the column indices, we can choose the following:

# Partitioning the column indices of the previous array:

my_partition = {}
my_partition['first']  = [0,1,2]
my_partition['second'] = [3,4]
my_partition['third']  = [5,6,7]
my_partition['fourth'] = [8, 9]

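I want to find the groups of column-index sets that share a column with identical values. In the example above, some of these groups are: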

# The following sets include indices for a common column vector with values [2,0]^T
group['a'] = ['first', 'fourth'] 

# The following sets include indices for a common column vector with values [1,1]^T
group['b'] = ['second', 'third', 'fourth']

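I am interested in a solution that also works for arrays containing real values (for example, the values 1.0/2 and 1.0/2 are the same, i.e. 1.0/2==1.0/2 returns True).
I am aware of the potential limitations of floating-point precision, so for simplicity I plan to tackle the problem in two steps: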

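Assume two columns are the same if their values are identical

Assume two columns are the same if their values are close to each other (e.g. the norm of the vector difference is below a threshold)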
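The two notions of "same column" listed above can be sketched as follows. This is only an illustration: the helper names columns_identical and columns_close, the sample vectors, and the default threshold of 1e-3 are all assumptions made for the example, not part of the question.

```python
import numpy as np

def columns_identical(a, b):
    """Step 1: two columns match only if every entry is exactly equal."""
    return bool(np.array_equal(a, b))

def columns_close(a, b, threshold=1e-3):
    """Step 2: two columns match if the norm of their difference is below a threshold."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return bool(np.linalg.norm(diff) < threshold)

# Hypothetical sample columns, chosen only for illustration.
u = np.array([0.5, 2.0])
v = np.array([1.0 / 2, 2.0])   # exactly equal to u
w = np.array([0.5004, 2.0])    # differs from u by about 0.0004

print(columns_identical(u, v))  # exact match
print(columns_identical(u, w))  # not an exact match
print(columns_close(u, w))      # within the assumed threshold
```

Note that threshold-based matching is not transitive, so grouping under step 2 needs an extra rule for chains of nearly equal columns.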

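I tried to generalize the solution from the previous thread, but I am not sure it applies directly. I think it can handle the first problem (columns whose values are exactly identical), but the second problem may need a "bigger boat".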
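If you want to create a set-like data structure from the collection of columns, here is one way (I'm sure there are more efficient approaches for larger data):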
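A minimal sketch of one such approach, written for Python 3. It assumes the example array is stored in a variable named data (a hypothetical name) and keeps the dictionary name group used in the rest of the answer. Each column is converted to a tuple so it can be used as a dictionary key, and the indices of the columns with those values are collected in a list.

```python
import numpy as np

# The example array from the question; the name `data` is an assumption.
data = np.array([[0, 2, 2, 2, 1, 1, 0, 1, 1, 2],
                 [1, 1, 0, 1, 1, 0, 2, 1, 1, 0]])

# Map each distinct column (as a tuple) to the list of indices where it occurs.
group = {}
for idx in range(data.shape[1]):
    key = tuple(data[:, idx].tolist())      # tuples are hashable, arrays are not
    group.setdefault(key, []).append(idx)

print(group)
```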
Running this on the example array gives:

In [132]: group
Out[132]:
{(0, 1): [0],
 (0, 2): [6],
 (1, 0): [5],
 (1, 1): [4, 7, 8],
 (2, 0): [2, 9],
 (2, 1): [1, 3]}
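Since a numpy.ndarray is not hashable (just like a list), the columns themselves cannot serve as dict keys. I chose to simply use the tuple equivalent of each column, but there are many other options.

I also assumed that you want a list of column indices for each group. If so, you could consider using a defaultdict instead of a regular dict, but many other containers would also work for storing the column indices.

Updated

I believe I understand the question better now: given an arbitrary collection of predefined column groups, how do you determine whether any two given groups contain a common column?

If we assume you have already built the set-like structure from my answer above, you can take the groups two at a time, look at their constituent columns, and ask whether any of those columns end up in the same part of the set dictionary.

Suppose we define:

my_partition['first']  = [0,1,2]
my_partition['second'] = [3,4]
my_partition['third']  = [5,6,7]
my_partition['fourth'] = [8, 9]

# Python 2 syntax (dict.iteritems, print statement).
# Define a helper to back out the column that serves as a key for the set-like structure.
# Take the 0th element; a column index should only be part of one subset.
get_key = lambda x: [k for k,v in group.iteritems() if x in v][0]

# use itertools
import itertools

# Print out the common columns between each pair of groups.
for pair_x, pair_y in itertools.combinations(my_partition.keys(), 2):
    print pair_x, pair_y, (set(map(get_key, my_partition[pair_x])) &
                           set(map(get_key, my_partition[pair_y])))

Whenever the result is not the empty set, some column is shared between the two groups.

Running it on your problem:

In [163]: for pair_x, pair_y in itertools.combinations(my_partition.keys(), 2):
    print pair_x, pair_y, set(map(get_key, my_partition[pair_x])) & set(map(get_key, my_partition[pair_y]))
   .....:
second fourth set([(1, 1)])
second third set([(1, 1)])
second first set([(2, 1)])
fourth third set([(1, 1)])
fourth first set([(2, 0)])
third first set([])

Your description is confusing. I don't see how the first and fourth columns could have [2,0]^T. The first column looks like [2,1]^T, but the fourth column has [1,1]^T.

@EMS I added more comments. Let me know if it is still unclear.

Floating-point arithmetic can be tricky... Suppose you set the equality threshold to 0.001. Now imagine partition A has the item [2.0,0.0], partition B has the item [2.0009,0.0], and partition C has the item [2.0011,0.0]. Do you want to create a single group with A, B and C? Or would you rather have two groups, A and B, and B and C?

Yes, it is still unclear. The column with index 1 has values [2,1]^T and the column with index 4 has values [1,1]^T. That makes the "group['a'] = ['first', 'fourth']" part hard to understand, since neither of those columns has values [2,0]^T.

You seem to be saying that you are given a set of predefined column groups. For each pair of predefined groups, you need to know whether the groups contain a common column.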
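For readers on Python 3, the pairwise check from the answer can be sketched as below. This is a rewrite under stated assumptions rather than the answerer's code: the variable name data, the helper column_keys, and the result dictionary common are hypothetical names introduced for the example, and the column tuples are computed directly from the array instead of being looked up in group.

```python
import itertools
import numpy as np

# Example array and partition from the question; `data` is an assumed name.
data = np.array([[0, 2, 2, 2, 1, 1, 0, 1, 1, 2],
                 [1, 1, 0, 1, 1, 0, 2, 1, 1, 0]])

my_partition = {
    'first':  [0, 1, 2],
    'second': [3, 4],
    'third':  [5, 6, 7],
    'fourth': [8, 9],
}

def column_keys(indices):
    """Return the set of distinct columns (as tuples) found at the given indices."""
    return {tuple(data[:, i].tolist()) for i in indices}

# For every pair of partition names, record the columns they have in common.
common = {}
for name_x, name_y in itertools.combinations(my_partition, 2):
    shared = column_keys(my_partition[name_x]) & column_keys(my_partition[name_y])
    common[(name_x, name_y)] = shared

for pair, shared in common.items():
    print(pair, shared)
```

A non-empty set for a pair means the two partitions share at least one column with exactly equal values; matching within a tolerance would need a different key, which this sketch does not attempt.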