Python: comparing a 2D numpy array with a 1D numpy array
Tags: python, arrays, numpy

I have a numpy array a of shape (m1, m2) with string entries. I compare the entries of this array a with a one-dimensional numpy array (arr) that contains strings. The 1D array arr has shape (n,), where n is a large number (~10000).

Examples of the arrays a and arr were linked in the original question.

This is how I compare arr with the rows of a. If an element of arr is found in any row of a, I save the index of that element in arr in a new list (comp+str(i).zfill(5)):

import pandas as pd
import numpy as np
a=pd.read_csv('file.txt',error_bad_lines=False,sep=r'\s+',header=None).values[:,1::].astype(' ...

I couldn't reliably read your file.txt, so I used a small part of it.
I turned 'arr' into a dictionary called 'lu', with the text as keys and the index positions as values.
In [132]: a=np.array([['onecut2', 'ttc14', 'zadh2', 'pygm', 'tiparp', 'mgat4a', 'man2a1', 'zswim5', 'tubd1', 'igf2bp3'],
...: ['pou2af1', 'slc25a12', 'zbtb25', 'unk', 'aif1', 'tmem54', 'apaf1', 'dok2', 'fam60a', 'rab4b'],
...: ['rara', 'kcnk4', 'gfer', 'trip10', 'cog6', 'srebf1', 'zgpat', 'rxrb', 'clcf1', 'fyttd1'],
...: ['rarb', 'kcnk4', 'gfer', 'trip10', 'cog6', 'srebf1', 'zgpat', 'rxrb', 'clcf1', 'fyttd1'],
...: ['rarg', 'kcnk4', 'gfer', 'trip10', 'cog6', 'srebf1', 'zgpat', 'rxrb', 'clcf1', 'fyttd1'],
...: ['pou5f1', 'slc25a12', 'zbtb25', 'unk', 'aif1', 'tmem54', 'apaf1', 'dok2', 'fam60a', 'rab4b'],
...: ['apc', 'rab34', 'lsm3', 'calm2', 'rbl1', 'gapdh', 'prkce', 'rrm1', 'irf4', 'actr1b']])
In [133]: def do_analysis(src, lu):
     ...:     res={}   # Initialise result to an empty dictionary
     ...:     for r, row in enumerate(src):
     ...:         temp_list=[]   # list to append results to in the inner loop
     ...:         for txt in row:
     ...:             exists=lu.get(txt, -1)   # lu returns the index of txt in arr, or -1 if not found.
     ...:             if exists>=0: temp_list.append(exists)   # If txt was found in arr, append its index to temp_list
     ...:         res['comp'+str(r).zfill(5)]=temp_list
     ...:         # Once analysis of the row has finished, store the list in the res dictionary
     ...:     return res
In [134]: lu=dict(zip(arr, range(len(arr))))
# Turn the array 'arr' into a dictionary which returns the index of the corresponding text.
In [135]: lu
Out[135]:
{'pycrl': 0, 'gpr180': 1, 'gpr182': 2, 'gpr183': 3, 'neurl2': 4,
...
'hcn2': 999, ...}
In [136]: do_analysis(a, lu)
Out[136]:
{'comp00000': [6555, 3682, 7282, 1868, 5522, 9128, 1674, 8695, 156],
'comp00001': [6006, 3846, 8185, 8713, 5806, 4912, 597, 7565, 3003],
'comp00002': [9355, 3717, 1654, 2386, 6309, 7396, 3825, 2135, 6596, 7256],
'comp00003': [9356, 3717, 1654, 2386, 6309, 7396, 3825, 2135, 6596, 7256],
'comp00004': [9358, 3717, 1654, 2386, 6309, 7396, 3825, 2135, 6596, 7256],
'comp00005': [6006, 3846, 8185, 8713, 5806, 4912, 597, 7565, 3003],
'comp00006': [8916, 8588, 2419, 3656, 9015, 7045, 7628, 5519, 8793, 1946]}
In [137]: %timeit do_analysis(a, lu)
47.9 µs ± 80.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
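Your file is about 262000 bytes. My array a is 462 bytes and takes about 48 µs, so scaling linearly with size gives a rough estimate for the whole file: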
In [138]: 262000 / 462 * 48 / 1000000
Out[138]: 0.0272 seconds
If the array 'a' is a list of lists, the analysis runs twice as fast as when 'a' is a numpy array.
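The lookup approach, together with the list-of-lists conversion, might be sketched in a self-contained form as follows. The small arr and a below are made-up stand-ins for the real data, and find_indices is a hypothetical name used only for this illustration.

```python
import numpy as np

# Hypothetical stand-in data; the real arr has ~10000 entries.
arr = np.array(['gfer', 'unk', 'rara', 'apc', 'pygm'])
a = np.array([['rara', 'kcnk4', 'gfer'],
              ['apc', 'rab34', 'lsm3'],
              ['pou2af1', 'unk', 'aif1']])

# Map each string in arr to its index.
# If arr contained duplicates, the last index would win.
lu = {txt: i for i, txt in enumerate(arr.tolist())}

def find_indices(src, lu):
    """For each row of src, list the arr indices of the entries found in arr."""
    res = {}
    for r, row in enumerate(src):
        res['comp' + str(r).zfill(5)] = [lu[txt] for txt in row if txt in lu]
    return res

# a.tolist() yields plain Python strings in nested lists,
# so the loop does not iterate over numpy string scalars.
result = find_indices(a.tolist(), lu)
print(result)
```

The indices in each list follow the order of the entries in the row of a, not the order of arr. Whether the conversion actually helps on the real data is worth checking with %timeit.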
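I hope this does what you need or points you in the right direction.

Could you add minimal representative data? What comparison do you want to make?
For starters, you could compute set(a[i,:]) and globals()['comp'+str(i).zfill(5)] only once for each value of i, rather than once for each value of j.
@Divakar: I have now added representative data.
@kevinkayaks: In each row of a, I want to find out whether elements of arr are present, and if so, I want to find the indices of those arr elements.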
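The suggestion in the comments, building the row set and the result name once per i instead of once per j, could look roughly like the sketch below. The question's original loop is truncated, so its exact shape is an assumption here. The data is made up, and a plain dictionary (comps, a hypothetical name) stands in for the globals() lookup.

```python
import numpy as np

# Hypothetical stand-in data.
arr = np.array(['gfer', 'unk', 'rara', 'apc', 'pygm'])
a = np.array([['rara', 'kcnk4', 'gfer'],
              ['apc', 'rab34', 'lsm3']])

arr_list = arr.tolist()   # converted once, outside both loops
comps = {}                # used here instead of globals()

for i in range(a.shape[0]):
    row_set = set(a[i, :].tolist())    # built once per row i
    name = 'comp' + str(i).zfill(5)    # also built once per row i
    # The inner loop over j only does a set membership test.
    comps[name] = [j for j, txt in enumerate(arr_list) if txt in row_set]

print(comps)
```

In this variant the indices come out in the order of arr, since the inner loop walks arr rather than the row.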