Applying numpy.searchsorted to an array loaded from a text file with numpy.loadtxt


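I am currently working on a bioinformatics project and need to solve the following problem.

I have a text file "chr1.txt" with two columns: a position on the chromosome and a boolean variable, true or false.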

0 false
10000 true
10001 true
10005 false
10007 true
10011 false
10013 true
10017 false
10019 false
10023 false
10025 true
10029 true
10031 false
10035 true
10037 false
..
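This data means that the region from 0 to 10000 is repetitive (= not mappable --> false), the region from 10000 to 10005 is unique (= mappable --> true), the region from 10005 to 10007 is repetitive again, and so on. The file ends at position 248'946'406 and has 15'948'271 lines. To find a general solution to the problem, I would like to restrict the file to the lines shown above.

I want to load this text file into a numpy array with two columns. For this I used numpy.loadtxt: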

import numpy as np

# Field names match the dtype shown in the output below
with open('chr1.txt', 'r') as f:
    chr1 = np.loadtxt(f, dtype={'names': ('position start', 'mappable'),
                                'formats': ('i4', 'S1')})

Here is the output:

In [39]: chr1
Out[39]: 
array([(0, b'f'), (10000, b't'), (10001, b't'), (10005, b'f'),
       (10007, b't'), (10011, b'f'), (10013, b't'), (10017, b'f'),
       (10019, b'f'), (10023, b'f'), (10025, b't'), (10029, b't'),
       (10031, b'f'), (10035, b't'), (10037, b'f')], 
      dtype=[('position start', '<i4'), ('mappable', 'S1')])

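Now I want to apply numpy.searchsorted to the first column of the array to determine whether my genome is uniquely mappable at a given position. In this example, the output I want is 5 (the index of the element (10011, b'f') in the array). If I try to extract an array that contains only the first column, the positions, I get an error: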

In [21]: chr1[:,0]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-21-a63d052f1c5d> in <module>()
----> 1 chr1[:,0]

IndexError: too many indices for array

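So how can I extract only the positions from the existing array and apply searchsorted to them? Or should I load the text file into the array differently, so that it has two columns, the first of integer type and the second of boolean type?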

extracted_array=[0,10000,10001,10005,10007,10011,10013,10017,10019,10023,10025,10029,10031,10035,10037]
np.searchsorted(extracted_array,10012)-1
Out[58]: 5

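I would then use the index found to check whether the second field is true or false, which tells me whether the position lies within a mappable region.

Thank you very much for your help.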

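We can use chr1['position start'] to extract the data of the 'position start' field, and the second field can be extracted the same way. Comparing that second field against 't' (b't' under Python 3, where the S1 field holds bytes) then gives a boolean array of the mappable entries.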
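To illustrate why chr1[:, 0] fails while field access works, here is a minimal sketch. It builds a four-row stand-in for chr1 by hand instead of reading chr1.txt (that shortcut is an assumption made to keep the example self-contained). A structured array like this has a single axis of records, so there is no second axis to index into.

```python
import numpy as np

# Hand-built stand-in for the array that np.loadtxt returns in the question.
dtype = [('position start', 'i4'), ('mappable', 'S1')]
chr1 = np.array([(0, b'f'), (10000, b't'), (10001, b't'), (10005, b'f')],
                dtype=dtype)

# One axis only: each element is a (position, flag) record, not a row of 2 columns.
print(chr1.ndim, chr1.shape)

# Fields are selected by name, which returns ordinary 1-D arrays.
positions = chr1['position start']
mask = chr1['mappable'] == b't'   # bytes literal, because the field dtype is S1
print(positions)
print(mask)

# Indexing a second axis fails because that axis does not exist.
try:
    chr1[:, 0]
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```

Field access returns a view into the records rather than a copy, so it should stay cheap even for a file with millions of lines.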

So, we would have an approach like this:

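indx = chr1['position start']      # sorted start positions
mask = chr1['mappable'] == b't'    # the S1 field holds bytes, so compare with b't'
rand_num = np.random.randint(10000, 10037)
# Index of the last start position strictly below rand_num
matched_indx = np.searchsorted(indx, rand_num) - 1

if mask[matched_indx]:
    print("It is mappable!")
else:
    print("It is NOT mappable!")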

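1) Get the data and the mask/boolean array (session recorded under Python 2, hence 'f' rather than b'f' in the output):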

In [283]: chr1   # Input array
Out[283]: 
array([(    0, 'f'), (10000, 't'), (10001, 't'), (10005, 'f'),
       (10007, 't'), (10011, 'f'), (10013, 't'), (10017, 'f'),
       (10019, 'f'), (10023, 'f'), (10025, 't'), (10029, 't'),
       (10031, 'f'), (10035, 't'), (10037, 'f')], 
      dtype=[('position start', '<i4'), ('mappable', 'S1')])

In [284]: indx = chr1['position start']
     ...: mask = chr1['mappable']=='t'
     ...: 

In [285]: indx
Out[285]: 
array([    0, 10000, 10001, 10005, 10007, 10011, 10013, 10017, 10019,
       10023, 10025, 10029, 10031, 10035, 10037], dtype=int32)

In [286]: mask
Out[286]: 
array([False,  True,  True, False,  True, False,  True, False, False,
       False,  True,  True, False,  True, False], dtype=bool)
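The question also asks whether the file could be loaded so that the second column is boolean from the start. One possible sketch is to read each column separately with the usecols argument of np.loadtxt and compare the text column against a literal. The sample text and its "true"/"false" spelling are assumptions standing in for chr1.txt, not something the thread confirms.

```python
import io
import numpy as np

# Hypothetical stand-in for the first lines of chr1.txt; the exact spelling
# of the second column ("true"/"false") is an assumption.
text = """0 false
10000 true
10001 true
10005 false
10007 true
10011 false
"""

# Read each column on its own so that each gets a plain, homogeneous dtype.
positions = np.loadtxt(io.StringIO(text), usecols=0, dtype=np.int32)
flags = np.loadtxt(io.StringIO(text), usecols=1, dtype=str)

# Turn the text flags into a real boolean array.
mask = flags == "true"

print(positions.dtype, mask.dtype)
print(positions)
print(mask)
```

This reads the input twice, which may be noticeably slow for a file with about 16 million lines. Loading the structured array once and deriving the mask from its 'mappable' field, as above, avoids the second pass.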

Thank you very much. This is amazing :) I will also test it on my large file and then come back to accept the answer!

2) Look up a position with searchsorted and check the mask:

In [297]: rand_num = 10012 # np.random.randint(10000,10037)

In [298]: matched_indx = np.searchsorted(indx, rand_num)-1

In [299]: matched_indx
Out[299]: 5

In [300]: if mask[matched_indx]:
     ...:     print("It is mappable!")
     ...: else:
     ...:     print("It is NOT mappable!")
     ...:     
It is NOT mappable!
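The query 10012 is not itself a start position. When a query coincides exactly with a start position (for example 10000), the default side='left' followed by -1 selects the preceding interval, and a query of 0 would give index -1. The sketch below shows one possible way to handle this, assuming each flag applies from its start position (inclusive) up to the next start position (exclusive). The helper name is_mappable is hypothetical, and the arrays are copied from the sample run rather than loaded from chr1.txt.

```python
import numpy as np

# Start positions and flags copied from the sample run above.
positions = np.array([0, 10000, 10001, 10005, 10007, 10011, 10013, 10017,
                      10019, 10023, 10025, 10029, 10031, 10035, 10037],
                     dtype=np.int32)
mask = np.array([False, True, True, False, True, False, True, False, False,
                 False, True, True, False, True, False])


def is_mappable(positions, mask, queries):
    """Hypothetical helper: flag of the interval that contains each query.

    Assumes intervals are [start, next start) and queries >= positions[0].
    """
    # side='right' counts the start positions <= query, so subtracting 1 gives
    # the index of the last interval starting at or before the query.
    idx = np.searchsorted(positions, queries, side='right') - 1
    return mask[idx]


# searchsorted accepts an array of queries, so many positions are looked up at once.
queries = np.array([0, 10000, 10012, 10026])
result = is_mappable(positions, mask, queries)
print(result)

# For comparison: the default side='left' assigns 10000 to the interval before it.
default_at_boundary = mask[np.searchsorted(positions, 10000) - 1]
print(default_at_boundary)
```

Whether a boundary position belongs to the interval it starts or to the one before it depends on how the input file defines its regions, so the choice of side should follow that convention.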