Python: find overlapping or fully nested ranges and flag them
Tags: python, pandas, python-2.7

If consecutive start-stop pairs have the same chr, I want to find overlapping or fully nested ranges among those consecutive start-stop ranges, after sorting by start from smallest to largest.
The output should look like this:
So far I have:
import pandas as pd
df = pd.DataFrame({'region_name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], 'start' : [1913, 46430576, 52899183, 58456122, 62925929, 65313395, 65511483, 65957829], 'stop' : [90207973, 90088654, 90088654, 74708723, 84585795, 90081985, 90096995, 83611443], 'chr':[1, 1, 1, 1, 1, 1, 1, 2]})
df = df.sort_values(by=['chr', 'start'], ascending=[True, True])

# NOTE: the 'critical_error' and 'overlap' columns are assumed to exist already;
# they are not created in the snippet shown here.
for i in range(1, len(df['region_name'])):
    if df['critical_error'][i] == True:
        continue
    for j in range(0, i):
        # row i starts and ends no later than the stop of row j, on the same chr: nested
        if df['start'][i] <= df['stop'][j] and df['stop'][i] <= df['stop'][j] and df['chr'][i] == df['chr'][j]:
            df['overlap'][i] = 'no overlap, nested with region %s' % df['region_name'][j]
            break
        # row i starts before row j stops but ends after it, on the same chr: overlap
        elif df['start'][i] < df['stop'][j] and df['chr'][i] == df['chr'][j]:
            df['overlap'][i] = 'overlap within region ' + df['region_name'][j]
        else:
            continue

I didn't get this part: "... if consecutive start-stop pairs have the same chr".
I still wrote some code that reproduces the given table in some respects. If you clarify your point, I may update my answer. Maybe it already helps you and you can fill in the missing part yourself.
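The requirement in the question (sort by start, compare only rows that share a chr) can also be sketched without nested loops. The snippet below is only one possible reading, not the asker's or the answerer's method: it assumes closed intervals, compares each row with the largest stop seen among earlier rows of the same chr, and writes the result to a hypothetical `flag` column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'region_name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'start': [1913, 46430576, 52899183, 58456122, 62925929, 65313395, 65511483, 65957829],
    'stop': [90207973, 90088654, 90088654, 74708723, 84585795, 90081985, 90096995, 83611443],
    'chr': [1, 1, 1, 1, 1, 1, 1, 2],
})

# Sort so that, within each chr, earlier rows always start no later than later rows.
df = df.sort_values(['chr', 'start']).reset_index(drop=True)

# Largest stop among the *earlier* rows of the same chr (NaN for the first row of a chr).
prev_max_stop = df.groupby('chr')['stop'].transform(lambda s: s.cummax().shift())

# Comparisons against NaN are False, so the first row of each chr is never flagged.
nested = df['stop'] <= prev_max_stop                  # ends inside an earlier range
overlap = (df['start'] <= prev_max_stop) & ~nested    # starts inside, ends after it

# 'flag' is a hypothetical column name chosen for this sketch.
df['flag'] = np.select([nested, overlap], ['nested', 'overlap'], default='none')
print(df[['region_name', 'chr', 'flag']])
```

Because the rows are sorted by start, any earlier row of the same chr starts at or before the current one, so a single running maximum of stop is enough to decide between "nested", "overlap" and "none". This sketch does not report which region a row is nested in or overlaps with.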
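Hi, this is not a Python implementation, but I noticed that you have genomic data, and there is actually a very powerful, efficient tool that does what you want. You can then follow up with a quick script.

Hi, thanks for the code so far. By "same chr" I mean that for the last row (the last start-stop pair) chr is 2, while for the rest of the pairs it is 1, so that row neither overlaps with nor is nested in the rest, which is what your answer outputs.

OK, the code counts a nesting or an overlap if and only if the rows also have the same chr field. If that is what you want, go for it. If you have any further questions, feel free to ask.

Sorry, I cannot reproduce your solution. At nested = ((start > df.loc[:, 'start']) & (stop < df.loc[:, 'stop'])) I get an "invalid type comparison" error. I also edited the df in the original question to include the region_name column.

You have to adapt to the changed DataFrame then: start, stop, ch = row[:3] must be changed to the new layout. Yes, the asterisk belongs there, but you seem to be using Python 2. You can replace both expressions with the following pattern: overlap_indices = [x for x in zip(range(len(overlap)), overlap) if x[1]]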
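To make the replacement pattern from the last comment concrete, here is a small sketch on a made-up boolean mask (the `mask` values are purely illustrative). The list comprehension avoids the star-unpacking syntax that only Python 3 accepts, and `np.flatnonzero` is shown as an alternative when only the positions are needed.

```python
import numpy as np
import pandas as pd

# Illustrative mask; in the answer this would be `nested` or `overlap`.
mask = pd.Series([False, True, False, True])

# Python 3 only (star-unpacking a filter object into a list):
#   pairs = [*filter(lambda x: x[1], zip(range(len(mask)), mask))]

# Equivalent form that also runs on Python 2.7:
pairs = [x for x in zip(range(len(mask)), mask) if x[1]]

# If only the positions of the True entries are needed:
positions = np.flatnonzero(mask.values)

print([index for index, _ in pairs])
print(positions.tolist())
```

Both variants yield the positions of the `True` entries, which the answer's code then passes to `iloc` to look up the region names.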
import pandas as pd
import numpy as np

df = pd.DataFrame({'region_name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], 'start' : [1913, 46430576, 52899183, 58456122, 62925929, 65313395, 65511483, 65957829], 'stop' : [90207973, 90088654, 90088654, 74708723, 84585795, 90081985, 90096995, 83611443], 'chr':[1, 1, 1, 1, 1, 1, 1, 2]})

# store texts for each row in that list
overlaps_texts = []
# iterate over all rows
for i, row in df.iterrows():
    # extract the entry's data by column name, so the column order does not matter
    start, stop, ch = row['start'], row['stop'], row['chr']
    # Check if I am completely inside (nested into something)
    # Note that this always returns a boolean indexer where each entry is True or False
    # So nested will be something like [False, False, True, ...] where True means
    # that start > start_other AND stop < stop_other (="I am nested")
    nested = ((start > df.loc[:, 'start']) & (stop < df.loc[:, 'stop']))
    # hanging out left
    overlap_1 = ((stop > df.loc[:, 'start']) &
                 (stop < df.loc[:, 'stop'])
                 )
    # starting before stop of other but ending after (hanging out right)
    overlap_2 = ((start < df.loc[:, 'stop']) & (start > df.loc[:, 'start']))
    # either kind of overlap counts, unless the row is fully nested
    overlap = (overlap_1 | overlap_2) & ~nested
    # identical chr? I didn't get that part. That may be different for your application
    overlap &= df.loc[:, 'chr'] == ch
    nested &= df.loc[:, 'chr'] == ch
    # generate text
    text = ''
    # check if any nestings
    if np.any(nested):
        # (position, flag) pairs for the True entries; the star syntax needs Python 3
        nested_indices = [*filter(lambda x: x[1], zip(range(len(nested)), nested))]
        text = "I am nested within: "
        region_names = []
        for index, _ in nested_indices:
            region_names.append(df['region_name'].iloc[index])
        text += ", ".join(region_names) + "; "
    # check if any overlaps (obviously one can write that more DRY), since it repeats the pattern from above
    if np.any(overlap):
        overlap_indices = [*filter(lambda x: x[1], zip(range(len(overlap)), overlap))]
        text += "I overlap: "
        region_names = []
        for index, _ in overlap_indices:
            region_names.append(df['region_name'].iloc[index])
        text += ", ".join(region_names)
    if text == '':
        text = 'I am not nested nor do I overlap something'
    overlaps_texts.append(text)

df.loc[:, 'overlap'] = overlaps_texts
print(df)
Output:

      start  ...  overlap
0      1913  ...  I am not nested nor do I overlap something
1  46430576  ...  I am nested within: A; I overlap: G
2  52899183  ...  I am nested within: A; I overlap: B, G
3  58456122  ...  I am nested within: A, B, C; I overlap: E, F, G
4  62925929  ...  I am nested within: A, B, C; I overlap: D, F, G
5  65313395  ...  I am nested within: A, B, C; I overlap: D, E, G
6  65511483  ...  I am nested within: A; I overlap: B, C, D, E, F
7  65957829  ...  I am not nested nor do I overlap something
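The answer itself remarks that the text-building part could be written more DRY. A minimal sketch of that idea follows; `names_where` and `describe` are hypothetical helper names introduced here, and the boolean masks are assumed to share the DataFrame's index so they can be used directly with `df.loc`.

```python
import pandas as pd

def names_where(df, mask, name_col='region_name'):
    """Join the names of all rows selected by a boolean mask."""
    return ", ".join(df.loc[mask, name_col])

def describe(df, nested, overlap):
    """Build the per-row text from the two boolean masks."""
    parts = []
    if nested.any():
        parts.append("I am nested within: " + names_where(df, nested))
    if overlap.any():
        parts.append("I overlap: " + names_where(df, overlap))
    return "; ".join(parts) or "I am not nested nor do I overlap something"

# Tiny illustrative frame and masks (made-up values, not the thread's data).
demo = pd.DataFrame({'region_name': ['A', 'B', 'C']})
nested_mask = pd.Series([True, False, False])
overlap_mask = pd.Series([False, False, True])
nothing = pd.Series([False, False, False])

print(describe(demo, nested_mask, overlap_mask))
print(describe(demo, nothing, nothing))
```

Inside the loop of the answer, `text = describe(df, nested, overlap)` would replace both `if np.any(...)` branches. One small difference: this version does not leave a trailing "; " when a row is nested but overlaps nothing.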