Python: iterating over a .txt file to extract data matching your conditions


I have a sample inputfile.txt:

chr1    34870071    34899867    pi-Fam168b.1    -
chr11   98724946    98764609    pi-Wipf2.1  +
chr11   105898192   105920636   pi-Dcaf7.1  +
chr11   120486441   120495268   pi-Mafg.1   -
chr12   3891106 3914443 pi-Dnmt3a.1 +
chr12   82815946    82882157    pi-Map3k9.1 -
chr13   23855536    23856215    pi-Hist1h1a.1   +
chr13   55206682    55236190    pi-Zfp346.1 +
chr1    95700553    95718679    pi-Ing5.1   +
chr13   55313417    55419685    pi-Nsd1.1   +
chr14   27852218    27920472    pi-Il17rd.1 +
chr14   65430438    65568699    pi-Hmbox1.1 -
chr1    120524521   120581739   pi-Tfcp2l1.1    +
chr15   81633147    81657289    pi-Tef.1    +
chr15   89331804    89390691    pi-Shank3.1 +
chr15   103021983   103070259   pi-Cbx5.1   -
chr16   16896549    16927451    pi-Ppm1f.1  +
chr16   17233679    17263523    pi-Hic2.1   +
chr16   17452059    17486929    pi-Crkl.1   +
chr16   24393531    24992661    pi-Lpp.1    +
chr16   43964878    43979143    pi-Zdhhc23.1    -
chr17   25098236    25152532    pi-Cramp1l.1    -
chr17   27993451    28036985    pi-Uhrf1bp1.1   +
chr17   83973363    84031786    pi-Kcng3.1  -
chr1    133904194   133928161   pi-Elk4.1   +
chr18   60844148    60908308    pi-Ndst1.1  -
chr19   10057193    10059582    pi-Fth1.1   +
chr19   44637337    44650762    pi-Hif1an.1 +
chr1    135027714   135036359   pi-Ppp1r15b.1   +
chr2    28677821    28695861    pi-Gtf3c4.1 -
chr1    136651241   136852527   pi-Ppp1r12b.1   -
chr2    154262219   154365092   pi-Cbfa2t2.1    +
chr2    156022393   156135687   pi-Phf20.1  +
chr3    51028854    51055547    pi-Ccrn4l.1 +
chr3    94985683    95021902    pi-Gabpb2.1 -
chr1    158488203   158579750   pi-Abl2.1   +
chr4    45411294    45421633    pi-Mcart1.1 -
chr4    56879897    56960355    pi-D730040F13Rik.1  -
chr4    59818521    59917612    pi-Snx30.1  +
chr4    107847846   107890527   pi-Zyg11a.1 -
chr4    107900359   107973695   pi-Zyg11b.1 -
chr4    132195002   132280676   pi-Eya3.1   +
chr4    134968222   134989706   pi-Rcan3.1  -
chr4    136025678   136110697   pi-Luzp1.1  +
chr1    162933052   162964958   pi-Zbtb37.1 -
chr5    38591490    38611628    pi-Zbtb49.1 -
chr5    67783388    67819359    pi-Bend4.1  -
chr5    114387108   114443767   pi-Ssh1.1   -
chr5    115592990   115608225   pi-Mlec.1   -
chr5    143628624   143656891   pi-Fbxl18.1 -
chr1    172123561   172145541   pi-Uhmk1.1  -
chr6    83312367    83391602    pi-Tet3.1   -
chr6    85419571    85434653    pi-Fbxo41.1 -
chr6    116288039   116359551   pi-March08.1    +
chr6    120786229   120842859   pi-Bcl2l13.1    +
chr7    71031236    71083761    pi-Klf13.1  -
chr7    107068766   107128968   pi-Rnf169.1 -
chr7    139903770   140044311   pi-Fam53b.1 -
chr8    72285224    72298794    pi-Zfp866.1 -
chr8    106872110   106919708   pi-Cmtm4.1  -
chr8    112250549   112261649   pi-Atxn1l.1 -
chr10   41901651    41911816    pi-Foxo3.1  -
chr8    119682164   119739895   pi-Gan.1    +
chr8    125406988   125566154   pi-Ankrd11.1    -
chr9    27148219    27165314    pi-Igsf9b.1 +
chr9    44100521    44113717    pi-Hinfp.1  -
chr9    61761092    61762348    pi-Rplp1.1  -
chr9    106590412   106691503   pi-Rad54l2.1    -
chr9    114416339   114473487   pi-Trim71.1 -
chr9    119311403   119351032   pi-Acvr2b.1 +
chr9    119354082   119373348   pi-Exog.1   +
chr10   82822985    82831579    pi-D10Wsu102e.1 +
chr10   126415753   126437016   pi-Ctdsp2.1 +
chr1    90159688    90174093    pi-Hjurp.1  -
chr11   60591039    60597792    pi-Smcr8.1  +
chr11   69209318    69210176    pi-Lsmd1.1  +
chr11   75345218    75391069    pi-Slc43a2.1    +
chr11   79474214    79511524    pi-Rab11fip4.1  +
chr11   95818479    95868022    pi-Igf2bp1.1    -
chr11   97223641    97259855    pi-Socs7.1  +
chr11   97524530    97546757    pi-Mllt6.1  +
chr1    120355721   120355843   1-qE2.3-2.1 -
chr2    120518324   120540873   2-qE5-4.1   +
chr7    82913927    82926993    7-qD2-40.1  -
Column 1 = chromosome number

Column 2 = start

Column 3 = end

Column 4 = gene name

Column 5 = orientation (+ or -)

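1.) I need to extract the lines that share the same chromosome number (column 1), whose start sites (column 2) differ by at most 200 (i.e. 200 or fewer), and whose orientations are opposite (one is + and the other is -).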

This is what I have so far, and I am not sure where my mistake is:

import itertools as it

def getrecords(path):
    # yield each non-empty line as a list of whitespace-separated fields
    with open(path) as f:
        for line in f:
            if line.strip():
                yield line.split()

key = lambda x: x[0]  # group by chromosome (column 1)
for chrom, recs in it.groupby(sorted(getrecords('inputfile.txt'), key=key), key=key):
    for c0, c1 in it.combinations(recs, 2):
        # opposite orientation and start sites at most 200 apart
        if c0[4] != c1[4] and abs(int(c0[1]) - int(c1[1])) <= 200:
            print("%s\t%s\t%s" % (c0[0], c0[1], c0[3]))
            print("%s\t%s\t%s" % (c1[0], c1[1], c1[3]))
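Please note: this code runs, but I also need to take the end site (column 3) into account for lines with negative ('-') orientation. In other words, when comparing, use the start site if the line has '+' orientation and the end site if it has '-' orientation. How can I edit the code so that it meets all of these conditions?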

I expect about 15 unique sequence lines.

I will then sort these lines to eliminate duplicates.

Are the checks for "same chromosome number", "start site difference less than or equal to 200", and "opposite orientation" correct?

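I added a print statement for the start site difference and found that none of the values comes anywhere near 200. Most of them are in the millions. From this sample file, do you know which lines you expect to be printed?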
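The check described in this comment can be sketched as follows: for each chromosome, report the smallest start-site difference among pairs with opposite orientation. This is only an illustration; the function name `closest_opposite_pairs` is an assumption, and the inline sample is a handful of rows copied from the file above rather than the whole file.

```python
import itertools
from collections import defaultdict

def closest_opposite_pairs(lines):
    """Return {chromosome: (smallest start difference, gene_a, gene_b)}
    over all pairs of records with opposite orientation."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        chrom, start, end, gene, strand = fields
        by_chrom[chrom].append((int(start), gene, strand))

    closest = {}
    for chrom, records in by_chrom.items():
        diffs = [
            (abs(a[0] - b[0]), a[1], b[1])
            for a, b in itertools.combinations(records, 2)
            if a[2] != b[2]  # opposite orientation only
        ]
        if diffs:
            closest[chrom] = min(diffs)
    return closest

# A few rows copied from inputfile.txt above
sample = [
    "chr1    120524521   120581739   pi-Tfcp2l1.1    +",
    "chr1    120355721   120355843   1-qE2.3-2.1 -",
    "chr1    34870071    34899867    pi-Fam168b.1    -",
    "chr4    107847846   107890527   pi-Zyg11a.1 -",
    "chr4    107900359   107973695   pi-Zyg11b.1 -",
]
for chrom, (diff, gene_a, gene_b) in sorted(closest_opposite_pairs(sample).items()):
    print(chrom, diff, gene_a, gene_b)
```

On these rows the closest opposite-orientation pair on chr1 is 168800 apart, far more than 200, and chr4 is not reported at all because both of its rows have the same orientation.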


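As for orientation, I don't understand what you mean by the start and end having different orientations, since each line has only one orientation.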

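If the text file has a header line with the column names above the data, for example:

chromosome_number    start    end    gene_name    Orientation
and you have the pandas package installed, you can extract the required values with this code: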

import itertools
import pandas

# sep=r'\s+' parses whitespace-delimited (spaces or tabs) input; it replaces
# delim_whitespace=True, which is deprecated in newer pandas versions
data = pandas.read_table('inputfile.txt', sep=r'\s+')
# group by chromosome_number
for name, group in data.groupby('chromosome_number'):
    # compare every pair of rows within the same chromosome
    for a, b in itertools.combinations(group.itertuples(), 2):
        # start sites at most 1000000 apart (the sample has no pair within 200)
        # and opposite orientations
        if abs(a.start - b.start) <= 1000000 and a.Orientation != b.Orientation:
            print(group.loc[[a.Index]])
            print(group.loc[[b.Index]])

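It would help if you edited the question to include some sample output for the given data. I think your code works fine on the file. The problem is your data: there is no combination with a difference less than or equal to 200. With the threshold of 1000000 used in the code above, the output is: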
   chromosome_number      start        end     gene_name Orientation
12              chr1  120524521  120581739  pi-Tfcp2l1.1           +
   chromosome_number      start        end    gene_name Orientation
81              chr1  120355721  120355843  1-qE2.3-2.1           -
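
For the strand-aware comparison asked about in the question (start site for '+' lines, end site for '-' lines), a minimal sketch might look like the following. It is not taken from the answers above: the helper names `site` and `find_pairs`, the `max_diff` parameter, and the made-up `chrX`/`chrY` records in the usage example are all assumptions for illustration. Matches are collected in a set, so a line that pairs with several partners is reported only once.

```python
import itertools
from collections import defaultdict

def site(record):
    """Position to compare: start (column 2) for '+', end (column 3) for '-'."""
    chrom, start, end, gene, strand = record
    return int(start) if strand == '+' else int(end)

def find_pairs(lines, max_diff=200):
    """Return the sorted, de-duplicated records that have a partner on the
    same chromosome with opposite orientation and a site within max_diff."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = tuple(line.split())
        if len(fields) == 5:  # skip blank or malformed lines
            by_chrom[fields[0]].append(fields)

    hits = set()
    for records in by_chrom.values():
        for a, b in itertools.combinations(records, 2):
            if a[4] != b[4] and abs(site(a) - site(b)) <= max_diff:
                hits.update((a, b))
    return sorted(hits)

# Hypothetical records, not from inputfile.txt
demo = [
    "chrX 1000 5000 geneA +",  # site = 1000 (start)
    "chrX 200 1150 geneB -",   # site = 1150 (end), 150 away from geneA
    "chrX 900 9000 geneC -",   # site = 9000 (end), too far from geneA
    "chrY 1000 1100 geneD -",  # different chromosome
]
for rec in find_pairs(demo):
    print("\t".join(rec))
```

On the hypothetical records only geneA and geneB are printed. To run it on the real file, something like `find_pairs(open('inputfile.txt'))` should work. Note that the final sort compares the fields as strings, so it orders lines lexicographically rather than by numeric position.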