Python 如果第1列匹配,则具有第11列值的提取行位于第二个文件的第2列和第3列之间

Python 如果第1列匹配,则具有第11列值的提取行位于第二个文件的第2列和第3列之间,python,Python,你好,我有两个文件 文件1: chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255 + - 20311210.00 chr5 26610220 26610295 5 26610221 26610296 K00230:40:HNWJLBBXX:4:1101:1022:24155 255 + - 26610258.00 chr5 20311169 20311244 5 20311177

你好,我有两个文件

文件1:

chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255 + - 20311210.00 chr5 26610220 26610295 5 26610221 26610296 K00230:40:HNWJLBBXX:4:1101:1022:24155 255 + - 26610258.00 chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255+-20311210.00 chr5 26610220 26610295 5 26610221 26610296 K00230:40:HNWJLBBXX:4:1101:1022:24155255+-26610258.00 文件2

chr5 20311200 20311220 Nucleosome:1 110 5.0 39.9 MainPeak 1.43492858 0.68583064 chr5 801 861 Nucleosome:2 70 1.0 5.4 MainPeak 0.17076187 0.806538035 chr5 1021 1091 Nucleosome:3 80 2.0 14.4 MainPeak 0.42430331 0.481579895 chr5 1181 1251 Nucleosome:4 80 1.0 7.5 MainPeak 0.1362587 0.32626102999999995 chr5 1361 1441 Nucleosome:5 90 2.0 14.7 MainPeak 0.34212933 0.291726595 chr5 1621 1801 Nucleosome:6 190 2.0 26.1 MainPeak:doublet 0.37546564 0.353192625 chr5 2011 2071 Nucleosome:7 70 1.0 5.7 MainPeak 0.15091517 0.396369735 chr5 2161 2331 Nucleosome:8 180 1.0 17.2 MainPeak 0.08865312 0.42133046500000004 chr5 2441 2561 Nucleosome:9 130 2.5 25.3 MainPeak 0.7368501 0.48843276 chr5 2781 2851 Nucleosome:10 80 3.0 17.5 MainPeak 0.80818501 1.303005 chr5 3271 3431 Nucleosome:11 170 3.0 34.5 MainPeak+Shoulder 0.72967697 1.348257495 chr5 3521 3571 Nucleosome:12 60 1.0 5.8 MainPeak 0.1880739 0.504429705 chr5 3641 3791 Nucleosome:13 160 1.0 12.5 MainPeak:doublet 0.10098579 0.363148215 chr5 20311200 20311220核小体:11105.039.9主峰1.43492858 0.68583064 chr5 801861核小体:270 1.0 5.4主峰0.17076187 0.806538035 chr5 1021 1091核小体:380 2.0 14.4主峰0.42430331 0.481579895 chr5 1181 1251核小体:480 1.0 7.5主峰0.1362587 0.3262610999995 chr5 1361 1441核小体:590 2.0 14.7主峰0.34212933 0.291726595 chr5 1621 1801核小体:61902.026.1主峰:双峰0.37546564 0.353192625 chr5 2011 2071核小体:770 1.0 5.7主峰0.15091517 0.396369735 chr5 2161 2331核小体:8 180 1.0 17.2主峰0.08865312 0.421330404650000004 chr5 2441 2561核小体:9 130 2.5 25.3主峰0.7368501 0.48843276 chr5 2781 2851核小体:10803.017.5主峰0.80818501 1.303005 chr5 3271 3431核小体:11 170 3.0 34.5主峰+肩峰0.72967697 1.348257495 chr5 3521 3571核小体:12 60 1.0 5.8主峰0.1880739 0.504429705 chr5 3641 3791核小体:13 160 1.0 12.5主峰:双峰0.10098579 0.363148215 如果第11列的值在seconds文件中声明的start和end(第2列和第3列)范围内,我对使用python代码打印文件1中的行感兴趣。由于该位置仅在特定染色体(chr)内是唯一的,因此必须首先测试chr是否相同。。。因此,我期望的输出是

chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255 + - 20311210.00 chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255+-20311210.00 我试过awk代码。。它工作得非常好,但速度非常慢

我正在测试的文件(我需要从中打印大约4GB的行)

如果我能有一些python代码,我将不胜感激


谢谢

一个从文本中提取第n列的简单函数使这一点变得相当简单。我假设你说的“第11列”是指从1开始计算的11列,而不是索引-11列,其中第一项是索引-0

伪代码:

Until there's no data left ~
    Read line1 from file1
    Read line2 from file2
    Extract Col11 from line1 as a real number
    Extract Col2 & Col3 from line2 as real numbers
    IF Col11 is within Col2 & Col3
        do something
Python代码:

import sys

# Given a space-separated row of data, return the Nth column as a real number
def getNthColumn(row, N):
    # Single-space the row, removing tabs, double-spaces etc.
    row = ' '.join(row.split())
    fields = row.split(' ')
    result = float(fields[N-1])   # fields are numbered 0->(N-1)
    #print("Returning column %d from [%s] -> %f" % (N, row, result))
    return result

if (len(sys.argv) == 3):
    fin1 = open(sys.argv[1], "rt")
    fin2 = open(sys.argv[2], "rt")  #TODO - handle file-not-found errors, etc.

    line1 = fin1.readline()
    line2 = fin2.readline()
    while (line1 != "" and line2 != ""):
        # Get the columns from the two lines
        f1_col11 = getNthColumn(line1, 11)
        f2_col2  = getNthColumn(line2,  2)
        f2_col3  = getNthColumn(line2,  3)  ### TODO handle errors
        # work out if it's a keeper
        # print("Is %f >= %f and %f <= %f" % (f1_col11, f2_col2, f1_col11, f2_col3))
        if (f1_col11 >= f2_col2 and f1_col11 <= f2_col3):
            print("MATCH: "+line1)
        else:
            print("NO-MATCH: "+line1)
        # Next rows
        line1 = fin1.readline()
        line2 = fin2.readline()
else:
    print("Give 2 files as arguments")
导入系统 #给定以空格分隔的数据行,将第n列作为实数返回 def GETNTH列(行,N): #单行空格、删除制表符、双空格等。 row=''.join(row.split()) 字段=行分割(“”) 结果=浮点(字段[N-1])#字段编号为0->(N-1) #打印(“从[%s]->%f”%返回列%d(N,行,结果)) 返回结果 如果(len(sys.argv)==3): fin1=打开(sys.argv[1],“rt”) fin2=open(sys.argv[2],“rt”)#TODO-handle file not found错误等。 line1=fin1.readline() line2=fin2.readline() 而(第1行!=”和第2行!=”): #从两行中获取列 f1\u col11=GetnTholumn(第1行,第11行) f2_col2=第n列(第2行,第2行) f2_col3=getNthColumn(第2行,第3行)####TODO处理错误 #看看是不是守门员 #打印(“是%f>=%f和%f%f”%(N,行,结果)) 返回结果 如果(len(sys.argv)==3): fin1=打开(sys.argv[1],“rt”) fin2=open(sys.argv[2],“rt”)#TODO-handle file not found错误等。 #加载整个file2,但只加载column2和column3 #注意最小col2和最大c3 line2=fin2.readline() 最小值c2=无 最大值c3=无 而(第2行!=“”): col2=getnth列(第2行,第2行) col3=getnth列(第2行,第3行) file2_col23.append((col2,col3)) #注意最小c2和最大c3,以便我们可以快速知道搜索是否可以 #可能产生结果 如果(最小值c2==无或col2<最小值c2): 最小值c2=col2 如果(max_c3==None或col3>max_c3): 最大值c3=col3 #下一行 line2=fin2.readline().strip() #对列进行排序,使我们能够快速搜索 file2_col23.sort() line1=fin1.readline() 而(第1行!=“”): col11=getNthColumn(第1行,第11行) 匹配=错误 #col11是否在任何file2行col2或col3中
如果(col11>=min_c2和col11=col2和col11欢迎使用StackOverflow。请阅读并遵循帮助文档中的发布指南,如您创建此帐户时所建议的,并在此处应用。StackOverflow不是设计、编码、研究或教程资源。但是,如果您遵循联机找到的任何资源,请嵌套编码尝试,并遇到问题,您将有一个很好的示例发布。您好,非常感谢您的回复!我已复制粘贴了您的代码,因为它位于名为test.py的文件中,并使用以下命令python test.py file1 file2运行它。但是,我得到了如下不匹配的输出:chr5 19776794 19776869 5 19776845 19776920 K00230:40:HNWJLBBXX:4:1127:18467:28112 255+-19776857.00但我确信我的输入文件有一半以上在该范围内(这里我只测试了chr5)。请帮助!我应该更改您发布的代码中的任何参数吗?很抱歉,我是python新手…不需要C/C++…此代码非常快。@Smiley-在您的问题中添加几行测试数据,我可以查看一下。再次感谢您的回复…以下是文件1中的几行:chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255+-20311210.00 chr5 26610220 26610295 5 26610221 26610296 K00230:40:HNWJLBBXX:4:1101:1022:24155 255+-2
import sys

# Hold all the file2 Columns
file2_col23 = []

# Given a space-separated row of data, return the Nth column as a real number
def getNthColumn(row, N):
    # Single-space the row, removing tabs, double-spaces etc.
    row = ' '.join(row.split())
    fields = row.split(' ')
    try:
        result = float(fields[N-1])   # fields are numbered 0->(N-1)
    except:
        sys.stderr.write("Failed to fetch number column %d from [%s]" % (N, row))
        sys.exit(1)
    #print("Returning column %d from [%s] -> %f" % (N, row, result))
    return result

if (len(sys.argv) == 3):
    fin1 = open(sys.argv[1], "rt")
    fin2 = open(sys.argv[2], "rt")  #TODO - handle file-not-found errors, etc.

    # Load in the whole of file2, but just the column2 & column3
    # note the minimum col2 and maximum c3
    line2 = fin2.readline()
    min_c2 = None
    max_c3 = None
    while (line2 != ""):
        col2 = getNthColumn(line2, 2)
        col3 = getNthColumn(line2, 3)
        file2_col23.append( ( col2, col3 ) )
        # Note the min c2 and max c3 so we can quickly know if a search can
        # possible produce a result
        if (min_c2 == None or col2 < min_c2):
            min_c2 = col2
        if (max_c3 == None or col3 > max_c3):
            max_c3 = col3
        # next line
        line2 = fin2.readline().strip()

    # sort the columns to allow us to short-cut searching
    file2_col23.sort()


    line1 = fin1.readline()
    while (line1 != ""):
        col11 = getNthColumn(line1, 11)

        matched = False
        # is col11 is within any file2 row col2 or col3
        if (col11 >= min_c2 and col11 <= max_c3):   # make sure the search is worthwhile
            for col23 in file2_col23:
                (col2, col3) = col23
                if (col11 >= col2 and col11 <= col3):
                    matched = True
                    break

        if (matched == True):
            print("MATCH: "+str(line1))
        else:
            print("NO-MATCH: "+str(line1))

        # Next row
        line1 = fin1.readline()
else:
    print("Give 2 files as arguments")