
Python: how to create an index to parse a big text file


I have two files, A and B, in FASTQ format. Each is basically several hundred million lines of text, organized in groups of 4 lines starting with @, like this:

@120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
GCCAATGGCATGGTTTCATGGATGTTAGCAGAAGACATGAGACTTCTGGGACAGGAGCAAAACACTTCATGATGGCAAAAGATCGGAAGAGCACACGTCTGAACTCN
+120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
bbbeee_[_ccdccegeeghhiiehghifhfhhhiiihhfhghigbeffeefddd]aegggdffhfhhihbghhdfffgdb^beeabcccabbcb`ccacacbbccB
I need to compare the

5:1101:1156:2031#0/

part between files A and B, and write the matching groups of 4 lines from file B into a new file. I wrote a piece of Python code that does this, but it only works for small files, because for every line of file A it parses all of file B, and both files contain hundreds of millions of lines.
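
For context, the slow pattern described above looks roughly like this (a hypothetical reconstruction, not the actual code from the question; the filenames are placeholders):

def key_of(header):
    # Text between the first ':' and the last '/' of an @-line.
    return header.split(':', 1)[1].rsplit('/', 1)[0]

# For every record header in A, rescan all of B: O(len(A) * len(B)).
with open('afile.fastq') as fa:
    for i, a_line in enumerate(fa):
        if i % 4 != 0:
            continue                      # only @-header lines carry keys
        key = key_of(a_line)
        with open('bfile.fastq') as fb:   # B is re-read for every key in A
            for j, b_line in enumerate(fb):
                if j % 4 == 0 and key_of(b_line) == key:
                    pass                  # copy this 4-line group, then break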

Someone suggested I should build an index for file B. I have googled around without success, and would be very grateful if someone could point out how to do this, or point me to a tutorial so I can learn. Thank you.

== EDIT ==
In theory, each group of 4 lines should exist only once in each file. Would it be fast enough to break out of the parsing after each match, or do I need a different algorithm altogether?

These guys claim to parse files of a few gigs while using a dedicated library, see


IMO a better way would be to parse it once and load the data into some database instead of into an output file (i.e. mysql), and then run the queries there.

An index is just a shortened version of the information you are working with. In this case, you will want the "key" - the text between the first colon (':') on the @-line and the last slash ('/') near the end - and some kind of value.
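
For the sample header above, that key works out like this (a minimal illustration; rsplit matches the "last slash" wording here, and for headers like this one with a single slash it agrees with the extract_key function in the code below):

header = '@120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1'
machine_name, rest = header.split(':', 1)  # drop everything up to the first ':'
key = rest.rsplit('/', 1)[0]               # drop the trailing '/1' or '/2'
print(key)                                 # -> 5:1101:1156:2031#0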

Since the "value" in this case is the entire contents of a 4-line block, and since our index will store a separate entry for every block, we would be storing the whole file in memory if we used the actual values in the index.

Instead, let's use the file position at which each 4-line block begins. That way, you can move to that file position, print 4 lines, and stop. Total cost: the 4 or 8 bytes needed to store an integer file position, instead of however many bytes of actual genome data.
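
The retrieval side is then just a seek plus four reads; a minimal sketch (fetch_block is a hypothetical helper, not part of the code below, and pos would come from the index):

def fetch_block(f, pos):
    # Jump straight to the stored offset of a record and
    # read back exactly the 4 lines of that group.
    f.seek(pos)
    return [f.readline() for _ in range(4)]

# e.g.:
# with open('afile.fastq') as f:
#     print(''.join(fetch_block(f, 0)))   # first record starts at offset 0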

Here is some code that does the job, but that also does a lot of validation and checking. You may want to throw away the parts you don't use.

import sys

def build_index(path):
    index = {}
    for key, pos, data in parse_fastq(path):
        if key not in index:
            # Don't overwrite duplicates - use the first occurrence.
            index[key] = pos

    return index

def error(s):
    sys.stderr.write(s + "\n")

def extract_key(s):
    # This much is fairly constant:
    assert(s.startswith('@'))
    (machine_name, rest) = s.split(':', 1)
    # Per wikipedia, this changes in different variants of FASTQ format:
    (key, rest) = rest.split('/', 1)
    return key

def parse_fastq(path):
    """
    Parse the 4-line FASTQ groups in path.
    Validate the contents, somewhat.
    """
    f = open(path)
    i = 0
    # Note: iterating a file is incompatible with fh.tell(). Fake it.
    pos = offset = 0
    for line in f:
        offset += len(line)
        lx = i % 4
        i += 1
        if lx == 0:     # @machine: key
            key = extract_key(line)
            len1 = len2 = 0
            data = [ line ]
        elif lx == 1:
            data.append(line)
            len1 = len(line)
        elif lx == 2:   # +machine: key or something
            assert(line.startswith('+'))
            data.append(line)
        else:           # lx == 3 : quality data
            data.append(line)
            len2 = len(line)

            if len2 != len1:
                # error() already appends a newline, so don't add one here.
                error("Data length mismatch at line "
                        + str(i-2)
                        + " (len: " + str(len1) + ") and line "
                        + str(i)
                        + " (len: " + str(len2) + ")")
            # print("Yielding @%i: %s" % (pos, key))
            # pos is the byte offset where this 4-line block starts;
            # data is the block itself (used here mainly for validation).
            yield key, pos, data
            pos = offset

    if i % 4 != 0:
        error("EOF encountered in mid-record at line " + str(i));

def match_records(path, index):
    results = []
    for key, pos, d in parse_fastq(path):
        if key in index:
            # found a match!
            results.append(key)

    return results

def write_matches(inpath, matches, outpath):
    rf = open(inpath)
    wf = open(outpath, 'w')

    for m in matches:
        rf.seek(m)
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())

    rf.close()
    wf.close()

#import pdb; pdb.set_trace()
index = build_index('afile.fastq')
matches = match_records('bfile.fastq', index)
posns = [ index[k] for k in matches ]
write_matches('afile.fastq', posns, 'outfile.fastq')
Note that this code goes back to the first file to fetch the blocks of data. If the data is identical between the files, you could instead copy the block from the second file at the moment the match occurs.
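
If the blocks really are identical in both files, one possible shortcut (a sketch, not part of the original answer, reusing parse_fastq from above) is to write each matching block straight out of file B during the scan, skipping the second pass over file A entirely:

def write_matches_from_b(bpath, index, outpath):
    # Copy each matching 4-line block directly from file B,
    # using the data list that parse_fastq already yields.
    with open(outpath, 'w') as wf:
        for key, pos, data in parse_fastq(bpath):
            if key in index:
                wf.writelines(data)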


Note also that, depending on what you are trying to extract, you may want to change the order of the output blocks, and you may want to make sure the keys are unique - or perhaps make sure they are not unique but are repeated in the order of the matches. That's up to you - I'm not sure what you're doing with the data.
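
If the keys turn out not to be unique, one possible variant (a sketch reusing parse_fastq from the code above) is to index every occurrence instead of only the first:

from collections import defaultdict

def build_multi_index(path):
    # Variant of build_index: map each key to *all* of its
    # record offsets instead of keeping only the first one.
    index = defaultdict(list)
    for key, pos, data in parse_fastq(path):
        index[key].append(pos)
    return index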

>非常感谢。对于像我这样的初学者来说,这是一篇不错的文章,但是看看代码对我的学习有好处。我添加了一个文件和一个文件作为
sys.argv[1]
sys.argv[2]
,以便将其作为命令行运行(这样做通常不会遇到问题)。但是,当我尝试运行它时,它会显示
/some/path/code.py(92)(
->index=build\u index(afile)
(Pdb)
,并且仍然停留在那里。它没有给出一个错误,它只是停留在…对不起!我在代码中留下了对调试器的调用。带有“import pdb;pdb.set_trace()”的行会导致代码停止并调用调试器。注释掉或删除该行,它应该贯穿其中。(或者,使用“n”表示下一步,使用“s”表示单步执行,使用“p expr”表示打印expr,以便在代码运行时观察代码。)现在我收到以下警告,但没有输出。。。很抱歉打扰您:
回溯(最后一次调用):
文件“py\u fetch\u pair.py”,第94行,在
index=build\u index(afile)
文件“py\u fetch\u pair.py”,第12行,在build\u index中,用于解析\u fastq(路径)中的密钥、pos、数据:
文件“py\u fetch\u pair.py”,第32行,在parse_fastq中
f=open(path)
类型错误:强制使用Unicode:需要字符串或缓冲区,找到文件
无需担心,它正在工作!!这是我的错,脚本需要一个字符串,但我在尝试实现
sys.argv[]
时将文件放在了那里。谢谢!有人能解释一下[parse_fastq]中[pos和data]的用途吗?在脚本的其余部分中,在何处以及如何使用这些内容?也许我错过了什么。谢谢
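
As the comments work out, the fix for that TypeError is to pass the path strings from sys.argv into the functions rather than already-opened file objects; a hypothetical command-line wrapper around the code above might look like this (sys is already imported at the top of the script):

if __name__ == '__main__':
    # Usage: python py_fetch_pair.py afile.fastq bfile.fastq outfile.fastq
    afile, bfile, outfile = sys.argv[1], sys.argv[2], sys.argv[3]
    index = build_index(afile)            # pass the path string, not open(afile)
    matches = match_records(bfile, index)
    posns = [index[k] for k in matches]
    write_matches(afile, posns, outfile)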