Python: extracting gene names from specific occurrences of a pattern in a large file


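I have a FASTA file, which is basically a text file describing biological sequence data. It contains more than 10,000 FASTA sequences (each one starts with >). The beginning of the file looks like this: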

>Gene A
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
>Gene B
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
AAAGTAAATATTATAAAAGCCCTC
>Gene C
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
>Gene D
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
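And so on, for about 10,000 genes. I would like to:

  • find out which genes contain a specific pattern (CTTTGTA)
  • count how many times the pattern occurs in each of those genes
  • export a list of the names of the genes that contain the pattern, together with the pattern's frequency
  • any solution in Bash or Python (or R) is welcome

    By the way, this is what I have tried so far without success: extracting each gene and its sequence into a separate file and then grepping for the pattern in each of those files. However, I could not generate the separate files. I used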

    grep '^>' file.txt > new_file.txt
    

    But the output I got was a single file containing only the gene names.

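    Here is an R solution using the stringi package. Since no text file or similar was available as a reproducible example, I use cat() and readLines() to write and read a temporary text file containing a copy of the lines you provided. Please also check the timing benchmark, which may be of interest for large files.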

    import operator

    needle = 'CTTTGTA'

    # Read the whole file and split it into records; every record starts with '>'.
    with open('fastafile.txt') as fasta:
        records = fasta.read().split('>')

    occurrences = {}

    for record in records:
        if not record.strip():
            continue  # Skip the empty chunk before the first '>'.
        # The first line is the gene name; the remaining lines are the sequence.
        gene_name, _, sequence = record.partition('\n')
        sequence = ''.join(sequence.split())  # Join the lines so matches spanning a line break are counted.
        occ = sequence.count(needle)  # Number of non-overlapping occurrences of the needle.
        if occ:  # If greater than 0, store the count under the gene name.
            occurrences[gene_name.strip()] = occ

    # Sort by count so genes with the most occurrences appear at the top.
    sorted_occurrences = sorted(occurrences.items(), key=operator.itemgetter(1), reverse=True)

    output = []

    for gene_name, occ_count in sorted_occurrences:
        formatted_line = '{gene_name} - {occ_count}'.format(gene_name=gene_name, occ_count=occ_count)  # Format the lines the way you want.
        output.append(formatted_line)

    with open('occurences.txt', 'w') as o_f:  # Open in write mode.
        o_f.write('\n'.join(output))
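
The same idea can also be written as a single pass over the lines, which avoids holding the whole file in memory at once. This is only a minimal sketch: the helper names `read_fasta` and `count_pattern` are made up for illustration, and the toy records below are not the data from the question. One of the toy sequences is deliberately split so that the pattern spans a line break.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)


def count_pattern(lines, needle):
    """Return [(gene_name, count)] for genes containing needle, highest count first."""
    hits = [(name, seq.count(needle)) for name, seq in read_fasta(lines)]
    hits = [(name, n) for name, n in hits if n > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy input (an assumption for the demo); 'Gene A' has the pattern split across two lines.
sample = [
    '>Gene A', 'GAACTTTG', 'TACC',
    '>Gene B', 'GGGG',
    '>Gene C', 'CTTTGTACTTTGTA',
]
result = count_pattern(sample, 'CTTTGTA')
for name, n in result:
    print('{} - {}'.format(name, n))
```

Because an open file object is itself an iterable of lines, the sketch should work on a real file by passing the handle directly, for example `count_pattern(open_file_handle, 'CTTTGTA')` inside a `with open(...)` block. Note that `str.count` counts non-overlapping matches only.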
    
    library(stringi)
    
    cat(">Gene A
    GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
        TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
        >Gene B
        GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
        TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
        AAAGTAAATATTATAAAAGCCCTC
        >Gene C
        AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
        GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
        TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
        ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
        AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
        CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
        ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
        >Gene D
        AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
        GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
        TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
        ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
        AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
        CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
        ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
    ", file = "tempfile.txt")
    
    genes <- readLines("tempfile.txt", n=-1)
    unlink("tempfile.txt")
    
    genes <- unlist(stri_split_fixed(paste(genes, collapse = " "), ">"))
    genes <- genes[ genes != ""]
    
    genenames <- unlist(stri_extract_all_regex(genes, "Gene \\w+"))
    genes <- stri_replace_all_fixed(genes, genenames, "")
    names(genes) <- genenames
    
    genes <- gsub("\\s+", "", genes, perl = T) 
    
    gene_pattern_freq <- function(str, patterns) {
    
      res <- sapply(patterns, function(p) {
    
        stringi::stri_count_fixed(str, p)
    
      }, USE.NAMES = T)
    
      rownames(res) <- names(str)
    
      return(res)
    }
    
    searchpatterns <- c("AA", "GT", "GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA")
    result <- gene_pattern_freq(genes, searchpatterns)
    result
    #        AA GT GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA
    # Gene A 14  6                                      1
    # Gene B 21 10                                      1
    # Gene C 52 18                                      0
    # Gene D 52 18                                      0
    
    library(microbenchmark)
    microbenchmark(gene_pattern_freq(genes, searchpatterns))
    # Unit: microseconds
    # expr                                      min     lq    mean   median   uq     max   neval
    # gene_pattern_freq(genes, searchpatterns) 68.687 77.371 123.438 78.161 79.345 4479.19   100
    
    #export
    write.csv(result, file = "../mypath/gene_pattern_freq_result.csv" )
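
For readers who prefer Python, a rough counterpart of the pattern-frequency table might look like the sketch below. It assumes the genes have already been parsed into a dict mapping gene name to sequence; the function name `pattern_table` and the toy sequences are hypothetical and chosen only for illustration.

```python
import csv
import io


def pattern_table(genes, patterns):
    """Return rows of [gene_name, count_1, count_2, ...], one row per gene."""
    return [[name] + [seq.count(p) for p in patterns] for name, seq in genes.items()]


# Toy data (an assumption for the demo), not the sequences from the question.
genes = {'Gene A': 'AAGTAAGT', 'Gene B': 'CCCC'}
patterns = ['AA', 'GT']
rows = pattern_table(genes, patterns)

# Write the table as CSV into an in-memory buffer; a real file opened with
# open(path, 'w', newline='') could be used in place of the buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['gene'] + patterns)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

As in the other Python snippets, `str.count` reports non-overlapping matches, so a pattern such as "AA" is counted twice in "AAAA", not three times.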
    
    
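    Welcome. If an answer was helpful, please mark it as correct and consider upvoting. Thanks.

    Updated my answer so that it outputs the gene name instead of the sequence.

    Well, this will work for these occurrences, but not for a file, unless he pastes the file contents into the code, declares each sequence as a variable, and cleans them up the way you did. You should probably post a solution that opens the file in question and uses its contents rather than hard-coded variables.

    @AlexisDevarenes Does the posted link actually provide a text file as a reproducible example, or am I missing something?

    The description says there are more than 10,000 FASTA sequences, so I assume the file is FAAAR larger than what is shown on the site. I could not find the text file either.

    @Manuel Bickel: The link I posted is only a description of the FASTA format. As AlexisDevarenes correctly pointed out, my actual file has more than 10,000 entries.

    @AlexisDevarenes Updated my answer; the data is now a copy of the example lines. I also added a benchmark and would be interested in a benchmark of your solution for comparison.

    Hi @alexisdevarenes. Thanks for the possible solution. However, when running the script I get a NameError: name 'sorted_occurrences' is not defined. If I skip the sorting and just build the list from occurrences instead of sorted_occurrences, I get: gene_name, sequence = seq.split('\n') ValueError: too many values to unpack.

    Doesn't work!! Still the same NameError and ValueError.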
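
The ValueError mentioned in the comments can be reproduced in isolation. A record whose sequence spans several lines splits into more than two pieces, so unpacking into two names fails; limiting the split to the first newline avoids this. The record below is a made-up toy example, not data from the question.

```python
record = 'Gene A\nGAACTACA\nTCTTTATT\n'

# A plain split yields one piece per line, so two-name unpacking fails.
try:
    gene_name, sequence = record.split('\n')
    unpack_failed = False
except ValueError:
    unpack_failed = True

# Splitting only at the first newline yields the header and the rest of the record.
gene_name, sequence = record.split('\n', 1)
sequence = sequence.replace('\n', '')  # Join the sequence lines.

print(unpack_failed)
print(gene_name, sequence)
```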