R: Replace a matched string with its subgroups

I have some DNA sequences to process. They look like this:

>KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast
TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA
TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT
TTTCTTATTCCAAAGAAATCGATTGCTATTTTTACAAAAAGTAATCCAAGATTTTTCTTGTTTCTATATA

>KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast
GATATATTAATACCTTACCCCGCTCATCTAGAAATCTTGGTTCAAACTCTCCGATACTGGTTGAAAGATG
CTTCTTCTTTGCATTTATTACGATTCTTTCTTTATGAGTGTCGTAATTGGATTAGTCTTATTACTCCAAA
AAAATCCATTTCCTTTTTGAAAAAAAGGAATCGAAGATTATTCTTGTTCCTATATAATTTCTATGTATGT

To detect the header line of each sequence, I wrote this regex:

(\>)([A-Z]{2}\d{6}\.?\d)\s([a-zA-Z]+\-?[a-zA-Z]+)\s([a-zA-Z]+\-?[a-zA-Z]+)\s(.*)\n


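Which function should I use to replace the whole match with its group 3 + group 4? Also, I have 72 matches in one txt file; how can I replace them all in a single run?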

I solved this myself with the tidyverse package:

library(tidyverse)

SequenceRaw <- read_file("PATH OF SEQUENCE FILE\\sequenceraw.fasta") ## e.g. sequenceraw.fasta

Sequence <- str_replace_all(SequenceRaw, 
    "(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n", 
    ">\\3 \\4\n") ## Keep '>' and add a new line with '\n'

write_file(Sequence, "YOUR PATH\\sequence.fasta")
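
To see how the backreferences behave without touching any files, the same call can be run on an in-memory string. This is only a minimal sketch: `sample_fasta` is a made-up sample built from the first header above and a shortened sequence line, and `header_regex` and `cleaned` are names chosen here for illustration.

```r
library(stringr)

# Hypothetical sample: one header from the question plus a shortened sequence line
sample_fasta <- paste0(
  ">KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast\n",
  "TAAATTATGTGTCAGAGC\n"
)

# Same pattern as in the code above
header_regex <- "(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n"

# \\3 and \\4 refer to the third and fourth capture groups (genus and species)
cleaned <- str_replace_all(sample_fasta, header_regex, ">\\3 \\4\n")
cat(cleaned)
```

`str_replace_all` replaces every match in the string, so all headers in a file read with `read_file` are handled in one call.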
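
Your current regex does not work for lines where group 3 or group 4 is a single-letter word, because `[a-zA-Z]+\\-?[a-zA-Z]+` matches 1+ letters, then an optional hyphen, then 1+ letters again (so there must be at least 2 letters). With `[a-zA-Z]+(?:-[a-zA-Z]+)?`, you match 1+ letters followed by an optional sequence of `-` and 1+ letters.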
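
The difference between the two sub-patterns can be checked directly. This is a small sketch; the test words (including the single letter `x`) are made up for illustration, and the `^`/`$` anchors are added here only so that each word is tested as a whole.

```r
library(stringr)

# Hypothetical test words: a plain word, a hyphenated word, a single letter
words <- c("Acalypha", "bio-material", "x")

old_word <- "^[a-zA-Z]+\\-?[a-zA-Z]+$"      # needs at least two letters
new_word <- "^[a-zA-Z]+(?:-[a-zA-Z]+)?$"    # one letter is enough

old_ok <- str_detect(words, old_word)
new_ok <- str_detect(words, new_word)

print(data.frame(words, old_ok, new_ok))
```

Only the single-letter word differs between the two columns: the original sub-pattern rejects it, the revised one accepts it.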

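Also, `\s` matches line breaks too, so if a header line is shorter than you assume, `.*` may grab a sequence line by mistake. You can use `\h` or `[ \t]` instead.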
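
The following sketch shows the failure mode on a deliberately short header. The input `short_header` is invented for illustration (a header with only one word after the accession, followed by two sequence-like lines); `loose` is the pattern from the question and `strict` is a variant that uses `\h`.

```r
library(stringr)

# Hypothetical input: the header has only one word after the accession number
short_header <- ">KU508975.1 Acalypha\nTAAATT\nGGCC\n"

loose  <- "(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n"
strict <- "(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"

# With \s, the match runs across line breaks and swallows the sequence lines
loose_result <- str_replace_all(short_header, loose, ">\\3 \\4\n")

# With \h, the pattern cannot cross a line break, so this header is left alone
strict_hit <- str_detect(short_header, strict)

cat(loose_result)
print(strict_hit)
```

With the loose pattern, the first sequence line ends up in the header as if it were the species name and the second one is dropped, which is why restricting the separators to horizontal whitespace is safer.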

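Note that `\n` at the end of the pattern is not necessary, because `.*` matches any 0+ characters other than line break characters in the ICU regex library (which is what `str_replace_all` in your current code uses).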
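
A quick way to confirm this is to replace up to the end of a line and check that the line break survives. The input string below is a made-up two-line example.

```r
library(stringr)

# Hypothetical two-line input
two_lines <- "header tail\nSEQ"

# .* stops at the line break, so only the rest of the first line is consumed
first_line <- str_extract(two_lines, ".*")
replaced   <- str_replace(two_lines, "header.*", "H")

print(first_line)
cat(replaced)
```

Because the line break is never consumed, the replacement string does not need to add `\n` back.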

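In general, you should only capture with `(...)` what you need to keep; everything else can simply be matched. Removing the extra capturing parentheses saves a little performance.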
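
One way to see how many groups a pattern captures is `str_match`, which returns one column for the full match plus one per capturing group. This is a sketch on a shortened, made-up header line; `loose` is the pattern from the question and `lean` keeps only the two groups that are reused.

```r
library(stringr)

# Hypothetical shortened header line
header <- ">KU508975.1 Acalypha australis maturase K\n"

loose <- "(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n"
lean  <- ">[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"

loose_match <- str_match(header, loose)  # full match + 5 capture groups
lean_match  <- str_match(header, lean)   # full match + 2 capture groups

print(ncol(loose_match))
print(ncol(lean_match))
print(lean_match[1, 2:3])
```

With fewer groups, the genus and species become groups 1 and 2, so the replacement string changes from `">\\3 \\4"` to `">\\1 \\2"`.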

If you add `(?m)^` at the start, you make sure that `>` is only matched at the start of a line.
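
The effect of the anchor can be illustrated by counting matches with and without it. The input below is invented: its first line contains an accession-like token in the middle of the line, and its second line is a real header.

```r
library(stringr)

# Hypothetical input: ">" appears mid-line on line 1 and at line start on line 2
notes <- "see >KU508975.1 in the notes\n>KC747175.1 Achyranthes bidentata\n"

accession <- ">[A-Z]{2}\\d{6}\\.?\\d"

unanchored <- str_count(notes, accession)                  # matches anywhere
anchored   <- str_count(notes, paste0("(?m)^", accession)) # only at line starts

print(c(unanchored = unanchored, anchored = anchored))
```

The `(?m)` flag makes `^` match at the start of every line rather than only at the start of the whole string, which matters here because the whole file is read into a single string.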

You can use

"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"

Code:
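
A minimal sketch of applying this pattern with `str_replace_all` follows. It assumes the replacement string `">\\1 \\2"` (the two remaining capture groups) and uses a made-up in-memory sample, `fasta`, built from the two headers in the question with shortened sequence lines; with a real file, the string would come from `read_file` as in the code above.

```r
library(stringr)

# Hypothetical sample built from the two headers in the question
fasta <- paste0(
  ">KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast\n",
  "TAAATTATG\n",
  ">KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast\n",
  "GATATATTA\n"
)

strict <- "(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"

# Keep ">" plus genus and species; the line break is untouched because .* stops before it
result <- str_replace_all(fasta, strict, ">\\1 \\2")
cat(result)
```

Sequence lines never start with `>`, so they are left exactly as they were.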


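I assume each sequence has a unique species? Otherwise your output will have duplicate FASTA IDs, which you don't want.

Yes, every sequence is unique.

Your current regex does not work for lines where group 3 or group 4 is a single-letter word. Also, `\s` matches line breaks too, so if a header line is shorter than you assume, `.*` may grab a sequence line by mistake.

@WiktorStribiżew Thanks for pointing that out, although this regex works fine for my data. Would you mind posting a stricter one?

See below.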