R 将匹配的字符串替换为其子组
我有一些DNA序列需要处理,它们看起来像:R 将匹配的字符串替换为其子组,r,regex,bioinformatics,R,Regex,Bioinformatics,我有一些DNA序列需要处理,它们看起来像: >KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT TTT
>KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast
TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA
TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT
TTTCTTATTCCAAAGAAATCGATTGCTATTTTTACAAAAAGTAATCCAAGATTTTTCTTGTTTCTATATA
>KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast
GATATATTAATACCTTACCCCGCTCATCTAGAAATCTTGGTTCAAACTCTCCGATACTGGTTGAAAGATG
CTTCTTCTTTGCATTTATTACGATTCTTTCTTTATGAGTGTCGTAATTGGATTAGTCTTATTACTCCAAA
AAAATCCATTTCCTTTTTGAAAAAAAGGAATCGAAGATTATTCTTGTTCCTATATAATTTCTATGTATGT
(\>)([A-Z]{2}\d{6}\.?\d)\s([a-zA-Z]+\-?[a-zA-Z]+)\s([a-zA-Z]+\-?[a-zA-Z]+)\s(.*)\n
library(tidyverse)
SequenceRaw <- read_file("PATH OF SEQUENCE FILE\\sequenceraw.fasta") ## e.g. sequenceraw.fasta
Sequence <- str_replace_all(SequenceRaw,
"(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n",
">\\3 \\4\n") ## Keep '>' and add a new line with '\n'
write_file(Sequence, "YOUR PATH\\sequence.fasta")
为了检测每个序列的标题行,我对正则表达式进行了编码:
>KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast
TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA
TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT
TTTCTTATTCCAAAGAAATCGATTGCTATTTTTACAAAAAGTAATCCAAGATTTTTCTTGTTTCTATATA
>KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast
GATATATTAATACCTTACCCCGCTCATCTAGAAATCTTGGTTCAAACTCTCCGATACTGGTTGAAAGATG
CTTCTTCTTTGCATTTATTACGATTCTTTCTTTATGAGTGTCGTAATTGGATTAGTCTTATTACTCCAAA
AAAATCCATTTCCTTTTTGAAAAAAAGGAATCGAAGATTATTCTTGTTCCTATATAATTTCTATGTATGT
(\>)([A-Z]{2}\d{6}\.?\d)\s([a-zA-Z]+\-?[a-zA-Z]+)\s([a-zA-Z]+\-?[a-zA-Z]+)\s(.*)\n
library(tidyverse)
SequenceRaw <- read_file("PATH OF SEQUENCE FILE\\sequenceraw.fasta") ## e.g. sequenceraw.fasta
Sequence <- str_replace_all(SequenceRaw,
"(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n",
">\\3 \\4\n") ## Keep '>' and add a new line with '\n'
write_file(Sequence, "YOUR PATH\\sequence.fasta")
我应该使用什么函数将整个匹配替换为它的group3+group4?此外,我在一个txt文件中有72个匹配项,如何在一次运行中替换它们?我自己用tidyverse软件包解决了这个问题:
>KU508975.1 Acalypha australis maturase K (matK) gene, partial cds; chloroplast
TAAATTATGTGTCAGAGCTATTAATACCTTACCCCATCCATCTAGAAAAATGGGTTCAAATTCTTCGATA
TTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTCATGAATATTGGAATTGGAACTGT
TTTCTTATTCCAAAGAAATCGATTGCTATTTTTACAAAAAGTAATCCAAGATTTTTCTTGTTTCTATATA
>KC747175.1 Achyranthes bidentata bio-material USDA:GRIN:PI613015 maturase K (matK) gene, partial cds; chloroplast
GATATATTAATACCTTACCCCGCTCATCTAGAAATCTTGGTTCAAACTCTCCGATACTGGTTGAAAGATG
CTTCTTCTTTGCATTTATTACGATTCTTTCTTTATGAGTGTCGTAATTGGATTAGTCTTATTACTCCAAA
AAAATCCATTTCCTTTTTGAAAAAAAGGAATCGAAGATTATTCTTGTTCCTATATAATTTCTATGTATGT
(\>)([A-Z]{2}\d{6}\.?\d)\s([a-zA-Z]+\-?[a-zA-Z]+)\s([a-zA-Z]+\-?[a-zA-Z]+)\s(.*)\n
library(tidyverse)
SequenceRaw <- read_file("PATH OF SEQUENCE FILE\\sequenceraw.fasta") ## e.g. sequenceraw.fasta
Sequence <- str_replace_all(SequenceRaw,
"(\\>)([A-Z]{2}\\d{6}\\.?\\d)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s([a-zA-Z]+\\-?[a-zA-Z]+)\\s(.*)\\n",
">\\3 \\4\n") ## Keep '>' and add a new line with '\n'
write_file(Sequence, "YOUR PATH\\sequence.fasta")
库(tidyverse)
SequenceRaw当前正则表达式不适用于组3或4包含单个字母单词的行,因为[a-zA-Z]+\\-?[a-zA-Z]+
匹配1+个字母,然后是可选的连字符,然后是1+个字母(这意味着必须至少有2个字母)。使用[a-zA-Z]+(?:-[a-zA-Z]+)?
,您可以匹配1+个字母,然后是可选的-
序列,然后是1+个字母
另外,\s
也匹配换行符,如果标题行比您假设的短,则*
可能会错误地抓取序列行。您可以使用\h
或[\t]
请注意,\n
在模式末尾不是必需的,因为*
将除换行符以外的任何0+字符与ICU正则表达式库匹配(它用于当前代码中,str\u replace\u all
)
一般来说,您应该只使用(…)
捕获您需要保留的内容,其他所有内容都可以匹配。删除额外的捕获括号,将节省一些性能
如果在开始处添加(?m)^
,您将确保只匹配行开始处的
你可以用
"(?m)^>[A-Z]{2}\\d{6}\\.?\\d\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?)\\h+([a-zA-Z]+(?:-[a-zA-Z]+)?).*"
看
代码:
序列我假设每个序列有一个唯一的物种?或者你的输出将有重复的FASTA ID,这是你不想要的。是的,每个序列都是唯一的。你当前的正则表达式不适用于第3组或第4组为单个字母单词的行。另外,\s
也匹配换行符,如果标题行比您假设的短,则*
可能会错误地获取序列行。@WiktorStribiżew感谢您指出,尽管此正则表达式很适合我的数据。你介意给出一个更严格的吗?见下文。