Python: extract the part of a string that lies between two matching substrings
I have three files, each containing a set of strings. File1 and File2 contain substrings of the strings in File3. I want to extract from File3 the part of the string that lies between the substring from File1 and the substring from File2. See the example below:
String in File1: AGGGCUUAGCUGCUUGUGAGCA
String in File2: UUCACAGUGGCUAAGUUCCGC
String in File3: CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG
Expected output for this example:
GGGUCCACACCAAGUCGUG
Here is a solution in R:
file1 <- "AGGGCUUAGCUGCUUGUGAGCA"
file2 <- "UUCACAGUGGCUAAGUUCCGC"
file3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
# create a regular expression
pattern <- paste0(".*", file1, "(.*)", file2, ".*")
# extract the substring
sub(pattern, "\\1", file3)
# [1] "GGGUCCACACCAAGUCGUG"
In Python:
>>> import re
>>> a='AGGGCUUAGCUGCUUGUGAGCA'
>>> b='UUCACAGUGGCUAAGUUCCGC'
>>> c='CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG'
>>> regex = a + '(.*?)' + b
>>> regex
'AGGGCUUAGCUGCUUGUGAGCA(.*?)UUCACAGUGGCUAAGUUCCGC'
>>> re.findall(regex,c)
['GGGUCCACACCAAGUCGUG']
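The pattern above is built by plain concatenation, which works here because the sequences contain no regex metacharacters. For more general inputs, a hedged sketch is to pass both boundary strings through `re.escape` first. The helper name `between` is an illustrative assumption, not part of the original answer.

```python
import re

def between(left, right, text):
    """Return the shortest text found between left and right, or None."""
    # re.escape guards against regex metacharacters in the boundary strings
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None

a = 'AGGGCUUAGCUGCUUGUGAGCA'
b = 'UUCACAGUGGCUAAGUUCCGC'
c = 'CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG'
print(between(a, b, c))  # GGGUCCACACCAAGUCGUG
```

Returning `None` when either boundary is missing lets the caller tell "no match" apart from an empty string between two adjacent boundaries.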
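Using strapplyc from the gsubfn package:
Try this. We assume there is only one instance each of s1 and s2, or, if there are several, that you want the string between the first instance of s1 and the last instance of s2. If there can be multiple instances and you want something different, please add that to the question.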
s1 <- "AGGGCUUAGCUGCUUGUGAGCA"
s2 <- "UUCACAGUGGCUAAGUUCCGC"
s3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
library(gsubfn)
fn$strapplyc(s3, "$s1(.*)$s2", simplify = TRUE)
## [1] "GGGUCCACACCAAGUCGUG"
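The assumption stated in this answer matters as soon as a boundary string occurs more than once. The Python sketch below illustrates the difference on a made-up toy string (not data from the question): a greedy group `(.*)` runs to the last occurrence of the right boundary, while a lazy group `(.*?)` stops at the first.

```python
import re

left, right = "AA", "CC"
text = "xxAAgggCChhhCCyy"  # toy string: the right boundary occurs twice

# Greedy: capture as much as possible, up to the LAST right boundary
greedy = re.search(re.escape(left) + "(.*)" + re.escape(right), text).group(1)
# Lazy: capture as little as possible, up to the FIRST right boundary
lazy = re.search(re.escape(left) + "(.*?)" + re.escape(right), text).group(1)

print(greedy)  # gggCChhh
print(lazy)    # ggg
```

For the example in the question both forms give the same result, because each boundary occurs exactly once.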
In Python, using str.find:
string1 = "AGGGCUUAGCUGCUUGUGAGCA"
string2 = "UUCACAGUGGCUAAGUUCCGC"
string_main = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
# The middle part starts just past the end of string1
start = string_main.find(string1) + len(string1)
# Look for string2 only after string1, so an earlier occurrence is ignored
end = string_main.find(string2, start)
print(string_main[start:end])
In Perl, you can try the following code:
use strict;
use warnings;
my $file1 = "AGGGCUUAGCUGCUUGUGAGCA";
my $file2 = "UUCACAGUGGCUAAGUUCCGC";
my $file3 = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG";
my ($result) = $file3 =~ /$file1(.*?)$file2/;
print $result;
Output:
GGGUCCACACCAAGUCGUG
For the input you gave, the following will work:
f1 <- "AGGGCUUAGCUGCUUGUGAGCA"
f2 <- "UUCACAGUGGCUAAGUUCCGC"
f3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
strsplit(f3, paste(f1, f2, sep='|'))[[1]][2]
# [1] "GGGUCCACACCAAGUCGUG"
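For readers following the Python side of the thread, a rough equivalent of this split-based approach is sketched below. Like the R version, it assumes each boundary occurs exactly once and in order. The `re.escape` calls are an added precaution and not part of the original answer.

```python
import re

f1 = "AGGGCUUAGCUGCUUGUGAGCA"
f2 = "UUCACAGUGGCUAAGUUCCGC"
f3 = "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"

# Split on either boundary; the piece between them is the second element
pieces = re.split(re.escape(f1) + "|" + re.escape(f2), f3)
middle = pieces[1]
print(middle)  # GGGUCCACACCAAGUCGUG
```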
Using qdapRegex in R:
f1 <- "AGGGCUUAGCUGCUUGUGAGCA"
f2 <- "UUCACAGUGGCUAAGUUCCGC"
f3 <- "CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG"
library(qdapRegex)
rm_between(f3, f1, f2, extract=TRUE)
## [[1]]
## [1] "GGGUCCACACCAAGUCGUG"
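Comments:
What are these two substrings? Post your code here so we can see where you are running into a problem.
I have edited my question. I have multiple strings in files 1, 2 and 3.
@user3741035 Do you want to use every combination of the strings in File1 and File2?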
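The comments leave open how the lines of the three files should be paired. The sketch below takes one possible reading: it assumes the files are line-aligned, so that line i of File1 and File2 belongs to line i of File3. The function name `extract_all` is hypothetical.

```python
def extract_all(lefts, rights, mains):
    """For each aligned triple, return the text between left and right, or None."""
    results = []
    for left, right, main in zip(lefts, rights, mains):
        # Strip trailing newlines so lines read from files compare cleanly
        left, right, main = left.strip(), right.strip(), main.strip()
        start = main.find(left)
        if start == -1:
            results.append(None)  # left boundary missing
            continue
        start += len(left)
        # Search for the right boundary only after the left one
        end = main.find(right, start)
        results.append(main[start:end] if end != -1 else None)
    return results

print(extract_all(
    ["AGGGCUUAGCUGCUUGUGAGCA\n"],
    ["UUCACAGUGGCUAAGUUCCGC\n"],
    ["CUGAGGAGCAGGGCUUAGCUGCUUGUGAGCAGGGUCCACACCAAGUCGUGUUCACAGUGGCUAAGUUCCGCCCCCCAG\n"],
))  # ['GGGUCCACACCAAGUCGUG']
```

Open file objects iterate line by line, so they can be passed in directly in place of the lists. If every combination of File1 and File2 strings is wanted instead, the `zip` would have to be replaced by nested loops.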