Pattern matching messy file names with regular expressions (Python, R, regex)

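I don't have much experience with regular expressions, but I need to parse 100 file names to generate a "metadata" dataset. I have been able to generate text files containing the file paths and file names. Parsing out the full file name was simple, but I also need to parse the "sample ID" out of each file name.

The problem is that the syntax of the sample IDs is all over the place (see the example data below: the goal is to get from the "sample" column to the "ID" column). I tried a series of strsplit() commands, but that is very cumbersome and essentially doesn't work. I also tried writing a function with a few IF statements based on the syntax structure. I feel that this is still not a good solution, because it relies on me identifying the different syntaxes by hand beforehand, and I could easily miss something since I have to do it by eye.

This looks like a regular expression problem to me, but I could use some resources to get started. If possible, I would like to do this in R or Python. Thanks for any resources or packages/modules that might help.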

dput(head(brain_ref, 25))
structure(list(file = c("/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/BXH12_1_brain_total_RNA_cDNA_GTCCGC.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/BXH12_2_brain_total_RNA_cDNA_CAGATC.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB13_1_brain_total_RNA_cDNA_ATGTCA.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB13_2_brain_total_RNA_cDNA_GTGAAA.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB17_1_brain_total_RNA_cDNA_CCGTCC.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB17_2_brain_total_RNA_cDNA_ATGTCA.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB2_1_brain_total_RNA_cDNA_GTCCGC.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB2_2_brain_total_RNA_cDNA_CTTGTA.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB25_1_brain_total_RNA_cDNA_AGTTCC.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB25_2_brain_total_RNA_cDNA_AGTCAA.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB27_1_brain_total_RNA_cDNA_CGATGT.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB27_2_brain_total_RNA_cDNA_AGTTCC.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB7_1_brain_total_RNA_cDNA_ACAGTG.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB7_2_brain_total_RNA_cDNA_AGTCAA.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/SHR_1_brain_total_RNA_cDNA_GCCAAT.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/SHR_2_brain_total_RNA_cDNA_TGACCA.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/ACI-SegHsd-2-brain-total-RNA_S17.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH2-3-brain-total-RNA_S4.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH5-3-brain-total-RNA_S3.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH8-3-brain-total-RNA_S5.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Cop-CrCrl-2-brain-total-RNA_S10.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Dark-Agouti-1-brain-total-RNA_S16.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Dark-Agouti-2-brain-total-RNA_S13.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/F344-NCI-1-brain-total-RNA_S18.genes.results", 
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/F344-NCI-2-brain-total-RNA_S15.genes.results"
), sample = c("BXH12_1_brain_total_RNA_cDNA_GTCCGC", "BXH12_2_brain_total_RNA_cDNA_CAGATC", 
"HXB13_1_brain_total_RNA_cDNA_ATGTCA", "HXB13_2_brain_total_RNA_cDNA_GTGAAA", 
"HXB17_1_brain_total_RNA_cDNA_CCGTCC", "HXB17_2_brain_total_RNA_cDNA_ATGTCA", 
"HXB2_1_brain_total_RNA_cDNA_GTCCGC", "HXB2_2_brain_total_RNA_cDNA_CTTGTA", 
"HXB25_1_brain_total_RNA_cDNA_AGTTCC", "HXB25_2_brain_total_RNA_cDNA_AGTCAA", 
"HXB27_1_brain_total_RNA_cDNA_CGATGT", "HXB27_2_brain_total_RNA_cDNA_AGTTCC", 
"HXB7_1_brain_total_RNA_cDNA_ACAGTG", "HXB7_2_brain_total_RNA_cDNA_AGTCAA", 
"SHR_1_brain_total_RNA_cDNA_GCCAAT", "SHR_2_brain_total_RNA_cDNA_TGACCA", 
"ACI-SegHsd-2-brain-total-RNA_S17", "BXH2-3-brain-total-RNA_S4", 
"BXH5-3-brain-total-RNA_S3", "BXH8-3-brain-total-RNA_S5", "Cop-CrCrl-2-brain-total-RNA_S10", 
"Dark-Agouti-1-brain-total-RNA_S16", "Dark-Agouti-2-brain-total-RNA_S13", 
"F344-NCI-1-brain-total-RNA_S18", "F344-NCI-2-brain-total-RNA_S15"
), batch = c("batch1", "batch1", "batch1", "batch1", "batch1", 
"batch1", "batch1", "batch1", "batch1", "batch1", "batch1", "batch1", 
"batch1", "batch1", "batch1", "batch1", "batch10", "batch10", 
"batch10", "batch10", "batch10", "batch10", "batch10", "batch10", 
"batch10"), ID = c("BXH12_1", "BXH12_2", "HXB13_1", "HXB13_2", 
"HXB17_1", "HXB17_2", "HXB2_1", "HXB2_2", "HXB25_1", "HXB25_2", 
"HXB27_1", "HXB27_2", "HXB7_1", "HXB7_2", "SHR_1", "SHR_2", "ACI-SegHsd_2", 
"BXH2_3", "BXH5_3", "BXH8_3", "Cop-CrCrl_2", "Dark-Agouti_1", 
"Dark-Agouti_2", "F344-NCI_1", "F344-NCI_2")), row.names = c(NA, 
25L), class = "data.frame")

If all samples contain _brain or -brain and you want to keep everything before it, you can do the following:

names <- c("ABBA-1_2-brain-total2", "BABBA-2_2-brain-total2", "ARA_1-1_2-brain-total2")
gsub("(.*).brain.*", "\\1", names)
#> [1] "ABBA-1_2"  "BABBA-2_2" "ARA_1-1_2"
Inspired by @r2evans:

samples

In R, we can also remove the substring instead of capturing it:

sub("[-_]brain.*", "", names)
#[1] "ABBA-1_2"  "BABBA-2_2" "ARA_1-1_2"

Or use trimws:

trimws(names, whitespace = "[-_]brain.*")
#[1] "ABBA-1_2"  "BABBA-2_2" "ARA_1-1_2"
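For readers working in Python, the same "delete the suffix" idea can be sketched with re.sub. This is only an illustration: the helper name strip_brain_suffix is made up for this example, and it assumes every name has at most one -brain or _brain marker that separates the ID from the rest.

```python
import re

# Sample names from the answers above, plus one underscore-style name
# from the question's data.
names = [
    "ABBA-1_2-brain-total2",
    "BABBA-2_2-brain-total2",
    "ARA_1-1_2-brain-total2",
    "SHR_1_brain_total_RNA_cDNA_GCCAAT",
]

# A dash or underscore, the literal "brain", and everything after it.
SUFFIX = re.compile(r"[-_]brain.*")


def strip_brain_suffix(name: str) -> str:
    """Return the part of `name` before the first -brain/_brain marker.

    Names without such a marker are returned unchanged.
    """
    return SUFFIX.sub("", name, count=1)


cleaned = [strip_brain_suffix(n) for n in names]
print(cleaned)
# ['ABBA-1_2', 'BABBA-2_2', 'ARA_1-1_2', 'SHR_1']
```

Like the R versions, this leaves the separator between strain and replicate number untouched, so a dash-style name such as BXH2-3-brain-total-RNA_S4 would come out as BXH2-3 rather than BXH2_3.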

Using the OP's example:

library(dplyr)
library(stringr)
brain_ref <-  brain_ref %>%
                 mutate(newsample = str_remove(sample,  "[_-]brain.*"))
Data:

names <- c("ABBA-1_2-brain-total2", "BABBA-2_2_brain-total2", "ARA_1-1_2-brain-total2")

Why not, in Python:

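import re
names = ["ABBA-1_2-brain-total2", "BABBA-2_2_brain-total2", "ARA_1-1_2-brain-total2"]
for name in names:
    m = re.match(r'(.*)[_\-]brain[_\-]total', name)
    print(m.group(1) if m else name)  # fall back to the full name if there is no match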
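Since the question worries about missing an unusual syntax when checking by eye, one option is to let the code report every name the pattern did not match. The sketch below reuses the pattern from the loop above. The function name split_matches and the third sample name are hypothetical and only serve the illustration.

```python
import re

# Same pattern as in the loop above.
PATTERN = re.compile(r"(.*)[_\-]brain[_\-]total")


def split_matches(names):
    """Return (ids, unmatched).

    ids maps each matching name to the captured prefix;
    unmatched lists the names the pattern could not handle.
    """
    ids, unmatched = {}, []
    for name in names:
        m = PATTERN.match(name)
        if m:
            ids[name] = m.group(1)
        else:
            unmatched.append(name)
    return ids, unmatched


names = [
    "BXH12_1_brain_total_RNA_cDNA_GTCCGC",
    "Dark-Agouti-1-brain-total-RNA_S16",
    "SHR_1braintotalRNA_S8",  # hypothetical: no separators around "brain"
]

ids, unmatched = split_matches(names)
print(ids)
print("needs a closer look:", unmatched)
```

Anything that lands in unmatched is a syntax the pattern does not cover yet, so new variants show up explicitly instead of having to be spotted in a long list.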

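Can you show the example data with dput in R instead of an image?

sub("-([0-9]+)-brain.*", "_\\1", "SHR-1-brain-total-RNA_S6")

@r2evans Sometimes it is an underscore instead of a dash.

Okay, sub("[-_]([0-9]+)-brain.*", "_\\1", "SHR-1-brain-total-RNA_S6")

(1) Please do not post images of code/data/errors: they cannot be copied or searched (SEO), they break screen readers, and they may not display well on some mobile devices. Please include the code or data directly (e.g., dput(head(x)) or data.frame(...)). (2) Even if you say there is a link (and there isn't one), I would still advise against it: links go stale, which leaves a question that cannot be reproduced. Please follow akrun's suggestion and include sample data using dput or programmatically (e.g. with data.frame).

Try "SHR-1-brain-total-RNA-S6": it should return SHR_1, but it doesn't. (See my comment from a few minutes ago.)

I'm not sure what you mean. I tried "SHR-1-brain-total-RNA-S6" and it does return SHR-1, which is what it should return. I don't see a reason for the downvote?

Sorry, I thought that was clear. gsub("(.*).brain.*", "\\1", "SHR-1-brain-total-RNA-S6") returns "SHR-1", but the OP showed "SHR_1". So when I said "should return", I was suggesting that "SHR-1" is wrong. Since the OP has edited the question and changed the samples and the expected values, there is no longer a string that starts with SHR-1 and expects SHR_1, so I am willing to remove my downvote. Because SO does not let me retract a downvote unless the answer is edited, I made a (very trivial) edit to your answer so that I could retract it. Feel free to roll back my edit if you like. Thanks.

This almost works, but not in cases where the number is right next to "brain" (e.g. "F344"Stm_1braintotalRNAb30_S8_L001_R1_001").

@HarrySmith See whether it works now, with an asterisk added after the second [_-] and before brain. [_-]* now means zero or more repetitions of the pattern.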
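The two issues raised in these comments (SHR-1 versus SHR_1, and a replicate number that sits right next to "brain") can be handled together by capturing the strain and the replicate number separately and re-joining them with an underscore. The following is a minimal sketch, not a definitive solution. It assumes that an ID is always "strain, separator, replicate number" immediately before "brain". The names SAMPLE_ID and sample_id are made up for this example, and the expected values are copied from the sample and ID columns of the question's data.

```python
import re
from typing import Optional

# Strain (as short as possible), one separator, the replicate number,
# zero or more separators, then the literal "brain".
SAMPLE_ID = re.compile(r"^(?P<strain>.+?)[-_](?P<rep>\d+)[-_]*brain")


def sample_id(sample: str) -> Optional[str]:
    """Return '<strain>_<replicate>' or None if the name does not fit."""
    m = SAMPLE_ID.match(sample)
    if m is None:
        return None
    return f"{m.group('strain')}_{m.group('rep')}"


# sample -> expected ID, taken from the question's data.
examples = {
    "BXH12_1_brain_total_RNA_cDNA_GTCCGC": "BXH12_1",
    "SHR_1_brain_total_RNA_cDNA_GCCAAT": "SHR_1",
    "ACI-SegHsd-2-brain-total-RNA_S17": "ACI-SegHsd_2",
    "Dark-Agouti-1-brain-total-RNA_S16": "Dark-Agouti_1",
    "F344-NCI-2-brain-total-RNA_S15": "F344-NCI_2",
}

for sample, expected in examples.items():
    got = sample_id(sample)
    status = "ok" if got == expected else "MISMATCH"
    print(f"{sample} -> {got} [{status}]")
```

Because the separators before "brain" are optional ([-_]*), a name where the number touches "brain" directly still matches. Names that return None do not follow the assumed layout and are worth inspecting by hand.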