Python: pattern matching messy file names with regular expressions
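I don't have much experience with regular expressions, but I need to parse 100 file names to build a "metadata" dataset. I have already generated a text file containing the file paths and file names. Parsing out the full file name is straightforward, but I also need to extract the "sample ID" from each file name.

The problem is that the syntax of the sample ID is all over the place (see the example data below: the goal is to get from the "sample" column to the "ID" column). I tried a series of strsplit() commands, but that was very cumbersome and essentially did not work. I also tried writing a function with IF statements based on the different syntax structures. That still does not feel like a good solution, because it relies on me identifying each syntax by hand beforehand, and I could easily miss one since I have to do it by eye.

This looks like a regex problem to me, but I could use some resources to help me get started. If possible, I would like to do this in either R or Python. Thanks for any resources or packages/modules that might be useful.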
dput(head(brain_ref, 25))
structure(list(file = c("/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/BXH12_1_brain_total_RNA_cDNA_GTCCGC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/BXH12_2_brain_total_RNA_cDNA_CAGATC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB13_1_brain_total_RNA_cDNA_ATGTCA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB13_2_brain_total_RNA_cDNA_GTGAAA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB17_1_brain_total_RNA_cDNA_CCGTCC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB17_2_brain_total_RNA_cDNA_ATGTCA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB2_1_brain_total_RNA_cDNA_GTCCGC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB2_2_brain_total_RNA_cDNA_CTTGTA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB25_1_brain_total_RNA_cDNA_AGTTCC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB25_2_brain_total_RNA_cDNA_AGTCAA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB27_1_brain_total_RNA_cDNA_CGATGT.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB27_2_brain_total_RNA_cDNA_AGTTCC.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB7_1_brain_total_RNA_cDNA_ACAGTG.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/HXB7_2_brain_total_RNA_cDNA_AGTCAA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/SHR_1_brain_total_RNA_cDNA_GCCAAT.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch1/ensembl_v96/SHR_2_brain_total_RNA_cDNA_TGACCA.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/ACI-SegHsd-2-brain-total-RNA_S17.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH2-3-brain-total-RNA_S4.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH5-3-brain-total-RNA_S3.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/BXH8-3-brain-total-RNA_S5.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Cop-CrCrl-2-brain-total-RNA_S10.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Dark-Agouti-1-brain-total-RNA_S16.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/Dark-Agouti-2-brain-total-RNA_S13.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/F344-NCI-1-brain-total-RNA_S18.genes.results",
"/data/rn6/quantitation/brainTotalRNA/RI/batch10/ensembl_v96/F344-NCI-2-brain-total-RNA_S15.genes.results"
), sample = c("BXH12_1_brain_total_RNA_cDNA_GTCCGC", "BXH12_2_brain_total_RNA_cDNA_CAGATC",
"HXB13_1_brain_total_RNA_cDNA_ATGTCA", "HXB13_2_brain_total_RNA_cDNA_GTGAAA",
"HXB17_1_brain_total_RNA_cDNA_CCGTCC", "HXB17_2_brain_total_RNA_cDNA_ATGTCA",
"HXB2_1_brain_total_RNA_cDNA_GTCCGC", "HXB2_2_brain_total_RNA_cDNA_CTTGTA",
"HXB25_1_brain_total_RNA_cDNA_AGTTCC", "HXB25_2_brain_total_RNA_cDNA_AGTCAA",
"HXB27_1_brain_total_RNA_cDNA_CGATGT", "HXB27_2_brain_total_RNA_cDNA_AGTTCC",
"HXB7_1_brain_total_RNA_cDNA_ACAGTG", "HXB7_2_brain_total_RNA_cDNA_AGTCAA",
"SHR_1_brain_total_RNA_cDNA_GCCAAT", "SHR_2_brain_total_RNA_cDNA_TGACCA",
"ACI-SegHsd-2-brain-total-RNA_S17", "BXH2-3-brain-total-RNA_S4",
"BXH5-3-brain-total-RNA_S3", "BXH8-3-brain-total-RNA_S5", "Cop-CrCrl-2-brain-total-RNA_S10",
"Dark-Agouti-1-brain-total-RNA_S16", "Dark-Agouti-2-brain-total-RNA_S13",
"F344-NCI-1-brain-total-RNA_S18", "F344-NCI-2-brain-total-RNA_S15"
), batch = c("batch1", "batch1", "batch1", "batch1", "batch1",
"batch1", "batch1", "batch1", "batch1", "batch1", "batch1", "batch1",
"batch1", "batch1", "batch1", "batch1", "batch10", "batch10",
"batch10", "batch10", "batch10", "batch10", "batch10", "batch10",
"batch10"), ID = c("BXH12_1", "BXH12_2", "HXB13_1", "HXB13_2",
"HXB17_1", "HXB17_2", "HXB2_1", "HXB2_2", "HXB25_1", "HXB25_2",
"HXB27_1", "HXB27_2", "HXB7_1", "HXB7_2", "SHR_1", "SHR_2", "ACI-SegHsd_2",
"BXH2_3", "BXH5_3", "BXH8_3", "Cop-CrCrl_2", "Dark-Agouti_1",
"Dark-Agouti_2", "F344-NCI_1", "F344-NCI_2")), row.names = c(NA,
25L), class = "data.frame")
If all samples contain _brain or -brain and you want to keep everything before it, you can do the following:
names = c("ABBA-1_2-brain-total2", "BABBA-2_2-brain-total2", "ARA_1-1_2-brain-total2")
gsub("(.*).brain.*", "\\1", names)
#> [1] "ABBA-1_2" "BABBA-2_2" "ARA_1-1_2"
Inspired by @r2evans:
samples
In R, we can also remove the substring instead of capturing it:
sub("[-_]brain.*", "", names)
#[1] "ABBA-1_2" "BABBA-2_2" "ARA_1-1_2"
或使用trimws
trimws(names, whitespace = "[-_]brain.*")
#[1] "ABBA-1_2" "BABBA-2_2" "ARA_1-1_2"
Using the OP's example:
library(dplyr)
library(stringr)
brain_ref <- brain_ref %>%
mutate(newsample = str_remove(sample, "[_-]brain.*"))
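For readers who prefer Python, the same suffix-removal idea can be sketched with re.sub from the standard library. The helper name strip_brain_suffix and the three-element sample list (taken from the OP's data) are illustrative assumptions, not part of the original answer. Like str_remove above, this keeps the separator in front of the replicate number, so "ACI-SegHsd-2" is not yet converted to "ACI-SegHsd_2".

```python
import re

# Matches the first "-brain" or "_brain" and everything that follows it.
BRAIN_SUFFIX = re.compile(r"[-_]brain.*")


def strip_brain_suffix(sample: str) -> str:
    """Hypothetical helper: drop the first -brain/_brain and the rest of the string."""
    return BRAIN_SUFFIX.sub("", sample, count=1)


# A few values copied from the OP's "sample" column.
samples = [
    "BXH12_1_brain_total_RNA_cDNA_GTCCGC",
    "ACI-SegHsd-2-brain-total-RNA_S17",
    "Dark-Agouti-1-brain-total-RNA_S16",
]

cleaned = [strip_brain_suffix(s) for s in samples]
print(cleaned)
```

Strings without a -brain or _brain part are returned unchanged, so it may be worth checking for those separately.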
Data
names <- c("ABBA-1_2-brain-total2", "BABBA-2_2_brain-total2", "ARA_1-1_2-brain-total2")
Why not:
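import re
names = ["ABBA-1_2-brain-total2", "BABBA-2_2_brain-total2", "ARA_1-1_2-brain-total2"]
for name in names:
    m = re.match(r'(.*)[_\-]brain[_\-]total', name)
    if m:  # re.match returns None when the name does not fit the pattern
        print(m.group(1))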
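Since the question worries about missing an unexpected naming scheme when checking by eye, one possible extension is to collect the names that do not match instead of letting them pass silently. The sketch below reuses the pattern from this answer; the function name split_matches and the third test name are hypothetical and only serve to show a non-matching case.

```python
import re

# Same pattern as in the answer: capture everything before "-brain-total" / "_brain_total".
PATTERN = re.compile(r"(.*)[_\-]brain[_\-]total")


def split_matches(names):
    """Hypothetical helper: return (ids, unmatched) for a list of sample names."""
    ids = {}        # name -> captured prefix
    unmatched = []  # names the pattern could not handle
    for name in names:
        m = PATTERN.match(name)
        if m is None:
            unmatched.append(name)
        else:
            ids[name] = m.group(1)
    return ids, unmatched


names = [
    "ABBA-1_2-brain-total2",
    "BABBA-2_2_brain-total2",
    "XYZ_1braintotal",  # made-up name with no separator before "brain"
]
ids, unmatched = split_matches(names)
print(ids)
print(unmatched)
```

Anything that ends up in unmatched points to a naming scheme the pattern does not cover yet.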
Can you show the example data with dput in R instead of an image?
sub("-([0-9]+)-brain.*", "_\\1", "SHR-1-brain-total-RNA_S6")
@r2evans Sometimes it is an underscore rather than a dash.
OK, sub("[-_]([0-9]+)[-_]brain.*", "_\\1", "SHR-1-brain-total-RNA_S6")
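(1) Please do not post images of code/data/errors: they cannot be copied or searched (SEO), they break screen readers, and they may not display well on some mobile devices. Please include the code or data directly (e.g., dput(head(x)) or data.frame(...)). (2) Even if you say there is a link (and there isn't one), I would still point out that links go stale, which leaves the question unreproducible. Please follow akrun's suggestion and include the sample data using dput or programmatically (data.frame or vec).

Try "SHR-1-brain-total-RNA-S6", which should return SHR_1 but does not. (See my comment from a few minutes ago.)

I'm not sure what you mean. I tried "SHR-1-brain-total-RNA-S6" and it does return SHR-1, which is what it should return. I don't see a reason for the downvote?

Sorry, I thought that was clear. gsub("(.*).brain.*", "\\1", "SHR-1-brain-total-RNA-S6") returns "SHR-1", but the OP showed "SHR_1". So when I said "should return", I was suggesting that "SHR-1" is wrong. Since the OP has edited the question and changed the samples and expected values, there is no longer a string that starts with SHR-1 and expects SHR_1, so I am willing to remove my downvote. Since SO does not let me retract a downvote unless the answer is edited, I made a (very trivial) edit to your answer so that I could retract it. Feel free to roll back my edit if you like. Thanks.

This almost works, except in cases where the number is directly adjacent to "brain" (e.g. "F344-Stm_1braintotalRNAb30_S8_L001_R1_001").

@HarrySmith See if it works now with an asterisk added after the second [_-] and before brain. [_-]* now means zero or more repetitions of the pattern.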
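Putting the points from these comments together (the separator may be a dash or an underscore, it may be missing right before "brain", and the OP's ID column uses an underscore before the replicate number), a rough Python sketch could look like the following. The function name sample_id is hypothetical, and the pattern assumes that every ID is a strain prefix followed by a separator and a replicate number, which holds for the sample data shown but may not hold for every file.

```python
import re

# Lazy prefix, one separator, the replicate number,
# then zero or more separators ([-_]*) before "brain".
ID_PATTERN = re.compile(r"(.+?)[-_](\d+)[-_]*brain")


def sample_id(sample):
    """Hypothetical helper: return 'prefix_number', or None if the name does not fit."""
    m = ID_PATTERN.match(sample)
    if m is None:
        return None
    return f"{m.group(1)}_{m.group(2)}"


# Values from the OP's "sample" column plus one made-up name
# where the number sits directly next to "brain".
for s in [
    "BXH12_1_brain_total_RNA_cDNA_GTCCGC",
    "ACI-SegHsd-2-brain-total-RNA_S17",
    "F344-NCI-1-brain-total-RNA_S18",
    "SHR_1braintotalRNA",  # hypothetical, no separator before "brain"
]:
    print(s, "->", sample_id(s))
```

Returning None for names that do not fit makes it easy to list the files that still need a closer look.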