R: How can I generate a variable by extracting specific rows?
I have the data shown below. Each gene name (for example, ABCB9) is followed by its SNP names, given either as rs numbers or as c_ positions. In SNPs named in the c_pos000000 style, the number after "c" is the chromosome number, ranging from 1 to 22.
ABCB9
rs11057374
rs7138100
c22_pos41422393
rs12309481
END
ABCC10
rs1214748
END
HDAC9
rs928578
rs10883039
END
HCN2
rs12428035
rs9561933
c2_pos102345
rs3848077
rs3099362
END
Using this data, I would like to get the following output:
rs11057374 ABCB9
rs7138100 ABCB9
c22_pos41422393 ABCB9
rs12309481 ABCB9
rs1214748 ABCC10
rs928578 HDAC9
rs10883039 HDAC9
rs12428035 HCN2
rs9561933 HCN2
c2_pos102345 HCN2
rs3848077 HCN2
rs3099362 HCN2
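The blank lines and the END markers are not needed in the output.
How can I generate this output in R or Linux?

We can do this in a few steps. Read the file with readLines and strip leading/trailing whitespace with trimws. Split the resulting vector (lines1) by a grouping vector built from the blank entries, and remove the blank or END strings from each list element. Then set the names of the list using the first observation of each list element (sapply(lst1, `[`, 1)), extract all elements except the first, and stack the result.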
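The steps described above can be sketched roughly as follows. This is a minimal illustration rather than the answerer's exact code: the data is inlined instead of being read with readLines, the names is_end, grp, genes, snps and res are made up for the example, and the groups are built from the END markers (an assumption, since the data shown here contains no blank lines).

```r
# Sketch only: the input is inlined; with a real file you would use
# lines1 <- trimws(readLines("yourdata.txt")) instead.
lines1 <- trimws(c(
  "ABCB9", "rs11057374", "rs7138100", "c22_pos41422393", "rs12309481", "END",
  "ABCC10", "rs1214748", "END",
  "HDAC9", "rs928578", "rs10883039", "END",
  "HCN2", "rs12428035", "rs9561933", "c2_pos102345", "rs3848077",
  "rs3099362", "END"
))

# Drop blank lines, if there are any
lines1 <- lines1[nzchar(lines1)]

# Assumption: every block is closed by "END", so the number of END markers
# seen before a line identifies the block that line belongs to.
is_end <- lines1 == "END"
grp <- cumsum(c(FALSE, head(is_end, -1)))

# One character vector per gene block, with the END markers removed
lst1 <- split(lines1[!is_end], grp[!is_end])

# The first element of each block is the gene name, the rest are its SNPs
genes <- vapply(lst1, `[`, character(1), 1)
snps <- lapply(lst1, `[`, -1)

# stack() turns the named list into a two-column data frame (values, ind)
res <- stack(setNames(snps, genes))
names(res) <- c("snp", "gene")
res$snp <- as.character(res$snp)
res$gene <- as.character(res$gene)

print(res, row.names = FALSE)
```

To save the table without headers or quotes, something like write.table(res, "snp_gene.txt", quote = FALSE, row.names = FALSE, col.names = FALSE) should work; the output file name here is only a placeholder.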
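Use the original files to get the SNP-to-gene map instead of the processed file. As you mentioned, this data is the output of a plink run with the --make-set gene.list and --write-set options, so we already have the gene.list and mydata.map files. With these two files we can do the following: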
library(data.table)

# gene list file
geneList <- data.table(
  chr = 1:2,
  start = c(10, 40),
  end = c(13, 45),
  gene = paste0("gene_", 1:2))
#    chr start end   gene
# 1:   1    10  13 gene_1
# 2:   2    40  45 gene_2

# map file
map <- data.table(
  chr = c(1, 1, 1, 2, 2, 2, 3),
  snp = paste0("snp_", 1:7),
  cm = 0,
  bp = c(10, 11, 15, 40, 41, 49, 100))

# prepare for merging, rename colnames to match gene list colnames
map <- map[, list(chr, start = bp, end = bp, snp)]
#    chr start end   snp
# 1:   1    10  10 snp_1
# 2:   1    11  11 snp_2
# 3:   1    15  15 snp_3
# 4:   2    40  40 snp_4
# 5:   2    41  41 snp_5
# 6:   2    49  49 snp_6
# 7:   3   100 100 snp_7

# set key for merging
setkey(map, chr, start, end)

# merge and subset snp and gene columns
foverlaps(geneList, map)[, list(snp, gene)]
#      snp   gene
# 1: snp_1 gene_1
# 2: snp_2 gene_1
# 3: snp_4 gene_2
# 4: snp_5 gene_2
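With real files, the same overlap join might look like the sketch below. It assumes that gene.list has four whitespace-separated columns (chromosome, start, end, gene name) and that mydata.map follows the usual four-column plink .map layout (chromosome, SNP id, genetic distance, base-pair position), both without a header row. The inline text only stands in for the files so the example runs on its own, and res is an illustrative name.

```r
library(data.table)

# Assumption: inline text stands in for the real files. With actual files,
# replace the text = ... argument with the path, e.g. fread("gene.list", ...).
geneList <- fread(
  text = "1 10 13 gene_1\n2 40 45 gene_2",
  sep = " ", header = FALSE,
  col.names = c("chr", "start", "end", "gene"))

map <- fread(
  text = paste(
    "1 snp_1 0 10", "1 snp_2 0 11", "1 snp_3 0 15",
    "2 snp_4 0 40", "2 snp_5 0 41", "2 snp_6 0 49",
    "3 snp_7 0 100", sep = "\n"),
  sep = " ", header = FALSE,
  col.names = c("chr", "snp", "cm", "bp"))

# Treat each SNP as a zero-width interval so it can be overlapped with genes
map <- map[, list(chr, start = bp, end = bp, snp)]
setkey(map, chr, start, end)

# Overlap join: keep the SNPs that fall inside each gene's interval
res <- foverlaps(geneList, map)[, list(snp, gene)]
print(res)
```

Reading both files with the same column types avoids mixing an integer chromosome column with a numeric one during the join.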
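Also, see more overlap-merge examples/functions.

Comments:
@akrun You're welcome, I'm glad it helped, and thank you for your help too.
Where does the data come from? It looks like MSigDB. Can you provide a link?
@zx8754 The data comes from the output of a geneset file (xxx.set) made by plink.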
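Commands referenced in the answers above:
# R: read the data file
lines <- readLines("yourdata.txt")
# shell: plink command that produced the data
plink --file mydata --make-set gene.list --write-set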