Is there a better way to do this in R?
My friend gave me a file today that looks like this:
ID genotype snp.id
1 PT86 CA 192902098
2 PT8 CA 192902098
3 PT33 TC 191571437
4 PT27 GA 191026838
5 PT2 TG 188482874
6 PT1 GC 186443061
7 PT70 GC 186443061
8 PT59 GA 185444226
9 PT48 GA 185152161
10 PT54 GA 185152161
11 PT18 GA 185152161
12 PT27 GA 185152161
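The real data has nearly 1,000 rows; I am showing only a 12-row sample here.
He asked me whether I could convert this file into the following format: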
id rs185152161 rs185444226 rs186443061 rs188482874 rs191026838 rs191571437 rs192902098
1 PT1 <NA> <NA> GC <NA> <NA> <NA> <NA>
2 PT18 GA <NA> <NA> <NA> <NA> <NA> <NA>
3 PT2 <NA> <NA> <NA> TG <NA> <NA> <NA>
and so on....
I then extracted the subset of the data for each snp.id and stored the subsets in a list matrix:
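# raw is the data frame shown above; snpids is assumed to hold the 7 distinct snp.id values
mat <- matrix(list(), ncol = 1, nrow = 13)
for (i in 1:7) {
  # keep the ID and genotype columns for the i-th SNP
  mat[[i, 1]] <- subset(raw, snp.id == snpids[[i]])[, 1:2]
  # name the genotype column after the SNP, e.g. rs185152161
  names(mat[[i, 1]]) <- c("id", paste("rs", snpids[[i]], sep = ""))
}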
Then I merged all the data frames I had extracted:
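# full outer join of the seven per-SNP data frames on their shared id column
df1 <- Reduce(function(x, y) merge(x, y, all = TRUE), mat[1:7, 1])
# keep the first row for each id
df2 <- df1[!duplicated(df1$id), ]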
So the data now looks like this:
id rs185152161 rs185444226 rs186443061 rs188482874 rs191026838 rs191571437 rs192902098
1 PT1 <NA> <NA> GC <NA> <NA> <NA> <NA>
2 PT18 GA <NA> <NA> <NA> <NA> <NA> <NA>
3 PT2 <NA> <NA> <NA> TG <NA> <NA> <NA>
4 PT27 GA <NA> <NA> <NA> GA <NA> <NA>
5 PT33 <NA> <NA> <NA> <NA> <NA> TC <NA>
6 PT48 GA <NA> <NA> <NA> <NA> <NA> <NA>
7 PT54 GA <NA> <NA> <NA> <NA> <NA> <NA>
8 PT59 <NA> GA <NA> <NA> <NA> <NA> <NA>
9 PT70 <NA> <NA> GC <NA> <NA> <NA> <NA>
10 PT8 <NA> <NA> <NA> <NA> <NA> <NA> CA
11 PT86 <NA> <NA> <NA> <NA> <NA> <NA> CA
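For reference, the subset-and-merge approach above can be made self-contained roughly as sketched below. The original does not show how `snpids` was created, so building it with `sort(unique(raw$snp.id))` is an assumption, as is the use of `lapply` in place of the list matrix; `pieces` and `wide` are illustrative names.

```r
# The 12-row sample from the question
raw <- data.frame(
  ID = c("PT86", "PT8", "PT33", "PT27", "PT2", "PT1",
         "PT70", "PT59", "PT48", "PT54", "PT18", "PT27"),
  genotype = c("CA", "CA", "TC", "GA", "TG", "GC",
               "GC", "GA", "GA", "GA", "GA", "GA"),
  snp.id = as.integer(c(192902098, 192902098, 191571437, 191026838,
                        188482874, 186443061, 186443061, 185444226,
                        185152161, 185152161, 185152161, 185152161)),
  stringsAsFactors = FALSE
)

# Assumption: snpids is the sorted set of distinct SNP ids
snpids <- sort(unique(raw$snp.id))

# One two-column data frame per SNP: id plus a genotype column named rs<snp.id>
pieces <- lapply(snpids, function(s) {
  piece <- raw[raw$snp.id == s, c("ID", "genotype")]
  names(piece) <- c("id", paste0("rs", s))
  piece
})

# Full outer join of all pieces on the shared "id" column
wide <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), pieces)
```

Passing `by = "id"` explicitly keeps the join key fixed even if two pieces happened to share another column name.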
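I would like to know whether there is a better way to do this that avoids these loops.
Try dcast (dat is the dataset); the call is shown after the comments below.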
For converting from long format to wide format, there is always reshape:
# reshape() with these arguments is stats::reshape from base R, so library(reshape) is not required
reshape(data, idvar = "ID", timevar = "snp.id", direction = "wide")
This gives:
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| ID | genotype.192902098 | genotype.191571437 | genotype.191026838 | genotype.188482874 | genotype.186443061 | genotype.185444226 | genotype.185152161 |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT86 | CA | NA | NA | NA | NA | NA | NA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT8 | CA | NA | NA | NA | NA | NA | NA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT33 | NA | TC | NA | NA | NA | NA | NA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT27 | NA | NA | GA | NA | NA | NA | GA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT2 | NA | NA | NA | TG | NA | NA | NA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT1 | NA | NA | NA | NA | GC | NA | NA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT70 | NA | NA | NA | NA | GC | NA | NA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT59 | NA | NA | NA | NA | NA | GA | NA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT48 | NA | NA | NA | NA | NA | NA | GA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT54 | NA | NA | NA | NA | NA | NA | GA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| PT18 | NA | NA | NA | NA | NA | NA | GA |
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
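As a rough, self-contained sketch of the base R `reshape` call above, the snippet below rebuilds the 12-row sample and then renames the `genotype.<snp.id>` columns to the `rs<snp.id>` form requested in the question. The name `dat` and the renaming step with `sub` are assumptions added for illustration; they are not part of the original answer.

```r
# The 12-row sample from the question
dat <- data.frame(
  ID = c("PT86", "PT8", "PT33", "PT27", "PT2", "PT1",
         "PT70", "PT59", "PT48", "PT54", "PT18", "PT27"),
  genotype = c("CA", "CA", "TC", "GA", "TG", "GC",
               "GC", "GA", "GA", "GA", "GA", "GA"),
  snp.id = as.integer(c(192902098, 192902098, 191571437, 191026838,
                        188482874, 186443061, 186443061, 185444226,
                        185152161, 185152161, 185152161, 185152161)),
  stringsAsFactors = FALSE
)

# Long to wide: one row per ID, one genotype column per snp.id
wide <- reshape(dat, idvar = "ID", timevar = "snp.id", direction = "wide")

# Assumption: rename genotype.<snp.id> to rs<snp.id> to match the requested layout
names(wide) <- sub("^genotype\\.", "rs", names(wide))

# Drop the row names carried over from the long data
rownames(wide) <- NULL
```

The columns come out in order of first appearance of each snp.id, so they can be reordered afterwards if a sorted layout is preferred.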
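In the genotype columns I get numbers instead of genotypes. I also get the following warning message: "Aggregation function missing: defaulting to length"
@dOctOr That is probably because there are multiple entries for each combination. My code was based on the example you provided. Could you show a very small example dataset that produces the warning?
@dOctOr I have updated the answer for cases with multiple entries per combination: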
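library(reshape2)  # assumed source of dcast()
# number the rows within each ID so repeated entries land on separate output rows
dat$indx <- with(dat, ave(seq_along(ID), ID, FUN=seq_along))
dcast(dat, ID+indx~snp.id, value.var="genotype")[,-2]  # [,-2] drops the helper indx column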
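The warning discussed in the comments appears when some ID/snp.id combination occurs more than once. A minimal base R sketch for checking that, and for seeing what the `indx` helper contains, might look like the following. The three-row data frame is hypothetical (it deliberately repeats one combination), and `counts` and `dups` are illustrative names.

```r
# Hypothetical toy data: PT1 has two entries for the same snp.id
dat <- data.frame(
  ID = c("PT1", "PT1", "PT2"),
  genotype = c("GC", "GA", "TG"),
  snp.id = c(186443061L, 186443061L, 188482874L),
  stringsAsFactors = FALSE
)

# Count rows per ID/snp.id combination; the genotype column holds the counts
counts <- aggregate(genotype ~ ID + snp.id, data = dat, FUN = length)

# Combinations that occur more than once
dups <- counts[counts$genotype > 1, ]

# Running row index within each ID, as used in the updated answer
dat$indx <- with(dat, ave(seq_along(ID), ID, FUN = seq_along))
```

If `dups` has no rows, every combination is unique and the helper index should not be needed.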