Match rows, create column groups, and sum ranks by column group in R
I have a huge dataset, roughly 30,000 rows by 17,000 columns, plus a character vector of gene names. Here is a dummy dataset that reproduces the structure of my data:
### Example
df <- data.frame(Gene=paste0("gene", 1:60), replicate(60, runif(60, min=0, max=100)))
colnames(df) <- c("GeneName", paste0("TisA.", 1:20), paste0("TisB.", 1:20), paste0("TisC.", 1:20))
genes <- sample(df$GeneName, 5)
head(df)
# GeneName TisA.1 TisA.2 TisA.3 TisA.4
#1 gene1 1.987621 17.936562 18.145417 59.43023
#2 gene2 60.031713 73.822846 93.946769 72.27633
#3 gene3 44.833748 47.890719 77.100497 39.45719
#4 gene4 44.662776 26.285659 30.087606 49.50682
#5 gene5 63.770411 6.469006 3.797708 68.17532
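However, I am stuck here: what is the fastest way to assign ranks to the genes in the vector and sum those ranks by group, i.e. TisA, TisB and TisC?
To clarify, each group has 20 samples: TisA.1, TisA.2, ..., TisA.20.
The desired output is: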
GeneName TisA TisB TisC
gene4 24 32 10 ## these are random values to show sum of ranks for each of genes in the vector
gene1 14 12 20 ## these are random values to show sum of ranks for each of genes in the vector
gene40 4 92 12 ## these are random values to show sum of ranks for each of genes in the vector
gene2 64 2 40 ## these are random values to show sum of ranks for each of genes in the vector
gene15 84 32 9 ## these are random values to show sum of ranks for each of genes in the vector
P.S. Some values in my real dataset can be 0, and values can be repeated across different columns.
You can use tidyverse:
# your data. Including seed to make it reproducible
set.seed(123)
df <- data.frame(Gene=paste0("gene", 1:60), replicate(60, runif(60, min=0, max=100)))
colnames(df) <- c("GeneName", paste0("TisA.", 1:20), paste0("TisB.", 1:20), paste0("TisC.", 1:20))
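library(tidyverse)
as_tibble(df) %>%                                            # as.tbl() is deprecated in current dplyr
  gather(key, value, -GeneName) %>%                          # long format: one row per gene/sample
  group_by(GeneName) %>%
  mutate(Ranks = rank(value, ties.method = "first")) %>%     # rank the 60 values within each gene
  separate(key, into = c("key1", "key2"), sep = "[.]") %>%   # "TisA.1" -> key1 = "TisA", key2 = "1"
  group_by(GeneName, key1) %>%
  summarise(Sum = sum(Ranks)) %>%                            # sum of ranks per gene and tissue
  spread(key1, Sum)                                          # back to wide: one column per tissue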
# A tibble: 60 x 4
# Groups: GeneName [60]
GeneName TisA TisB TisC
* <fctr> <int> <int> <int>
1 gene1 698 620 512
2 gene10 525 653 652
3 gene11 631 598 601
4 gene12 487 679 664
5 gene13 688 579 563
6 gene14 674 581 575
7 gene15 618 647 565
8 gene16 696 552 582
9 gene17 656 560 614
10 gene18 543 649 638
Or try a base R solution, which is a bit more convoluted:
df1 <- apply(df[-1], 1, rank, ties.method= "first")
df2 <- apply(df1, 2, function(x){
aggregate(x, list(sapply(strsplit(colnames(df), "[.]"), "[", 1)[-1]), sum)
})
df3 <- cbind.data.frame(df$GeneName, t(Reduce(cbind, lapply(df2, "[", 2))))
colnames(df3) <- c("GeneName", "TisA", "TisB", "TisC")
head(df3[order(df3$GeneName),])
GeneName TisA TisB TisC
gene1 698 620 512
gene10 525 653 652
gene11 631 598 601
gene12 487 679 664
gene13 688 579 563
gene14 674 581 575
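Neither solution above restricts the result to the genes in the `genes` vector, and the question asks for speed. A minimal sketch of one possible alternative follows: it subsets the matching rows with `match()` first, ranks within each row, and sums the ranks per group with base R's `rowsum()`. It assumes the group label is everything before the first dot in the column name, as in the dummy data. It has not been benchmarked on the full 30,000 x 17,000 dataset.

```r
# Rebuild the dummy data from the question (seed only for reproducibility)
set.seed(123)
df <- data.frame(Gene = paste0("gene", 1:60), replicate(60, runif(60, min = 0, max = 100)))
colnames(df) <- c("GeneName", paste0("TisA.", 1:20), paste0("TisB.", 1:20), paste0("TisC.", 1:20))
genes <- as.character(sample(df$GeneName, 5))

# Keep only the rows named in `genes`, in the order given by the vector
idx <- match(genes, as.character(df$GeneName))
m <- as.matrix(df[idx, -1])

# Rank the values within each gene (row); apply() returns a samples x genes matrix
r <- apply(m, 1, rank, ties.method = "first")

# Group label = part of the column name before the first dot (assumed naming scheme)
grp <- sub("[.].*$", "", colnames(m))

# rowsum() adds up the rows of `r` that share a group label (groups x genes),
# so transpose to get one row per gene
rank_sums <- data.frame(GeneName = genes, t(rowsum(r, grp)))
rownames(rank_sums) <- NULL
rank_sums
```

Because only the matched rows are ranked and the data stay in a numeric matrix, this avoids reshaping the whole table to long format, which may matter at the size described in the question.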
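What kind of groups are you talking about? Your genes are labelled 1-60 and you have 60 rows.
The groups are TisA, TisB and TisC, each with 20 elements, e.g. TisA.1, TisA.2, ..., TisA.20.
Thanks Jimbou. If the group information for the column names (i.e. TisA.1, TisA.2) were stored in a data.frame, and the actual column names in my dataset are a mix of letters and numbers, would there be a simpler way?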
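On the last comment: if group membership lives in a separate data.frame rather than in the column names, one possible approach is to look up each column's group with `match()` instead of splitting the names. The sketch below is illustrative only; the lookup table `sample_map`, its columns `sample` and `group`, and the toy column names are all hypothetical.

```r
# Toy data: column names are arbitrary letter/digit mixes (hypothetical)
set.seed(1)
df <- data.frame(GeneName = paste0("gene", 1:4),
                 matrix(runif(24, 0, 100), nrow = 4,
                        dimnames = list(NULL, c("a1X", "q7", "zz3", "b2Y", "k9", "m4"))))

# Hypothetical lookup table: one row per sample column, with its group
sample_map <- data.frame(sample = c("a1X", "q7", "zz3", "b2Y", "k9", "m4"),
                         group  = c("TisA", "TisA", "TisB", "TisB", "TisC", "TisC"),
                         stringsAsFactors = FALSE)

m <- as.matrix(df[-1])

# Look up each column's group; the row order of sample_map does not matter
grp <- sample_map$group[match(colnames(m), sample_map$sample)]
stopifnot(!anyNA(grp))  # every column must appear in the lookup table

# Rank within each gene (row), then sum the ranks per group
r <- apply(m, 1, rank, ties.method = "first")   # samples x genes
rank_sums <- data.frame(GeneName = df$GeneName, t(rowsum(r, grp)))
rank_sums
```

The same `grp` vector could also be fed to the `aggregate()` call in the base R answer in place of the `strsplit()` expression.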