R: Why doesn't this function rank the highest score first within each group of the cluster column?
I have a data frame dt, shown below:
kmeans sd1 sd2 score gene
B4GALNT1 1 1.138399 0.9302788 0.59238585 B4GALNT1
GATA2 1 1.31817 0.9869005 0.70160114 GATA2
KBTBD8 1 0.2799195 0.25295 2.56658313 KBTBD8
LYPD6 1 0.5885738 0.5277333 1.1797581 LYPD6
MSX1 1 0.2846179 0.5276349 1.31276755 MSX1
NAP1L2 1 0.5778767 0.5252137 1.29646305 NAP1L2
PLA2G4C 1 1.545634 0.3505845 1.02694161 PLA2G4C
SLC6A15 1 3.6862153 1.7656347 0.31940624 SLC6A15
SNORA9 1 49.5847239 23.059789 0.01679016 SNORA9
STX1A 1 4.753248 2.3649298 0.17053974 STX1A
TRNP1 1 54.1230886 19.7797807 0.01907904 TRNP1
AKAP6 2 2.7115279 0.1346139 1.12646609 AKAP6
C1QL3 2 3.1646016 0.3646613 0.78840387 C1QL3
CAMK2N1 2 48.4399203 3.628805 0.05655038 CAMK2N1
CDK5R1 2 3.3858407 0.2249831 0.6292364 CDK5R1
CLSTN2 2 1.0131585 0.162797 1.96050927 CLSTN2
CNTN1 2 3.7191809 0.253088 0.83650197 CNTN1
DGKG 2 0.4607949 0.2333855 1.70445926 DGKG
DPF1 2 1.6369965 0.1873143 1.07265653 DPF1
FAM131A 2 8.7092498 1.763698 0.11250896 FAM131A
I want to rank the rows within each group of the kmeans column by the score column, so that the highest score in each kmeans group gets rank 1, while keeping the row order shown above.
Expected output:
kmeans sd1 sd2 score gene rank
B4GALNT1 1 1.138399 0.9302788 0.59238585 B4GALNT1 7
GATA2 1 1.31817 0.9869005 0.70160114 GATA2 6
KBTBD8 1 0.2799195 0.25295 2.56658313 KBTBD8 1
LYPD6 1 0.5885738 0.5277333 1.1797581 LYPD6 4
MSX1 1 0.2846179 0.5276349 1.31276755 MSX1 2
NAP1L2 1 0.5778767 0.5252137 1.29646305 NAP1L2 3
PLA2G4C 1 1.545634 0.3505845 1.02694161 PLA2G4C 5
SLC6A15 1 3.6862153 1.7656347 0.31940624 SLC6A15 8
SNORA9 1 49.5847239 23.059789 0.01679016 SNORA9 11
STX1A 1 4.753248 2.3649298 0.17053974 STX1A 9
TRNP1 1 54.1230886 19.7797807 0.01907904 TRNP1 10
AKAP6 2 2.7115279 0.1346139 1.12646609 AKAP6 3
C1QL3 2 3.1646016 0.3646613 0.78840387 C1QL3 6
CAMK2N1 2 48.4399203 3.628805 0.05655038 CAMK2N1 9
CDK5R1 2 3.3858407 0.2249831 0.6292364 CDK5R1 7
CLSTN2 2 1.0131585 0.162797 1.96050927 CLSTN2 1
CNTN1 2 3.7191809 0.253088 0.83650197 CNTN1 5
DGKG 2 0.4607949 0.2333855 1.70445926 DGKG 2
DPF1 2 1.6369965 0.1873143 1.07265653 DPF1 4
FAM131A 2 8.7092498 1.763698 0.11250896 FAM131A 8
But this is not what I get when I apply the code below:
dt$rank <- unlist(with(dt, tapply(score, kmeans, function(x) rank(x,ties.method= "first"))))
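dt$rank
We can use ave instead of tapply for this. The advantage of ave is that it returns the result in the original row order. Because rank() assigns 1 to the smallest value, score is negated so that the highest score in each group is ranked first.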
dt$rank <- with(dt, ave(-score, kmeans, FUN = function(x) rank(x, ties.method = "first")))
dt$rank
#[1] 7 6 1 4 2 3 5 8 11 9 10 3 6 9 7 1 5 2 4 8
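To illustrate the row-order point, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not taken from the question) whose groups are deliberately interleaved. With unlist(tapply(...)) the ranks come back grouped by cluster, so they only line up with the rows when the data is already sorted by group; ave() writes each group's ranks back to the original row positions.

```r
# Hypothetical toy data: the two groups are interleaved on purpose
toy <- data.frame(kmeans = c(2L, 1L, 2L, 1L),
                  score  = c(0.5, 0.9, 0.8, 0.1))

# tapply() splits by group; unlist() then concatenates group 1, then group 2
by_group <- unlist(with(toy, tapply(-score, kmeans, rank, ties.method = "first")))

# ave() puts each group's ranks back at that group's original row positions
by_row <- with(toy, ave(-score, kmeans,
                        FUN = function(x) rank(x, ties.method = "first")))

unname(by_group)  # 1 2 2 1  (group order, misaligned with the rows of toy)
by_row            # 2 1 1 2  (aligned with the rows of toy)
```

In the question's dt the rows are already sorted by kmeans, so there the row order is not the issue; the missing negation of score is what reverses the ranks.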
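I think some of the ranks in your expected output are incorrect. For example, the ranks "9" and "8" for kmeans 2.
Can you point out where?
The score column is fine. The ranks I wrote may be incorrect, but the idea is to rank by the score column within each kmeans group, where rank 1 should be the highest score in that kmeans group.
I was referring to the ranks CAMK2N1 8 and FAM131A 9.
If it is the other way round, yes, sorry, I will reorder them.
Thanks @akrun, and sorry that my expected output was a mess, but I think I got my point across. I have edited it. I ranked them by hand only to show my intent.
ave does not rank in the order I am looking for. Ideally it should generate the ranks for kmeans=1 with the highest score first, and likewise for kmeans=2, 3, ... But the ave version does not do that.
@vchris__ngs I don't know why it is different for you; I do get the expected output, and I don't see why it shouldn't work.
In fact my data frame is larger, with more columns, and has row names that are identical to the gene column, but that should not break the ranking, right? I can subset the columns and try.
Let me check. Mine is 3.3.1; not sure what is happening. I am digging into it, but thanks for your input. Your code is correct for this case. I will accept it.
I renamed the column. In my actual df the column name was obj.down.4$kmeans$cluster, which is the kmeans column to group by. I changed it to kmeans, ran the operation, and it worked perfectly. I am not sure whether that name is what broke it. That may have been the problem, but I am not sure.
Data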
dt <- structure(list(kmeans = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), sd1 = c(1.138399,
1.31817, 0.2799195, 0.5885738, 0.2846179, 0.5778767, 1.545634,
3.6862153, 49.5847239, 4.753248, 54.1230886, 2.7115279, 3.1646016,
48.4399203, 3.3858407, 1.0131585, 3.7191809, 0.4607949, 1.6369965,
8.7092498), sd2 = c(0.9302788, 0.9869005, 0.25295, 0.5277333,
0.5276349, 0.5252137, 0.3505845, 1.7656347, 23.059789, 2.3649298,
19.7797807, 0.1346139, 0.3646613, 3.628805, 0.2249831, 0.162797,
0.253088, 0.2333855, 0.1873143, 1.763698), score = c(0.59238585,
0.70160114, 2.56658313, 1.1797581, 1.31276755, 1.29646305, 1.02694161,
0.31940624, 0.01679016, 0.17053974, 0.01907904, 1.12646609, 0.78840387,
0.05655038, 0.6292364, 1.96050927, 0.83650197, 1.70445926, 1.07265653,
0.11250896), gene = c("B4GALNT1", "GATA2", "KBTBD8", "LYPD6",
"MSX1", "NAP1L2", "PLA2G4C", "SLC6A15", "SNORA9", "STX1A", "TRNP1",
"AKAP6", "C1QL3", "CAMK2N1", "CDK5R1", "CLSTN2", "CNTN1", "DGKG",
"DPF1", "FAM131A")), .Names = c("kmeans", "sd1", "sd2", "score",
"gene"), class = "data.frame", row.names = c("B4GALNT1", "GATA2",
"KBTBD8", "LYPD6", "MSX1", "NAP1L2", "PLA2G4C", "SLC6A15", "SNORA9",
"STX1A", "TRNP1", "AKAP6", "C1QL3", "CAMK2N1", "CDK5R1", "CLSTN2",
"CNTN1", "DGKG", "DPF1", "FAM131A"))