R 使用ggplot2将显著性水平添加到矩阵相关热图中

R 使用ggplot2将显著性水平添加到矩阵相关热图中,r,ggplot2,correlation,heatmap,significance,R,Ggplot2,Correlation,Heatmap,Significance,我想知道,除了R2值(-1到1)之外,如何在矩阵相关热图中添加另一层重要且必要的复杂性,例如重要性水平恒星之后的p值? 本问题的目的不是将显著性水平星号或p值作为文本放在矩阵的每个正方形上,而是在矩阵的每个正方形上以开箱即用的图形表示显著性水平。我认为,只有那些享受创新思维祝福的人才能赢得掌声,解开这种解决方案,以便有最好的方式将复杂性的增加部分表示到我们的“真理矩阵关联热图的一半”。我在谷歌上搜索了很多次,但从来没有看到过一个合适的,或者我可以说是一个“眼睛友好”的方式来表示显著性水平加上反映

我想知道,除了R2值(-1到1)之外,如何在矩阵相关热图中添加另一层重要且必要的复杂性,例如重要性水平恒星之后的p值?
本问题的目的不是将显著性水平星号或p值作为文本放在矩阵的每个正方形上,而是在矩阵的每个正方形上以开箱即用的图形表示显著性水平。我认为,只有那些享受创新思维祝福的人才能赢得掌声,解开这种解决方案,以便有最好的方式将复杂性的增加部分表示到我们的“真理矩阵关联热图的一半”。我在谷歌上搜索了很多次,但从来没有看到过一个合适的,或者我可以说是一个“眼睛友好”的方式来表示显著性水平加上反映R系数的标准色度。
可复制数据集可在此处找到:

R代码如下所示:

library(ggplot2)
library(plyr) # might be not needed here anyway it is a must-have package I think in R 
library(reshape2) # to "melt" your dataset
library (scales) # it has a "rescale" function which is needed in heatmaps 
library(RColorBrewer) # for convenience of heatmap colors, it reflects your mood sometimes
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
nba <- as.data.frame(cor(nba[2:ncol(nba)])) # convert the matrix correlations to a dataframe 
nba.m <- data.frame(row=rownames(nba),nba) # create a column called "row"
rownames(nba) <- NULL #get rid of row names
nba <- melt(nba)
nba.m$value<-cut(nba.m$value,breaks=c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),include.lowest=TRUE,label=c("(-0.75,-1)","(-0.5,-0.75)","(-0.25,-0.5)","(0,-0.25)","(0,0.25)","(0.25,0.5)","(0.5,0.75)","(0.75,1)")) # this can be customized to put the correlations in categories using the "cut" function with appropriate labels to show them in the legend, this column now would be discrete and not continuous
nba.m$row <- factor(nba.m$row, levels=rev(unique(as.character(nba.m$variable)))) # reorder the "row" column which would be used as the x axis in the plot after converting it to a factor and ordered now
#now plotting
ggplot(nba.m, aes(row, variable)) +
geom_tile(aes(fill=value),colour="black") +
scale_fill_brewer(palette = "RdYlGn",name="Correlation")  # here comes the RColorBrewer package, now if you ask me why did you choose this palette colour I would say look at your battery charge indicator of your mobile for example your shaver, won't be red when gets low? and back to green when charged? This was the inspiration to choose this colour set.
库(ggplot2)
图书馆(plyr)#无论如何,这里可能不需要它。我认为这是R中的必备软件包
库(重塑2)#用于“融化”数据集
库(缩放)#它具有热图所需的“重缩放”功能
图书馆(RColorBrewer)#为了方便使用热图颜色,它有时会反映你的情绪

nba这只是为了提高最终解决方案,我在这里绘制了星星作为解决方案的指示器,但正如我所说的,目的是找到一个比星星更好的图形解决方案。我刚刚用geom_point和alpha来表示显著性水平,但NAs(也包括非显著性值)会像三星级的显著性水平一样出现问题,如何解决这个问题?我认为,当使用多种颜色时,使用一种颜色可能会更利于眼睛,并避免在情节中增加许多细节以供眼睛解决。提前感谢。
以下是我第一次尝试的情节:

还是这样更好

我认为到目前为止最好的是下面的,直到你想出更好的办法!

根据要求,以下代码适用于最后一张热图:

# Function to get the probability into a whole matrix not half, here is Spearman you can change it to Kendall or Pearson
cor.prob.all <- function (X, dfr = nrow(X) - 2) {
R <- cor(X, use="pairwise.complete.obs",method="spearman")
r2 <- R^2
Fstat <- r2 * dfr/(1 - r2)
R<- 1 - pf(Fstat, 1, dfr)
R[row(R) == col(R)] <- NA
R
}
# Change matrices to dataframes
nbar<- as.data.frame(cor(nba[2:ncol(nba)]),method="spearman") # to a dataframe for r^2
nbap<- as.data.frame(cor.prob.all(nba[2:ncol(nba)])) # to a dataframe for p values
# Reset rownames
nbar <- data.frame(row=rownames(nbar),nbar) # create a column called "row" 
rownames(nbar) <- NULL
nbap <- data.frame(row=rownames(nbap),nbap) # create a column called "row" 
rownames(nbap) <- NULL
# Melt
nbar.m <- melt(nbar)
nbap.m <- melt(nbap)
# Classify (you can classify differently for nbar and for nbap also)         
nbar.m$value2<-cut(nbar.m$value,breaks=c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1),include.lowest=TRUE, label=c("(-0.75,-1)","(-0.5,-0.75)","(-0.25,-0.5)","(0,-0.25)","(0,0.25)","(0.25,0.5)","(0.5,0.75)","(0.75,1)")) # the label for the legend
nbap.m$value2<-cut(nbap.m$value,breaks=c(-Inf, 0.001, 0.01, 0.05),label=c("***", "** ", "*  ")) 
nbar.m<-cbind.data.frame(nbar.m,nbap.m$value,nbap.m$value2) # adding the p value and its cut to the first dataset of R coefficients
names(nbar.m)[5]<-paste("valuep") # change the column names of the dataframe 
names(nbar.m)[6]<-paste("signif.")
nbar.m$row <- factor(nbar.m$row, levels=rev(unique(as.character(nbar.m$variable)))) # reorder the variable factor
# Plotting the matrix correlation heatmap
# Set options for a blank panel
po.nopanel <-list(opts(panel.background=theme_blank(),panel.grid.minor=theme_blank(),panel.grid.major=theme_blank()))
pa<-ggplot(nbar.m, aes(row, variable)) +
geom_tile(aes(fill=value2),colour="white") +
scale_fill_brewer(palette = "RdYlGn",name="Correlation")+ # RColorBrewer package
opts(axis.text.x=theme_text(angle=-90))+
po.nopanel
pa # check the first plot
# Adding the significance level stars using geom_text 
pp<- pa +
geom_text(aes(label=signif.),size=2,na.rm=TRUE) # you can play with the size
# Workaround for the alpha aesthetics if it is good to represent significance level, the same workaround can be applied for size aesthetics in ggplot2 as well. Applying the alpha aesthetics to show significance is a little bit problematic, because we want the alpha to be low while the p value is high, and vice verse which can't be done without a workaround
nbar.m$signif.<-rescale(as.numeric(nbar.m$signif.),to=c(0.1,0.9)) # I tried to use to=c(0.1,0.9) argument as you might expect, but to avoid problems with the next step of reciprocal values when dividing over one, this is needed for the alpha aesthetics as a workaround
nbar.m$signif.<-as.factor(0.09/nbar.m$signif.) # the alpha now behaves as wanted  except for the NAs values stil show as if with three stars level, how to fix that?
# Adding the alpha aesthetics in geom_point in a shape of squares (you can improve here)
pp<- pa +
geom_point(data=nbar.m,aes(alpha=signif.),shape=22,size=5,colour="darkgreen",na.rm=TRUE,legend=FALSE) # you can remove this step, the result of this step is seen in one of the layers in the above green heatmap, the shape used is 22 which is again a square but the size you can play with it accordingly  
#函数将概率转化为整个矩阵而不是一半,这里是斯皮尔曼,您可以将其更改为肯德尔或皮尔森

cor.prob.all要表示沿估计相关系数的显著性,您可以改变着色量-使用
alpha
或仅填充每个瓷砖的子集:

# install.packages("fdrtool")
# install.packages("data.table")
library(ggplot2)
library(data.table)

#download dataset
nba <- as.matrix(read.csv("http://datasets.flowingdata.com/ppg2008.csv")[-1])
m <- ncol(nba)
# compute corellation and p.values for all combinations of columns
dt <- CJ(i=seq_len(m), j=seq_len(m))[i<j]
dt[, c("p.value"):=(cor.test(nba[,i],nba[,j])$p.value), by=.(i,j)]
dt[, c("corr"):=(cor(nba[,i],nba[,j])), by=.(i,j)]

# estimate local false discovery rate
dt[,lfdr:=fdrtool::fdrtool(p.value, statistic="pvalue")$lfdr]

dt <- rbind(dt, setnames(copy(dt),c("i","j"),c("j","i")), data.table(i=seq_len(m),j=seq_len(m), corr=1, p.value=0, lfdr=0))


#use alpha
ggplot(dt, aes(x=i,y=j, fill=corr, alpha=1-lfdr)) + 
  geom_tile()+
  scale_fill_distiller(palette = "RdYlGn", direction=1, limits=c(-1,1),name="Correlation") +
  scale_x_continuous("variable", breaks = seq_len(m), labels = colnames(nba)) +
  scale_y_continuous("variable", breaks = seq_len(m), labels = colnames(nba), trans="reverse") +
  coord_fixed() +
  theme(axis.text.x=element_text(angle=90, vjust=0.5),
        panel.background=element_blank(),
        panel.grid.minor=element_blank(),
        panel.grid.major=element_blank(),
  )

这里的关键是p值的缩放:为了获得仅在相关区域显示较大变化的易于解释的值,我使用了由
fdrtools
提供的局部错误发现(lfdr)的上界估计值。 即,平铺的alpha值可能小于或等于该相关性不同于0的概率。

库(“corrplot”)
library("corrplot")
nba <- as.matrix(read.csv("https://raw.githubusercontent.com/Shicheng-Guo/Shicheng-Guo.Github.io/master/data/ppg2008.csv")[-1])
res1 <- cor.mtest(nba, conf.level = .95)
par(mfrow=c(2,2))

# correlation and P-value
corrplot(cor(nba), p.mat = res1$p, insig = "label_sig",sig.level = c(.001, .01, .05), pch.cex = 0.8, pch.col = "white",tl.cex=0.8)

# correlation and hclust
corrplot(cor(nba), method = "shade", outline = T, addgrid.col = "darkgray", order="hclust", 
         mar = c(4,0,4,0), addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", 
         tl.cex = 0.8, cl.cex = 0.8)

nba你的代码启发我用ggplot2重写
arm::corrplot
函数:它工作得很好!您能否扩展此函数以产生那些不显著的相关性(例如,感谢您的反馈。我对此处(和其他地方)使用$p$-值表示怀疑),但我将尝试找出一些标记不重要系数的方法。上面引用的函数现在是包的一部分,由包的维护人员进行更正和添加。其中是带c(-1,-0.75,-0.5,-0.25,0,0.25,0.5,0.75,1)的(-1,-0.75)颜色,我们应该有8个间隔和8种颜色,而不是7…奇妙的情节,但人们应该意识到,在相关系数热图的上下文中,p值可能并不意味着他们所期望的(或任何东西)。如果你有这么多可能的相关性($n^2-n/2$),甚至大于.99的p值实际上开始变得有些可能。在此上下文中过度依赖p值可能被另一个名称视为p-hacking。这很好地解释了这一点。opts()早就被弃用了。请使用主题()选项。
#use area
ggplot(dt, aes(x=i,y=j, fill=corr,  height=sqrt(1-lfdr),  width=sqrt(1-lfdr))) + 
  geom_tile()+
  scale_fill_distiller(palette = "RdYlGn", direction=1, limits=c(-1,1),name="Correlation") +
  scale_color_distiller(palette = "RdYlGn", direction=1, limits=c(-1,1),name="Correlation") +
  scale_x_continuous("variable", breaks = seq_len(m), labels = colnames(nba)) +
  scale_y_continuous("variable", breaks = seq_len(m), labels = colnames(nba), trans="reverse") +
  coord_fixed() +
  theme(axis.text.x=element_text(angle=90, vjust=0.5),
        panel.background=element_blank(),
        panel.grid.minor=element_blank(),
        panel.grid.major=element_blank(),
  )
library("corrplot")
nba <- as.matrix(read.csv("https://raw.githubusercontent.com/Shicheng-Guo/Shicheng-Guo.Github.io/master/data/ppg2008.csv")[-1])
res1 <- cor.mtest(nba, conf.level = .95)
par(mfrow=c(2,2))

# correlation and P-value
corrplot(cor(nba), p.mat = res1$p, insig = "label_sig",sig.level = c(.001, .01, .05), pch.cex = 0.8, pch.col = "white",tl.cex=0.8)

# correlation and hclust
corrplot(cor(nba), method = "shade", outline = T, addgrid.col = "darkgray", order="hclust", 
         mar = c(4,0,4,0), addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", 
         tl.cex = 0.8, cl.cex = 0.8)