如何使用FactoMineR包以编程方式确定主成分的列索引?

如何使用FactoMineR包以编程方式确定主成分的列索引?,r,cluster-analysis,pca,feature-selection,unsupervised-learning,R,Cluster Analysis,Pca,Feature Selection,Unsupervised Learning,给定包含混合变量(即分类变量和连续变量)的数据帧,如 变量df.princomp是一个列表 然后,将我使用的主要组件可视化 fviz_screeplot()和fviz_contrib()类似 #library(factoextra) factoextra::fviz_screeplot(df.princomp, addlabels = TRUE, barfill = "gray", barcolor = "black",

给定包含混合变量(即分类变量和连续变量)的数据帧,如

变量
df.princomp
是一个列表

然后,将我使用的主要组件可视化
fviz_screeplot()
fviz_contrib()
类似

#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = 1, top = 10, sort.val = c("desc"))
它给出了下面的图1

图二

图1的说明:图1是一个树状图。Scree图是一种简单的线段图,显示数据中总方差的分数,如各主成分(PC)所解释或表示的。因此,我们可以看到前三个PC共同负责总方差的
43.8%
。现在,问题自然而然地出现了,“这些变量是什么?”。这一点我已经在图2中展示了

图2解释:此图显示了主成分分析(PCA)结果中行/列的贡献。从这里我可以看到变量,
name
studLoc
finalMark
是可用于进一步分析的最重要变量

进一步分析-我所处的位置:
name
studLoc
finalMark
,得出上述变量的贡献。我使用主成分变量
df.princomp
(见上文)如
df.princomp$quanti.var$contrib[,4]
df.princomp$quali.var$contrib[,2:3]

我必须手动指定列索引
[,2:3]
[,4]

我想要什么:我想知道如何进行动态列索引分配,这样我就不必手动编码列表中的列索引
[,2:3]


我已经研究了以下类似的问题,但找不到我的解决方案?解决这个问题的任何帮助或建议都会很有帮助。

有很多方法可以提取个人变量对PC的贡献。对于数字输入,可以使用
prcomp
运行PCA,并查看
$rotation
(我很快谈到了,但忘了这里有一些因素,所以
prcomp
无法直接工作)。由于您使用的是
factoextra::fviz_contrib
,因此检查该函数如何在引擎盖下提取此信息是有意义的。键
factoextra::fviz_contrib
并读取函数:

> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var", 
    "quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue", 
    color = "steelblue", sort.val = c("desc", "asc", "none"), 
    top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(), 
    ...) 
{
    sort.val <- match.arg(sort.val)
    choice = match.arg(choice)
    title <- .build_title(choice[1], "Contribution", axes)
    dd <- facto_summarize(X, element = choice, result = "contrib", 
        axes = axes)
    contrib <- dd$contrib
    names(contrib) <- rownames(dd)
    theo_contrib <- 100/length(contrib)
    if (length(axes) > 1) {
        eig <- get_eigenvalue(X)[axes, 1]
        theo_contrib <- sum(theo_contrib * eig)/sum(eig)
    }
    df <- data.frame(name = factor(names(contrib), levels = names(contrib)), 
        contrib = contrib)
    if (choice == "quanti.var") {
        df$Groups <- .get_quanti_var_groups(X)
        if (missing(fill)) 
            fill <- "Groups"
        if (missing(color)) 
            color <- "Groups"
    }
    p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill, 
        color = color, sort.val = sort.val, top = top, main = title, 
        xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt, 
        ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib, 
        linetype = 2, color = "red")
    p
}
<environment: namespace:factoextra>
这是对应于图2的表格。对于PC2,使用轴=2等

关于“如何以编程方式确定PC的列索引”,我不能100%确定我是否理解您的要求,但如果您只想说“finalmark”列,请抓住它对PC3的贡献,您可以执行以下操作:

library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")

# get the contribution of column 'finalmark' by name
contribution_df %>%
  filter(name == "finalmark")

# get the contribution of column 'finalmark' to PC3
contribution_df %>%
  filter(name == "finalmark" & PC == 3)

# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
库(tidyverse)
#将原始df中的所有列名及其对所有PC的贡献制作一个整洁的表格
出资额_df%
过滤器(名称=“最终标记”)
#获取“finalmark”列对PC3的贡献
贡献_df%>%
过滤器(名称=“finalmark”&PC==3)
#或者,只是贡献的数值
筛选(贡献度,名称=“最终标记”&个人计算机==3)$contrib

顺便说一句,我认为在你的例子中,
ID
被视为数字,而不是因子,但因为这只是一个例子,我不想麻烦它。

不确定我对你的问题的解释是否正确,如果不正确,请道歉。从我收集的信息来看,您正在使用PCA作为初始工具,向您展示在解释数据集时哪些变量最重要。然后,您希望返回原始数据,快速选择这些变量,而无需每次手动编码,并将其用于其他分析

如果这是正确的,那么我保存了贡献图中的数据,过滤出贡献最大的变量,并使用该结果创建了一个仅包含这些变量的新数据框

digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than, say, 20

r<-rownames(dat[dat$contrib>20,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

#finalmark name    studLoc
#1         53    b POTYQ0002N
#2         73    i LWMTW1195I
#3         95    d VTUGO1685F
#4         39    f YCGGS5755N
#5         97    c GOSWE3283C
#6         58    g APBQD6181U
#7         67    a VUJOG1460V
#8         64    h YXOGP1897F
#9         15    j NFUOB6042V
#10        81    e QYTHG0783G
数字=0:9
#为再现性设定种子
种子(17)
#函数创建随机字符串

createRandString那么,为了清楚起见,这里想要的结果到底是什么?图2中的立柱高度?@MikkoMarttila感谢您的关注。这很有帮助。我已经更新了这个问题,以便其他人能更好地理解它(不仅仅是我:)。我相信这是一个微不足道的答案,但我似乎无法理解。所以你是说你想要一些函数
f(pc1,pc2)
,如果pc1是第二个组件,pc2是第三个组件,那么你会得到
df.princomp$quanti.var$contrib[,2:3]
df.princomp$quali.var$contrib[,2:3]
,您的代码不是完全可复制的,它需要一个随机种子。当我运行它时,我会稍微有所不同pcas@Ashish您是否认为,
df.princomp$quanti.var$contrib[,4]
是finalMark对Dim-1的贡献?如果是这样,我认为这种解释是不正确的,相反,finalMark对Dim-1的贡献可以在这里找到
df.princomp$quanti.var$contrib[“finalMark”,“Dim.1”]
,同样,name和studLoc对Dim-1的贡献可以在
df.princomp$quali.var$contrib[1:10,“Dim.1”]>%sum()找到
df.princomp$quali.var$contrib[11:20,“Dim.1”]%>%sum()
。注意,这些是上面图2的前三列。感谢您提供查看函数结构的指针。这很有帮助。请编辑您的答案(如果可能的话),包括“提取个人变量对个人电脑贡献的多种方法”。这将对我和其他感兴趣的人非常有帮助。接下来,您能否详细说明变量
dd
中包含的“如何以编程方式确定PCs的列索引”,并将其映射到原始数据帧?谢谢。更新了答案。这有用吗?我想我还是不完全明白你的意思
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
               name    contrib
ID               ID  0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark  7.1874438
subj2mark subj2mark 16.6831560
name           name 26.8610132
studLoc     studLoc 26.8610132
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")

# get the contribution of column 'finalmark' by name
contribution_df %>%
  filter(name == "finalmark")

# get the contribution of column 'finalmark' to PC3
contribution_df %>%
  filter(name == "finalmark" & PC == 3)

# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
                 studLoc=sample(createRandString(10)),
                 finalmark=sample(c(0:100),10),
                 subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)

df.princomp <- FactoMineR::FAMD(df, graph = FALSE)

factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than, say, 20

r<-rownames(dat[dat$contrib>20,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new

#finalmark name    studLoc
#1         53    b POTYQ0002N
#2         73    i LWMTW1195I
#3         95    d VTUGO1685F
#4         39    f YCGGS5755N
#5         97    c GOSWE3283C
#6         58    g APBQD6181U
#7         67    a VUJOG1460V
#8         64    h YXOGP1897F
#9         15    j NFUOB6042V
#10        81    e QYTHG0783G
#top contributors to both Dim 1 and 2

f<-factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = c(1,2), top = 10, sort.val = c("desc"))

#save data from contribution plot
dat<-f$data

#filter out ID's that are higher than 5

r<-rownames(dat[dat$contrib>5,])

#extract these from your original data frame into a new data frame for further analysis

new<-df[r]

new