如何使用FactoMineR包以编程方式确定主成分的列索引？_R_Cluster Analysis_Pca_Feature Selection_Unsupervised Learning

如何使用FactoMineR包以编程方式确定主成分的列索引？

如何使用FactoMineR包以编程方式确定主成分的列索引？,r,cluster-analysis,pca,feature-selection,unsupervised-learning,R,Cluster Analysis,Pca,Feature Selection,Unsupervised Learning,给定包含混合变量（即分类变量和连续变量）的数据帧，如变量df.princomp是一个列表然后，将我使用的主要组件可视化 fviz_screeplot（）和fviz_contrib（）类似 #library(factoextra) factoextra::fviz_screeplot(df.princomp, addlabels = TRUE, barfill = "gray", barcolor = "black",

给定包含混合变量（即分类变量和连续变量）的数据帧，如

变量

df.princomp

是一个列表

然后，将我使用的主要组件可视化

fviz_screeplot（）

和

fviz_contrib（）

类似

#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component", 
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")

factoextra::fviz_contrib(df.princomp, choice = "var", 
                         axes = 1, top = 10, sort.val = c("desc"))

它给出了下面的图1

图二

图1的说明：图1是一个树状图。Scree图是一种简单的线段图，显示数据中总方差的分数，如各主成分（PC）所解释或表示的。因此，我们可以看到前三个PC共同负责总方差的

43.8%

。现在，问题自然而然地出现了，“这些变量是什么？”。这一点我已经在图2中展示了

图2解释：此图显示了主成分分析（PCA）结果中行/列的贡献。从这里我可以看到变量，

name

、

studLoc

和

finalMark

是可用于进一步分析的最重要变量

进一步分析-我所处的位置：
name
，
studLoc
，
finalMark
，得出上述变量的贡献。我使用主成分变量
df.princomp
（见上文）如
df.princomp$quanti.var$contrib[，4]
和
df.princomp$quali.var$contrib[，2:3]
我必须手动指定列索引
[，2:3]
和
[，4]
我想要什么：我想知道如何进行动态列索引分配，这样我就不必手动编码列表中的列索引
[，2:3]

我已经研究了以下类似的问题，但找不到我的解决方案？解决这个问题的任何帮助或建议都会很有帮助。
有很多方法可以提取个人变量对PC的贡献。对于数字输入，可以使用
prcomp
运行PCA，并查看
$rotation
（我很快谈到了，但忘了这里有一些因素，所以
prcomp
无法直接工作）。由于您使用的是
factoextra:：fviz_contrib
，因此检查该函数如何在引擎盖下提取此信息是有意义的。键
factoextra:：fviz_contrib
并读取函数：

> factoextra::fviz_contrib function (X, choice = c("row", "col", "var", "ind", "quanti.var", "quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue", color = "steelblue", sort.val = c("desc", "asc", "none"), top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(), ...) { sort.val <- match.arg(sort.val) choice = match.arg(choice) title <- .build_title(choice[1], "Contribution", axes) dd <- facto_summarize(X, element = choice, result = "contrib", axes = axes) contrib <- dd$contrib names(contrib) <- rownames(dd) theo_contrib <- 100/length(contrib) if (length(axes) > 1) { eig <- get_eigenvalue(X)[axes, 1] theo_contrib <- sum(theo_contrib * eig)/sum(eig) } df <- data.frame(name = factor(names(contrib), levels = names(contrib)), contrib = contrib) if (choice == "quanti.var") { df$Groups <- .get_quanti_var_groups(X) if (missing(fill)) fill <- "Groups" if (missing(color)) color <- "Groups" } p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill, color = color, sort.val = sort.val, top = top, main = title, xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt, ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib, linetype = 2, color = "red") p } <environment: namespace:factoextra>
这是对应于图2的表格。对于PC2，使用轴=2等
关于“如何以编程方式确定PC的列索引”，我不能100%确定我是否理解您的要求，但如果您只想说“finalmark”列，请抓住它对PC3的贡献，您可以执行以下操作：

library(tidyverse) # make a tidy table of all column names in the original df with their contributions to all PCs contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC") # get the contribution of column 'finalmark' by name contribution_df %>% filter(name == "finalmark") # get the contribution of column 'finalmark' to PC3 contribution_df %>% filter(name == "finalmark" & PC == 3) # or, just the numeric value of contribution filter(contribution_df, name == "finalmark" & PC == 3)$contrib

库（tidyverse） #将原始df中的所有列名及其对所有PC的贡献制作一个整洁的表格出资额_df% 过滤器（名称=“最终标记”） #获取“finalmark”列对PC3的贡献贡献_df%>% 过滤器（名称=“finalmark”&PC==3） #或者，只是贡献的数值筛选（贡献度，名称=“最终标记”&个人计算机==3）$contrib

顺便说一句，我认为在你的例子中，
ID
被视为数字，而不是因子，但因为这只是一个例子，我不想麻烦它。
不确定我对你的问题的解释是否正确，如果不正确，请道歉。从我收集的信息来看，您正在使用PCA作为初始工具，向您展示在解释数据集时哪些变量最重要。然后，您希望返回原始数据，快速选择这些变量，而无需每次手动编码，并将其用于其他分析
如果这是正确的，那么我保存了贡献图中的数据，过滤出贡献最大的变量，并使用该结果创建了一个仅包含这些变量的新数据框

digits = 0:9 # set seed for reproducibility set.seed(17) # function to create random string createRandString <- function(n = 5000) { a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE)) paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE)) } df <- data.frame(ID=c(1:10), name=sample(letters[1:10]), studLoc=sample(createRandString(10)), finalmark=sample(c(0:100),10), subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10) ) df.princomp <- FactoMineR::FAMD(df, graph = FALSE) factoextra::fviz_screeplot(df.princomp, addlabels = TRUE, barfill = "gray", barcolor = "black", ylim = c(0, 50), xlab = "Principal Component", ylab = "Percentage of explained variance", main = "Principal Component (PC) for mixed variables") #find the top contributing variables to the overall variation in the dataset #here I am choosing the top 10 variables (although we only have 6 in our df). #note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2) f<-factoextra::fviz_contrib(df.princomp, choice = "var", axes = c(1), top = 10, sort.val = c("desc")) #save data from contribution plot dat<-f$data #filter out ID's that are higher than, say, 20 r<-rownames(dat[dat$contrib>20,]) #extract these from your original data frame into a new data frame for further analysis new<-df[r] new #finalmark name studLoc #1 53 b POTYQ0002N #2 73 i LWMTW1195I #3 95 d VTUGO1685F #4 39 f YCGGS5755N #5 97 c GOSWE3283C #6 58 g APBQD6181U #7 67 a VUJOG1460V #8 64 h YXOGP1897F #9 15 j NFUOB6042V #10 81 e QYTHG0783G

数字=0:9 #为再现性设定种子种子（17） #函数创建随机字符串 createRandString那么，为了清楚起见，这里想要的结果到底是什么？图2中的立柱高度？@MikkoMarttila感谢您的关注。这很有帮助。我已经更新了这个问题，以便其他人能更好地理解它（不仅仅是我：）。我相信这是一个微不足道的答案，但我似乎无法理解。所以你是说你想要一些函数f（pc1，pc2），如果pc1是第二个组件，pc2是第三个组件，那么你会得到df.princomp$quanti.var$contrib[，2:3] 和df.princomp$quali.var$contrib[，2:3] ，您的代码不是完全可复制的，它需要一个随机种子。当我运行它时，我会稍微有所不同pcas@Ashish您是否认为，df.princomp$quanti.var$contrib[，4] 是finalMark对Dim-1的贡献？如果是这样，我认为这种解释是不正确的，相反，finalMark对Dim-1的贡献可以在这里找到df.princomp$quanti.var$contrib[“finalMark”，“Dim.1”] ，同样，name和studLoc对Dim-1的贡献可以在df.princomp$quali.var$contrib[1:10，“Dim.1”]>%sum（）找到和df.princomp$quali.var$contrib[11:20，“Dim.1”]%>%sum（）。注意，这些是上面图2的前三列。感谢您提供查看函数结构的指针。这很有帮助。请编辑您的答案（如果可能的话），包括“提取个人变量对个人电脑贡献的多种方法”。这将对我和其他感兴趣的人非常有帮助。接下来，您能否详细说明变量dd中包含的“如何以编程方式确定PCs的列索引”，并将其映射到原始数据帧？谢谢。更新了答案。这有用吗？我想我还是不完全明白你的意思 > dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1) > dd name contrib ID ID 0.9924561 finalmark finalmark 21.4149175 subj1mark subj1mark 7.1874438 subj2mark subj2mark 16.6831560 name name 26.8610132 studLoc studLoc 26.8610132 library(tidyverse) # make a tidy table of all column names in the original df with their contributions to all PCs contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC") # get the contribution of column 'finalmark' by name contribution_df %>% filter(name == "finalmark") # get the contribution of column 'finalmark' to PC3 contribution_df %>% filter(name == "finalmark" & PC == 3) # or, just the numeric value of contribution filter(contribution_df, name == "finalmark" & PC == 3)$contrib digits = 0:9 # set seed for reproducibility set.seed(17) # function to create random string createRandString <- function(n = 5000) { a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE)) paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE)) } df <- data.frame(ID=c(1:10), name=sample(letters[1:10]), studLoc=sample(createRandString(10)), finalmark=sample(c(0:100),10), subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10) ) df.princomp <- FactoMineR::FAMD(df, graph = FALSE) factoextra::fviz_screeplot(df.princomp, addlabels = TRUE, barfill = "gray", barcolor = "black", ylim = c(0, 50), xlab = "Principal Component", ylab = "Percentage of explained variance", main = "Principal Component (PC) for mixed variables") #find the top contributing variables to the overall variation in the dataset #here I am choosing the top 10 variables (although we only have 6 in our df). #note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2) f<-factoextra::fviz_contrib(df.princomp, choice = "var", axes = c(1), top = 10, sort.val = c("desc")) #save data from contribution plot dat<-f$data #filter out ID's that are higher than, say, 20 r<-rownames(dat[dat$contrib>20,]) #extract these from your original data frame into a new data frame for further analysis new<-df[r] new #finalmark name studLoc #1 53 b POTYQ0002N #2 73 i LWMTW1195I #3 95 d VTUGO1685F #4 39 f YCGGS5755N #5 97 c GOSWE3283C #6 58 g APBQD6181U #7 67 a VUJOG1460V #8 64 h YXOGP1897F #9 15 j NFUOB6042V #10 81 e QYTHG0783G #top contributors to both Dim 1 and 2 f<-factoextra::fviz_contrib(df.princomp, choice = "var", axes = c(1,2), top = 10, sort.val = c("desc")) #save data from contribution plot dat<-f$data #filter out ID's that are higher than 5 r<-rownames(dat[dat$contrib>5,]) #extract these from your original data frame into a new data frame for further analysis new<-df[r] new