Extracting from SparkR summary()
I have a question about the summary() method for random forest regression in SparkR. The model-building step runs fine, but I'm interested in one of the algorithm's outputs: the feature importances. I'd like to store the feature-importance values in a SparkDataFrame so I can visualize them, but I don't know how to transfer/extract them.
model <- spark.randomForest(x1, x2, x3, type = "regression", maxDepth = 30, maxBins = 50, numTrees = 50, impurity = "variance", featureSubsetStrategy = "all")
summaryRF <- summary(model)
summaryRF$feature:
1. 'x1'
2. 'x2'
3. 'x3'
summaryRF$featureImportances:
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'
summaryRF is no longer a SparkDataFrame, which is why collect doesn't work :)

summaryRF$featureImportances is a string (on the Spark side it is a SparseVector, which currently (as of version 2.1.0) cannot be serialized over to R — I assume that is why it gets cast to a string).

As far as I can tell, you have to extract the relevant bits by manipulating the string directly:
# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[","",summaryRF$featureImportances),"\\],\\[")
# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)","",x),","))
# it's now a list of lists, but you can make this into a dataframe if you like:
fimpDF <- as.data.frame(do.call(cbind,(fimp[[1]])))
featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = c(0:(length(summaryRF$features)-1)),
                                  stringsAsFactors = FALSE)
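To see the whole pipeline end to end, here is a minimal, Spark-free sketch that parses the example SparseVector string shown earlier into one numeric importance per feature. The sample string and the feature names (`x1`, `x2`, `x3`) are taken from the output above; everything else is plain base R, so it runs without a Spark session.

```r
# Example values copied from the summary() output shown above
featureImportances <- "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"
features <- c("x1", "x2", "x3")

# Strip the leading "(n,[" and split on "],[" to separate indexes from values
parts <- strsplit(gsub("\\(.*?\\[", "", featureImportances), "\\],\\[")[[1]]
indexes <- as.integer(strsplit(parts[1], ",")[[1]])
values  <- as.numeric(strsplit(gsub("\\]\\)", "", parts[2]), ",")[[1]])

# A SparseVector only lists its non-zero entries, so fill a zero vector
# at the listed positions (Spark indexes are 0-based, R's are 1-based)
importance <- numeric(length(features))
importance[indexes + 1] <- values

importanceDF <- data.frame(featureName = features,
                           featureImportance = importance,
                           stringsAsFactors = FALSE)
importanceDF
```

importanceDF is an ordinary local data.frame, so you can plot it directly or, if you really need it Spark-side, convert it back with createDataFrame(importanceDF).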