SparkR summary(): extracting featureImportances


I have a question about the summary() method for random forest regression in SparkR. The model-building process runs fine, but I am interested in the feature importances, one of the algorithm's outputs. I would like to store the feature importance values in a SparkDataFrame so that I can visualize them, but I don't know how to transfer/extract them.

# df: the training SparkDataFrame, with response y and predictors x1, x2, x3
model <- spark.randomForest(df, y ~ x1 + x2 + x3, type = "regression",
                            maxDepth = 30, maxBins = 50, numTrees = 50,
                            impurity = "variance", featureSubsetStrategy = "all")

summaryRF <- summary(model)

summaryRF$features:
1. 'x1'
2. 'x2'
3. 'x3'

summaryRF$featureImportances: 
'(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])'

The model summary summaryRF is no longer a SparkDataFrame, which is why collect doesn't work :)
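
A quick check from the R console makes this concrete (illustrative):

# summary() hands back a plain local R object, so the distributed
# SparkDataFrame API (collect() and friends) simply does not apply here
class(summaryRF)    # not "SparkDataFrame"
is.list(summaryRF)  # TRUE: an ordinary R list of components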

summaryRF$featureImportances is a string (on the Spark side it is a SparseVector, which currently (as of 2.1.0) cannot be serialized to/from R, which I guess is why it gets cast to a string).
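
For reference, the string follows the (size,[indices],[values]) layout that Spark uses when printing a SparseVector, so the pieces line up with the output above like this (annotation only):

s <- summaryRF$featureImportances
# "(3,[0,1,2],[0.01324152135,0.0545454422,0.0322122334])"
#   3              -> vector size (the number of features)
#   [0,1,2]        -> feature indexes (0-based, matching summaryRF$features)
#   [0.0132..., …] -> the importance values, in index order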

As far as I can tell, you have to extract the relevant bits by manipulating the string directly:

# extract the feature indexes and feature importances strings:
fimpList <- strsplit(gsub("\\(.*?\\[","",summaryRF$featureImportances),"\\],\\[")

# split the index and feature importances strings into vectors (and remove "])" from the last record):
fimp <- lapply(fimpList, function(x) strsplit(gsub("\\]\\)","",x),","))

# it's now a list of lists, but you can make this into a dataframe if you like:
fimpDF <- as.data.frame(do.call(cbind,(fimp[[1]])))
featureNameAndIndex <- data.frame(featureName = unlist(summaryRF$features),
                                  featureIndex = c(0:(length(summaryRF$features) - 1)),
                                  stringsAsFactors = FALSE)
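
From here, a short continuation (a sketch, untested) can type the parsed columns, join the feature names on the shared index, and, since the goal was a SparkDataFrame for visualization, push the small local frame back to Spark with createDataFrame():

# name and type the parsed columns (strsplit left everything as character)
names(fimpDF) <- c("featureIndex", "featureImportance")
fimpDF$featureIndex      <- as.integer(as.character(fimpDF$featureIndex))
fimpDF$featureImportance <- as.numeric(as.character(fimpDF$featureImportance))

# attach the feature names via the index, then hand the result back to Spark
importanceDF  <- merge(featureNameAndIndex, fimpDF, by = "featureIndex")
importanceSDF <- createDataFrame(importanceDF)
head(importanceSDF)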