PySpark random forest feature importance: how to get column names from column numbers


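I am using the standard (string indexer + one-hot encoder + random forest) pipeline in Spark, as shown below:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator

labelIndexer = StringIndexer(inputCol=class_label_name, outputCol="indexedLabel").fit(data)
string_feature_indexers = [
    StringIndexer(inputCol=x, outputCol="int_{0}".format(x)).fit(data)
    for x in char_col_toUse_names
]
onehot_encoder = [
    OneHotEncoder(inputCol="int_" + x, outputCol="onehot_{0}".format(x))
    for x in char_col_toUse_names
]
all_columns = num_col_toUse_names + bool_col_toUse_names + ["onehot_" + x for x in char_col_toUse_names]
assembler = VectorAssembler(inputCols=[col for col in all_columns], outputCol="features")
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=100)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer] + string_feature_indexers + onehot_encoder + [assembler, rf, labelConverter])
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
cvModel = crossval.fit(trainingData)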

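Now, after fitting, I can get the random forest model and its feature importances using

cvModel.bestModel.stages[-2].featureImportances

but this does not give me feature/column names, only the feature numbers.

What I get is something like this:

print(cvModel.bestModel.stages[-2].featureImportances)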
(1446,[3,4,9,18,20,103,766,981,983,1098,1121,1134,1148,1227,1288,1345,1436,1444],[0.109898803421,0.0967396441648,4.24568235244e-05,0.0369705839109,0.0163489685127,3.2286694534e-06,0.0208192703688,0.081582887175,0.0466903663708,0.0227619959989,0.0850922269211,0.000113388956,0.092477940403,0.163838071392,0.107373548695]

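How can I map this back to column names, or to a column name + value format?

Basically, I want the random forest feature importances together with the column names.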

Hey, why not just map it back to the original columns with a list comprehension? Here is an example:

# in your case: trainingData.columns
data_frame_columns = ["A", "B", "C", "D", "E", "F"]
# in your case: cvModel.bestModel.stages[-2].featureImportances
feature_importance = (1, [1, 3, 5], [0.5, 0.5, 0.5])

rf_output = [(data_frame_columns[i], importance) for i, importance in zip(feature_importance[1], feature_importance[2])]
dict(rf_output)

{'B': 0.5, 'D': 0.5, 'F': 0.5}
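The positional lookup above only lines up when every DataFrame column occupies exactly one slot of the feature vector. A minimal sketch of a guard for that assumption follows; `map_importances` is a hypothetical helper, not part of Spark, and it takes the vector size, indices and values as plain Python values.

```python
def map_importances(columns, size, indices, values):
    """Map sparse importances to column names by position.

    Hypothetical helper: refuses to map when the vector has a different
    number of slots than there are column names, which is what happens
    once one-hot encoding expands a column into several slots.
    """
    if size != len(columns):
        raise ValueError(
            "feature vector has {0} slots but {1} column names were given".format(
                size, len(columns)
            )
        )
    return {columns[i]: v for i, v in zip(indices, values)}


# Six columns, six slots: the direct lookup is safe.
mapped = map_importances(["A", "B", "C", "D", "E", "F"], 6, [1, 3, 5], [0.5, 0.5, 0.5])
print(mapped)

# Far more slots than columns (as in the question): the guard raises.
try:
    map_importances(["A", "B"], 1446, [3], [0.1])
except ValueError as err:
    print(err)
```

With a fitted model, the size, indices and values would come from the importance vector itself rather than from literals.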

I could not find any way to get the real initial list of columns back after the ML algorithm, so I am using this as my current workaround:

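print(len(cols_now))

FEATURE_COLS = []
for x in cols_now:
    if x[-6:] != "catVar":
        # not a one-hot encoded column: keep the name as-is
        FEATURE_COLS += [x]
    else:
        # x is "<name>_catVar": x[:-7] is the original column, x[:-6] + "tmp" its indexed column
        temp = trainingData.select([x[:-7], x[:-6] + "tmp"]).distinct().sort(x[:-6] + "tmp")
        temp_list = temp.select(x[:-7]).collect()
        # one feature name per category value, in index order
        FEATURE_COLS += [list(row)[0] for row in temp_list]

print(len(FEATURE_COLS))
print(FEATURE_COLS)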

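I kept a consistent suffix naming across all the indexers (_tmp) and encoders (_catVar).

This can be improved and generalized further, but for now this tedious workaround is what works best.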
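One possible generalization, sketched here in plain Python: instead of querying the data for distinct values, expand each assembler input column into one name per vector slot from the category labels. The function name, the example columns, and the label dictionary are all hypothetical; in a real pipeline the labels could be read from each fitted `StringIndexerModel.labels`. The `drop_last` flag mirrors the `dropLast=True` default of Spark's `OneHotEncoder`, which omits the last category.

```python
def expand_feature_names(input_cols, category_labels, drop_last=True):
    """Return one name per slot of the assembled feature vector.

    input_cols: assembler input columns, in order.
    category_labels: dict mapping a one-hot column name to the ordered
        labels of its string indexer (hypothetical input).
    drop_last: whether the encoder omitted the last category.
    """
    names = []
    for col in input_cols:
        labels = category_labels.get(col)
        if labels is None:
            # numeric or boolean column: occupies a single slot
            names.append(col)
        else:
            kept = labels[:-1] if drop_last else labels
            names.extend("{0}={1}".format(col, label) for label in kept)
    return names


names = expand_feature_names(
    ["age", "onehot_color"],
    {"onehot_color": ["red", "green", "blue"]},
)
print(names)
```

The resulting list can then be indexed with the feature numbers from `featureImportances`, provided the expansion matches how the encoders were actually configured.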

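The metadata of the transformed dataset has the required attributes:

Create a pandas DataFrame (the feature list is usually not huge, so holding it in a pandas DataFrame causes no memory issues):

import pandas as pd

attrs = dataset.schema["features"].metadata["ml_attr"]["attrs"]
pandasDF = pd.DataFrame(attrs["binary"] + attrs["numeric"]).sort_values("idx")

Then create a broadcast dictionary for the mapping. Broadcasting is necessary in a distributed environment:

feature_dict = dict(zip(pandasDF["idx"], pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)

Yes, but you missed the point that the column names change after the StringIndexer/OneHotEncoder. I want to map to the column names as combined by the assembler. I can certainly do it the long way, but I am more interested in whether Spark (ML) has some shorter way, like scikit-learn does for the same :)

Ah okay, my bad. But the long way should still work. I don't think there is a short solution at the moment. The Spark ML API is not as powerful and verbose as the scikit-learn one.

Yes, I know :), I just wanted to keep the question open for suggestions :).

Thanks DatAbishek, how did you do this? What exactly is this?

This should be the correct answer: it is concise and effective. Thanks!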
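The metadata lookup above assumes that both the "binary" and "numeric" attribute groups exist. A minimal sketch that walks whatever groups are present and ranks the importances by name is shown below. It runs on a plain dictionary shaped like `dataset.schema["features"].metadata["ml_attr"]`; the helper names and the sample metadata are hypothetical.

```python
def feature_names_from_metadata(ml_attr):
    """Build {slot index: feature name} from every attribute group present."""
    index_to_name = {}
    for group in ml_attr["attrs"].values():  # e.g. "numeric", "binary", "nominal"
        for attr in group:
            index_to_name[attr["idx"]] = attr["name"]
    return index_to_name


def rank_importances(index_to_name, indices, values):
    """Pair each feature name with its importance, largest first."""
    pairs = [(index_to_name[i], v) for i, v in zip(indices, values)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical metadata, standing in for the real schema metadata.
ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"}],
        "binary": [
            {"idx": 1, "name": "onehot_color_red"},
            {"idx": 2, "name": "onehot_color_green"},
        ],
    },
    "num_attrs": 3,
}

index_to_name = feature_names_from_metadata(ml_attr)
ranked = rank_importances(index_to_name, [0, 2], [0.25, 0.75])
print(ranked)
```

With a fitted model, the indices and values would be taken from the sparse `featureImportances` vector, and the lookup runs on the driver.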