Python 3.x: Random forest with Spark: getting the predicted values and R²


python-3.x apache-spark


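I am using Spark's MLlib to perform a random forest regression, with Python code. It works, but now I would like to get the predicted values of the prediction model, as well as R or R². How can I get them?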

Here is how I load the csv file into an RDD (Spark's data format):
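A minimal sketch of such a loading step, under stated assumptions: the file has no header, is comma-separated, holds the label in the first column and numeric features in the remaining columns. The names `parse_line` and `load_csv_as_rdd` are hypothetical and are not the asker's code; `sc` is assumed to be an existing `SparkContext`.

```python
def parse_line(line):
    """Turn one comma-separated line into a (label, features) tuple.

    Assumption: first column is the label, the rest are numeric features.
    """
    values = [float(v) for v in line.split(",")]
    return values[0], values[1:]


def load_csv_as_rdd(sc, path):
    """Read a header-less CSV file into an RDD of LabeledPoint."""
    # Imported here so that parse_line can be tried without Spark installed.
    from pyspark.mllib.regression import LabeledPoint

    lines = sc.textFile(path)
    return lines.map(lambda line: LabeledPoint(*parse_line(line)))
```

The resulting RDD can be passed as `data` to a function that expects `LabeledPoint` records with `label` and `features` attributes.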

Here is how I run the random forest algorithm and how I get the predicted values:

from pyspark.mllib.tree import RandomForest

def random_forest_regression(data):
    """
    Run the random forest (regression) algorithm on the data to perform the prediction
    """
    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Increase numTrees to get a better prediction
    model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                        numTrees=100, featureSubsetStrategy="auto",
                                        impurity='variance', maxDepth=10, maxBins=32)

    # Predict on the TEST instances and pair each real label with its prediction
    # (these pairs are not used further in this function)
    predictions_test = model.predict(testData.map(lambda x: x.features))
    real_and_predicted_test = testData.map(lambda lp: lp.label).zip(predictions_test)

    # Get the list of real and predicted values FOR ALL THE POINTS
    predictions = model.predict(data.map(lambda x: x.features))
    real_and_predicted = data.map(lambda lp: lp.label).zip(predictions)
    real_and_predicted = real_and_predicted.collect()
    print("real and predicted values")
    for value in real_and_predicted:
        print(value)

    return model, real_and_predicted
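As a possible alternative to computing the score by hand, PySpark versions that ship `pyspark.mllib.evaluation.RegressionMetrics` can report R² for an RDD of (prediction, observation) pairs. The sketch below assumes the test-set RDD `real_and_predicted_test` is made available outside the function (for example by returning it); the helper names `to_prediction_observation` and `r2_on_test_set` are hypothetical.

```python
def to_prediction_observation(pair):
    """Swap a (real, predicted) pair into (predicted, real) order.

    RegressionMetrics expects the prediction first, the observation second.
    """
    real, predicted = pair
    return float(predicted), float(real)


def r2_on_test_set(real_and_predicted_test):
    """Return R² for an RDD of (real, predicted) pairs."""
    # Imported lazily so the helper above works without Spark installed.
    from pyspark.mllib.evaluation import RegressionMetrics

    pairs = real_and_predicted_test.map(to_prediction_observation)
    metrics = RegressionMetrics(pairs)
    return metrics.r2
```

Scoring only the held-out test pairs avoids the optimistic bias of evaluating on points the model was trained on.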

To get the correlation coefficient (the R value), I used numpy:

import numpy

def compute_correlation_coefficient(real_and_predicted):
    """
    compute and display the correlation coefficient from a list of real and predicted values
    """
    real_values = []
    predicted_values = []
    # Split the (real, predicted) pairs into two parallel lists
    for real, predicted in real_and_predicted:
        real_values.append(real)
        predicted_values.append(predicted)
    print("correlation coefficient")
    print(numpy.corrcoef(real_values, predicted_values)[0, 1])

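To get R², take the square of the correlation coefficient. Voilà.

Note that this squared correlation coincides with the coefficient of determination only for an ordinary least-squares linear fit with an intercept; for other models, such as a random forest, the two can differ.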
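To make the two quantities concrete, here is a minimal pure-Python sketch that computes both the squared Pearson correlation and the coefficient of determination, taken here as 1 - SS_res/SS_tot, from a list of (real, predicted) pairs. The function names and the toy data are illustrative assumptions, not part of the original code.

```python
import math


def pearson_r(pairs):
    """Pearson correlation between real and predicted values."""
    real, predicted = zip(*pairs)
    n = len(real)
    mean_r = sum(real) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in pairs)
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_r * var_p)


def coefficient_of_determination(pairs):
    """R² defined as 1 - SS_res / SS_tot."""
    real = [r for r, _ in pairs]
    mean_r = sum(real) / len(real)
    ss_res = sum((r - p) ** 2 for r, p in pairs)
    ss_tot = sum((r - mean_r) ** 2 for r in real)
    return 1.0 - ss_res / ss_tot


# Toy data: every prediction is off by exactly +1
pairs = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
print("squared correlation:", pearson_r(pairs) ** 2)
print("coefficient of determination:", coefficient_of_determination(pairs))
```

On this toy data the predictions are perfectly correlated with the real values, yet the constant offset lowers the coefficient of determination well below 1, so the choice of definition matters when reporting R² for a model.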

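Do you need the coefficient of determination of the prediction model?

Either the coefficient of determination (R²) or the correlation coefficient (R), whichever. In fact, if I get the list of predicted values, I can compute it with a formula.

There is no direct way to get it from Spark; you have to compute it.

What about the predicted values? How do I get them? I would like to store the real and predicted values in a csv file.

Map/reduce the training data RDD into an RDD of (real, predicted value) pairs, then you can save the RDD. This is a very basic operation.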
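A minimal sketch of the saving step, assuming the (real, predicted) pairs have already been collected into a Python list, as `real_and_predicted` is in the question. The function name `save_pairs_to_csv` and the column headers are hypothetical.

```python
import csv


def save_pairs_to_csv(real_and_predicted, path):
    """Write (real, predicted) pairs to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["real", "predicted"])
        for real, predicted in real_and_predicted:
            writer.writerow([real, predicted])
```

For data too large to collect on the driver, one option is to map each pair of the RDD to a comma-separated string and call `saveAsTextFile` on the result, which writes a directory of part files rather than a single csv file.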