Problem with analysis using the decision tree algorithm in Spark and Python

python machine-learning apache-spark pyspark

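I am doing a customer churn analysis for the telecom industry, and I have a sample dataset. Below is the code I wrote to use the decision tree algorithm in Spark through Python. The dataset has multiple columns, and I select only the columns I need for my feature set.
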
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import os.path
import numpy as np


inputPath = os.path.join('file1.csv')
file_name = os.path.join(inputPath)
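# Skip the header row (index 0). Indexing into the pair avoids
# tuple-unpacking lambdas, which are a syntax error in Python 3.
data = (sc.textFile(file_name)
    .zipWithIndex()
    .filter(lambda pair: pair[1] > 0)
    .map(lambda pair: pair[0]))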


# Label is 1 when column 5 is 'True', else 0; columns 6 and 7 are the features
final_data = (data
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1)
    .map(lambda line: LabeledPoint(1 if line[5] == 'True' else 0,
                                   [line[6], line[7]])))

(trainingdata, testdata) = final_data.randomSplit([0.7, 0.3])

model = DecisionTree.trainRegressor(trainingdata, categoricalFeaturesInfo={},
                                    impurity='variance', maxDepth=5, maxBins=32)

predictions = model.predict(testdata.map(lambda x: x.features))
prediction = predictions.collect()

labelsAndPredictions = testdata.map(lambda lp: lp.label).zip(predictions)

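This code works fine and makes predictions, but what I am missing is an identifier for each customer in the prediction set or in testdata. My dataset has a customerid column (column number 4), which I do not select at the moment because it is not a feature the model should consider. I am struggling to associate this customerid column with the customers whose details are in testdata. If I add the column to the feature vector I build in the LabeledPoint, it causes an error because it is not a feature value.

How can I include this column in my analysis so that I can get the top 50 customers with the highest churn?

You can do it in exactly the same way you add the label after prediction.

A small helper: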
customerIndex = ... # Put index of the column

def extract(line):
    """Given a line create a tuple (customerId, labeledPoint)"""
    label = 1 if line[5] == 'True' else 0
    point = LabeledPoint(label, [line[6], line[7]])
    customerId = line[customerIndex]
    return (customerId, point)
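
To see what the helper returns without starting Spark, here is a minimal plain-Python sketch. `Point` is a hypothetical stand-in for `LabeledPoint`, the sample row is invented, and `customerIndex = 4` is only an assumption based on the question's mention of column number 4.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.mllib.regression.LabeledPoint,
# so that the sketch runs without Spark.
Point = namedtuple("Point", ["label", "features"])

customerIndex = 4  # assumption: position of the customerid column


def extract(line):
    """Given a split CSV row, return a tuple (customerId, point)."""
    label = 1 if line[5] == 'True' else 0
    point = Point(label, [float(line[6]), float(line[7])])
    return (line[customerIndex], point)


# Invented sample row; columns 0-3 are not used here.
row = "a,b,c,d,C-1001,True,12.5,3".split(",")
customer_id, point = extract(row)
print(customer_id, point.label, point.features)
```

The id travels next to the point rather than inside its feature vector, so the model never sees it.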

Prepare the data using the extract function:
final_data = (data
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1)
    .map(extract)) # Map to tuples

Train:
# As before
(trainingdata, testdata) = final_data.randomSplit([0.7, 0.3])

# Use only points, put the rest of the arguments in place of ...
model = DecisionTree.trainRegressor(trainingdata.map(lambda x: x[1]), ...)

Predict:
# Make predictions using points
predictions = model.predict(testdata.map(lambda x: x[1].features))

# Add customer id and label
labelsIdsAndPredictions = (testdata
    .map(lambda x: (x[0], x[1].label))
    .zip(predictions))
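
The `zip` step pairs records purely by position, so the ids line up with the predictions only because both RDDs are derived from the same `testdata` through `map`. The plain-Python sketch below mirrors that pairing with ordinary lists; the ids, labels, and scores are invented for illustration.

```python
# Invented test records in the (customerId, label) shape built above.
ids_and_labels = [("C-1", 1), ("C-2", 0), ("C-3", 1)]

# Invented model outputs, in the same order as the records.
scores = [0.9, 0.1, 0.6]

# Positional pairing, analogous to RDD.zip:
# each element becomes ((customerId, label), prediction).
labels_ids_and_predictions = list(zip(ids_and_labels, scores))

for (customer_id, label), prediction in labels_ids_and_predictions:
    print(customer_id, label, prediction)
```

In Spark itself, `RDD.zip` expects both RDDs to have the same number of partitions and the same number of elements in each partition, which is why the predictions should be zipped with something mapped from the same `testdata`.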

Extract the top 50:
top50 = labelsIdsAndPredictions.top(50, key=lambda x: x[1])
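
Because each record has the shape `((customerId, label), prediction)`, the key `x[1]` ranks by the predicted value. Here is a minimal sketch of the same selection on a local list, using `heapq.nlargest` as a stand-in for `RDD.top`; the records are invented.

```python
import heapq

# Invented records in the ((customerId, label), prediction) shape.
records = [
    (("C-1", 1), 0.9),
    (("C-2", 0), 0.1),
    (("C-3", 1), 0.6),
    (("C-4", 0), 0.75),
]

# Keep the two records with the highest predicted value, largest first.
top2 = heapq.nlargest(2, records, key=lambda x: x[1])

# Pull out just the customer ids of the selected records.
top_ids = [customer_id for (customer_id, _label), _prediction in top2]
print(top_ids)
```

Using `key=lambda x: x[0][1]` instead would rank by the true label rather than by the prediction, which is probably not what is wanted here.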

Could you please show me how to do this? I am new to this.