How do I pass a numeric feature with a large number of unique values to the random forest regression algorithm in PySpark MLlib?


python apache-spark pyspark


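I have a dataset with a numeric feature column that has a large number of unique values (on the order of 10000). My understanding is that when we build a random forest regression model in PySpark, we pass a parameter maxBins, which should be at least equal to the largest number of unique values among all features. So if I pass 10000 as the maxBins value, the algorithm cannot handle the load: it either fails or runs seemingly forever. How can I pass such a feature to the model? I have read in a few places about binning the values into buckets and then passing those buckets to the model, but I don't know how to do that in PySpark. Can someone show sample code for this? My current code is:
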
    def parse(line):
        # line[6] and line[8] are feature columns with many unique values; line[12] is the numeric label
        return (line[1], line[3], line[4], line[5], line[6], line[8], line[11], line[12])

    # sc is an existing SparkContext. Python 3 lambdas cannot unpack tuples,
    # so index into the (line, rownum) pair instead.
    # Note: rownum >= 0 keeps every row; use rownum > 0 to skip a header line.
    lines = (sc.textFile('file1.csv')
        .zipWithIndex()
        .filter(lambda pair: pair[1] >= 0)
        .map(lambda pair: pair[0]))

    parsed_data = (lines
        .map(lambda line: line.split(","))
        .filter(lambda line: len(line) > 1)
        .map(parse))

    # Split the input data into training and test sets with a 70%-30% ratio
    (train_data, test_data) = parsed_data.randomSplit([0.7, 0.3])

    label_col = "x7"


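    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.sql.functions import col

    # Convert the RDD to a DataFrame. x4 and x5 are the columns with many unique values
    train_data_df = train_data.toDF(("x0","x1","x2","x3","x4","x5","x6","x7"))

    # Indexers encode strings as doubles
    string_indexers = [
        StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
        for x in train_data_df.columns if x != label_col
    ]

    # Assemble multiple columns into a single vector
    assembler = VectorAssembler(
        inputCols=["idx_{0}".format(x) for x in train_data_df.columns if x != label_col],
        outputCol="features"
    )

    pipeline = Pipeline(stages=string_indexers + [assembler])
    model = pipeline.fit(train_data_df)
    indexed = model.transform(train_data_df)

    # DataFrame.map was removed in Spark 2.0, so go through .rdd;
    # Vectors.fromML (Spark 2.0+) converts the pyspark.ml vector to the
    # pyspark.mllib type that LabeledPoint expects
    label_points = (indexed
        .select(col(label_col).cast("float").alias("label"), col("features"))
        .rdd
        .map(lambda row: LabeledPoint(row.label, Vectors.fromML(row.features))))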

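It would be very helpful if someone could provide sample code showing how to modify the code above to bin the two numeric feature columns with many unique values.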
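"we pass a parameter maxBins, which should be at least equal to the largest number of unique values among all features"

That is not true. maxBins must be greater than or equal to the maximum number of categories among the categorical features. Numeric features are treated as continuous, so the number of unique values they hold does not constrain it. You still have to tune this parameter to get the performance you want, but beyond that there is nothing else to do here.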