Random Forest Regression for categorical inputs on PySpark


I have been trying to build a simple random forest regression model on PySpark. I have decent machine learning experience in R. However, ML on PySpark seems completely different to me, especially when it comes to handling categorical variables, string indexing, and OneHotEncoding (when there were only numeric variables, I was able to perform RF regression by following examples). While there are many examples available for handling categorical variables, such as and , I have had no success with them, since most of them were beyond my understanding (probably because of my unfamiliarity with Python ML). I will be grateful to anyone who can help fix this.

Here is my attempt:

The output is:

DataFrame[ID: int, Country: string, Carrier: double, TrafficType: string, ClickDate: timestamp, Device: string, Browser: string, OS: string, RefererUrl: string, UserIp: string, ConversionStatus: string, ConversionDate: string, ConversionPayOut: string, publisherId: string, subPublisherId: string, advertiserCampaignId: double, Fraud: double]
Next, I select the variables of interest:

IMP = ["Country","Carrier","TrafficType","Device","Browser","OS","Fraud","ConversionPayOut"]
train = train.fillna("XXX")
train = train.select([column for column in train.columns if column in IMP])
from pyspark.sql.types import DoubleType
train = train.withColumn("ConversionPayOut", train["ConversionPayOut"].cast("double"))
train.cache()
The output is:

DataFrame[Country: string, Carrier: double, TrafficType: string, Device: string, Browser: string, OS: string, ConversionPayOut: double, Fraud: double]
My dependent variable is ConversionPayOut, which was previously a string type and has now been cast to double.

This is where my confusion starts: based on , I know that I have to convert my categorical string-type variables into one-hot encoded vectors. Here is my attempt at that:

First, the string indexing:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(train) for column in list(set(train.columns)-set(['Carrier','ConversionPayOut','Fraud'])) ]
pipeline = Pipeline(stages=indexers)
train_catind = pipeline.fit(train).transform(train)
train_catind.show()

The output of the string indexing:

+-------+-------+-----------+-------+--------------+-------+------------------+-----+-----------------+-------------+-------------+--------+------------+
|Country|Carrier|TrafficType| Device|       Browser|     OS|  ConversionPayOut|Fraud|TrafficType_index|Country_index|Browser_index|OS_index|Device_index|
+-------+-------+-----------+-------+--------------+-------+------------------+-----+-----------------+-------------+-------------+--------+------------+
|     TH|   20.0|          A|   Lava|        chrome|Android|              41.6|  0.0|              0.0|          1.0|          0.0|     0.0|         7.0|
|     BR|  217.0|          A|     LG|        chrome|Android|        26.2680574|  0.0|              0.0|          2.0|          0.0|     0.0|         5.0|
|     TH|   20.0|          A|Generic|        chrome|Android|              41.6|  0.0|              0.0|          1.0|          0.0|     0.0|         0.0|

Next, I think, I have to do the one-hot encoding of the string indexes:

from pyspark.ml.feature import OneHotEncoder, StringIndexer
indexers_ON = [OneHotEncoder(inputCol=column, outputCol=column+"_Vec") for column in filter(lambda x: x.endswith('_index'), train_catind.columns) ]
pipeline = Pipeline(stages=indexers_ON)
train_OHE = pipeline.fit(train_catind).transform(train_catind)
train_OHE.show()

The output after one-hot encoding looks like this:

+-------+-------+-----------+-------+--------------+-------+------------------+-----+-----------------+-------------+-------------+--------+------------+---------------------+-----------------+-----------------+-------------+----------------+
|Country|Carrier|TrafficType| Device|       Browser|     OS|  ConversionPayOut|Fraud|TrafficType_index|Country_index|Browser_index|OS_index|Device_index|TrafficType_index_Vec|Country_index_Vec|Browser_index_Vec| OS_index_Vec|Device_index_Vec|
+-------+-------+-----------+-------+--------------+-------+------------------+-----+-----------------+-------------+-------------+--------+------------+---------------------+-----------------+-----------------+-------------+----------------+
|     TH|   20.0|          A|   Lava|        chrome|Android|              41.6|  0.0|              0.0|          1.0|          0.0|     0.0|         7.0|        (1,[0],[1.0])|    (9,[1],[1.0])|    (5,[0],[1.0])|(1,[0],[1.0])|  (15,[7],[1.0])|
|     BR|  217.0|          A|     LG|        chrome|Android|        26.2680574|  0.0|              0.0|          2.0|          0.0|     0.0|         5.0|        (1,[0],[1.0])|    (9,[2],[1.0])|    (5,[0],[1.0])|(1,[0],[1.0])|  (15,[5],[1.0])|
|     TH|   20.0|          A|Generic|        chrome|Android|              41.6|  0.0|              0.0|          1.0|          0.0|     0.0|         0.0|        (1,[0],[1.0])|    (9,[1],[1.0])|    (5,[0],[1.0])|(1,[0],[1.0])|  (15,[0],[1.0])|

I don't know how to proceed from here. In fact, I don't even know which Spark machine learning packages require us to do this one-hot encoding and which ones don't.


If the StackOverflow community could clarify how to move forward, it would be a really good learning for all the newbies to PySpark.

To run a random forest on the preprocessed data, you can proceed with the following code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Use VectorAssembler to combine all the feature columns into a single vector column
assemblerInputs = ["Carrier", "Fraud", "Country_index", "TrafficType_index", "Device_index", "Browser_index", "OS_index"]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
pipeline = Pipeline(stages=[assembler])
# train_catind is the string-indexed DataFrame built in the question
df = pipeline.fit(train_catind).transform(train_catind)
df = df.withColumn("label", df["ConversionPayOut"])

# Randomly split the data into training and test datasets
(train_data, test_data) = df.randomSplit([0.7, 0.3], seed=111)

# Train the random forest model
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
rf_model = rf.fit(train_data)

# Make predictions on the test data
predictions = rf_model.transform(test_data)


Hope this helps!
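One follow-up note: since ConversionPayOut is a continuous value, a regression variant of the model may be closer to what the question asks for; also, tree-based models in spark.ml generally do not require the one-hot encoding step (they can split on the string-indexed columns directly), while linear models do. A minimal sketch along those lines, assuming the same df, train_data and test_data as in the snippet above:

from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Random forest regression on the same assembled features
rf_reg = RandomForestRegressor(labelCol="label", featuresCol="features", numTrees=20)
rf_reg_model = rf_reg.fit(train_data)

# Predict on the held-out split and report RMSE
reg_predictions = rf_reg_model.transform(test_data)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
print(evaluator.evaluate(reg_predictions))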

Here is a comprehensive end-to-end example (the data file is shared at ); the full Scala listing appears below, after the comment thread -


Thanks for your answer. This is similar to what I tried, but after running the VectorAssembler I ran into new errors. Could you please take a look at this question? @kasa Could you try this code and let us know if you still get the same error?
package com.nik.spark.ml.examples.regression.randomForest

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession
import scala.Range
import org.apache.spark.ml.classification.RandomForestClassifier

object RandomForestDemo {

  def main(args: Array[String]) {
    // Optional: Use the following code below to set the Error reporting
    import org.apache.log4j._
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Spark Session
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Use Spark to read in the Titanic csv file.
    val data = spark.read.option("header", "true").option("inferSchema", "true").format("csv").load("adult-training.csv")

    // Print the Schema of the DataFrame
    data.printSchema()

    ///////////////////////
    /// Display Data /////
    /////////////////////
    val colnames = data.columns
    val firstrow = data.head(1)(0)
    println("\n")
    println("Example Data Row")
    for (ind <- Range(1, colnames.length)) {
      println(colnames(ind))
      println(firstrow(ind))
      println("\n")
    }

    ////////////////////////////////////////////////////
    //// Setting Up DataFrame for Machine Learning ////
    //////////////////////////////////////////////////
    import spark.implicits._
    // Grab only the columns we want
    val logregdataall = data.select($"income", $"workclass", $"fnlwgt", $"education", $"education-num", $"marital-status", $"occupation", $"relationship", $"race", $"sex", $"capital-gain", $"capital-loss", $"hours-per-week", $"native-country")
    val logregdata = logregdataall.na.drop()

    // A few things we need to do before Spark can accept the data!
    // Convert categorical columns into a binary vector using one hot encoder
    // We need to deal with the Categorical columns

    // Import VectorAssembler and Vectors
    import org.apache.spark.ml.feature.{ VectorAssembler, StringIndexer, VectorIndexer, OneHotEncoder }
    import org.apache.spark.ml.linalg.Vectors

    // Deal with Categorical Columns
    // Transform string type columns to string indexer 
    val workclassIndexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIndex")
    val educationIndexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
    val maritalStatusIndexer = new StringIndexer().setInputCol("marital-status").setOutputCol("maritalStatusIndex")
    val occupationIndexer = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIndex")
    val relationshipIndexer = new StringIndexer().setInputCol("relationship").setOutputCol("relationshipIndex")
    val raceIndexer = new StringIndexer().setInputCol("race").setOutputCol("raceIndex")
    val sexIndexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
    val nativeCountryIndexer = new StringIndexer().setInputCol("native-country").setOutputCol("nativeCountryIndex")
    val incomeIndexer = new StringIndexer().setInputCol("income").setOutputCol("incomeIndex")

    // Transform string type columns to string indexer 
    val workclassEncoder = new OneHotEncoder().setInputCol("workclassIndex").setOutputCol("workclassVec")
    val educationEncoder = new OneHotEncoder().setInputCol("educationIndex").setOutputCol("educationVec")
    val maritalStatusEncoder = new OneHotEncoder().setInputCol("maritalStatusIndex").setOutputCol("maritalVec")
    val occupationEncoder = new OneHotEncoder().setInputCol("occupationIndex").setOutputCol("occupationVec")
    val relationshipEncoder = new OneHotEncoder().setInputCol("relationshipIndex").setOutputCol("relationshipVec")
    val raceEncoder = new OneHotEncoder().setInputCol("raceIndex").setOutputCol("raceVec")
    val sexEncoder = new OneHotEncoder().setInputCol("sexIndex").setOutputCol("sexVec")
    val nativeCountryEncoder = new OneHotEncoder().setInputCol("nativeCountryIndex").setOutputCol("nativeCountryVec")
    val incomeEncoder = new StringIndexer().setInputCol("incomeIndex").setOutputCol("label")

    // Assemble everything together to be ("label","features") format
  /*  val assembler = (new VectorAssembler()
      .setInputCols(Array("workclassVec", "fnlwgt", "educationVec", "education-num", "maritalVec", "occupationVec", "relationshipVec", "raceVec", "sexVec", "capital-gain", "capital-loss", "hours-per-week", "nativeCountryVec"))
      .setOutputCol("features"))*/
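  // NOTE: VectorAssembler expects numeric or vector input columns, so assembling the raw
  // string columns below is likely to fail at runtime; the commented-out assembler above,
  // built on the encoded *Vec columns, is probably what is needed here.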
  val assembler = (new VectorAssembler()
      .setInputCols(Array("workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country", "income"))
      .setOutputCol("features"))
    ////////////////////////////
    /// Split the Data ////////
    //////////////////////////
    val Array(training, test) = logregdata.randomSplit(Array(0.7, 0.3), seed = 12345)

    ///////////////////////////////
    // Set Up the Pipeline ///////
    /////////////////////////////
    import org.apache.spark.ml.Pipeline

    val lr = new RandomForestClassifier().setNumTrees(10)

    //val pipeline = new Pipeline().setStages(Array(workclassIndexer, educationIndexer, maritalStatusIndexer, occupationIndexer, relationshipIndexer, raceIndexer, sexIndexer, nativeCountryIndexer, incomeIndexer, workclassEncoder, educationEncoder, maritalStatusEncoder, occupationEncoder, relationshipEncoder, raceEncoder, sexEncoder, nativeCountryEncoder, incomeEncoder, assembler, lr))
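    // NOTE: the commented-out pipeline above contains the indexer and encoder stages that
    // produce the *Vec and label columns; with only (assembler, lr) those columns are never created.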
    val pipeline = new Pipeline().setStages(Array(assembler, lr))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)
    // Get Results on Test Set
    val results = model.transform(test)

    ////////////////////////////////////
    //// MODEL EVALUATION /////////////
    //////////////////////////////////
    println("schema")
    println(results.select($"label").distinct().foreach { x => println(x) })

    // For Metrics and Evaluation
    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // Need to convert to RDD to use this
    val predictionAndLabels = results.select($"prediction", $"label").as[(Double, Double)].rdd

    // Instantiate metrics object
    val metrics = new MulticlassMetrics(predictionAndLabels)

    // Confusion matrix
    println("Confusion matrix:")
    println(metrics.confusionMatrix)
    println(metrics.accuracy)
  }
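
Regarding the VectorAssembler errors mentioned in the comment above: they often come from fitting the indexing, encoding and assembling steps on different DataFrames, so a stage cannot find the columns it expects at transform time. Chaining everything into a single Pipeline avoids that. Below is a minimal end-to-end sketch in PySpark using the column names from the question; everything else (variable names, numTrees, the split) is illustrative, not the exact code from the question or answers:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

cat_cols = ["Country", "TrafficType", "Device", "Browser", "OS"]
num_cols = ["Carrier", "Fraud"]

# One StringIndexer and one OneHotEncoder per categorical column
# (handleInvalid="keep" needs Spark 2.2+; it puts unseen labels in an extra bucket instead of failing)
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index", handleInvalid="keep") for c in cat_cols]
encoders = [OneHotEncoder(inputCol=c + "_index", outputCol=c + "_vec") for c in cat_cols]

# Assemble the numeric columns and the encoded vectors into one features column
assembler = VectorAssembler(inputCols=num_cols + [c + "_vec" for c in cat_cols], outputCol="features")

# Random forest regression on the continuous ConversionPayOut target
rf = RandomForestRegressor(labelCol="ConversionPayOut", featuresCol="features", numTrees=20)

pipeline = Pipeline(stages=indexers + encoders + [assembler, rf])

train_df, test_df = train.randomSplit([0.7, 0.3], seed=111)
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
predictions.select("ConversionPayOut", "prediction").show(5)

Because every stage is fitted inside the same Pipeline on the same DataFrame, each stage sees exactly the columns produced by the previous one.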
}