Random forest regression with categorical inputs on PySpark
Tags: string, machine-learning, pyspark, one-hot-encoding
I have been trying to build a simple random forest regression model on PySpark. I have decent machine learning experience in R. However, ML on PySpark seems completely different to me, especially when it comes to handling categorical variables, string indexing, and OneHotEncoding (when there were only numeric variables, I was able to run RF regression just by following examples). Many examples of handling categorical variables are available, but I have had no success with them, as most of them went over my head (probably because I am not familiar with Python ML). I would be grateful to anyone who can help with this.
Here is my attempt. The loaded DataFrame (the loading code is not shown) is:
DataFrame[ID: int, Country: string, Carrier: double, TrafficType: string, ClickDate: timestamp, Device: string, Browser: string, OS: string, RefererUrl: string, UserIp: string, ConversionStatus: string, ConversionDate: string, ConversionPayOut: string, publisherId: string, subPublisherId: string, advertiserCampaignId: double, Fraud: double]
Next, I select the variables of interest:
IMP = ["Country","Carrier","TrafficType","Device","Browser","OS","Fraud","ConversionPayOut"]
train = train.fillna("XXX")
train = train.select([column for column in train.columns if column in IMP])
from pyspark.sql.types import DoubleType
train = train.withColumn("ConversionPayOut", train["ConversionPayOut"].cast("double"))
train.cache()
The output is:
DataFrame[Country: string, Carrier: double, TrafficType: string, Device: string, Browser: string, OS: string, ConversionPayOut: double, Fraud: double]
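My dependent variable is ConversionPayOut, which was a string column and is now cast to double.
This is where my confusion starts:
As far as I understand, I have to convert my categorical string-type variables into one-hot encoded vectors. My attempt has two steps: string indexing first, then one-hot encoding of the resulting indexes. The code and the output of both steps are the PySpark snippets listed further down this page, beginning at the line from pyspark.ml import Pipeline.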
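I don't know how to proceed from here. In fact, I don't know which Spark machine learning packages require this one-hot encoding and which do not.
It would be a great learning experience for all PySpark newcomers if the StackOverflow community could clarify how to move forward.

To run a random forest on the preprocessed data, you can continue with the following code: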
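from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
# Use VectorAssembler to combine all feature columns into a single vector column
assemblerInputs = ["Carrier", "Fraud", "Country_index", "TrafficType_index", "Device_index", "Browser_index", "OS_index"]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
# stages must be a list; train_catind is the DataFrame that already holds the *_index columns
pipeline = Pipeline(stages=[assembler])
df = pipeline.fit(train_catind).transform(train_catind)
df = df.withColumn("label", df["ConversionPayOut"])
# Randomly split the data into training and test sets
(train_data, test_data) = df.randomSplit([0.7, 0.3], seed=111)
# Train the random forest model
# (RandomForestRegressor replaces the classifier here because the label ConversionPayOut is continuous)
rf = RandomForestRegressor(labelCol="label", featuresCol="features")
rf_model = rf.fit(train_data)
# Make predictions on the test data
predictions = rf_model.transform(test_data)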
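Hope this helps.

Here is a comprehensive example: the Scala object RandomForestDemo, listed in full at the end of this page (the link to the shared data file is not preserved).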
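Thanks for your answer. This is similar to what I tried, but after running VectorAssembler I ran into a new error. Could you take a look at this question?
@kasa Can you try this code and let us know if you still get the same error?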
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
# Index every string column, i.e. all columns except the numeric ones
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(train) for column in list(set(train.columns)-set(['Carrier','ConversionPayOut','Fraud'])) ]
pipeline = Pipeline(stages=indexers)
train_catind = pipeline.fit(train).transform(train)
train_catind.show()
+-------+-------+-----------+-------+--------------+-------+------------------+-----+-----------------+-------------+-------------+--------+------------+
|Country|Carrier|TrafficType| Device|       Browser|     OS|  ConversionPayOut|Fraud|TrafficType_index|Country_index|Browser_index|OS_index|Device_index|
+-------+-------+-----------+-------+--------------+-------+------------------+-----+-----------------+-------------+-------------+--------+------------+
|     TH|   20.0|          A|   Lava|        chrome|Android|              41.6|  0.0|              0.0|          1.0|          0.0|     0.0|         7.0|
|     BR|  217.0|          A|     LG|        chrome|Android|        26.2680574|  0.0|              0.0|          2.0|          0.0|     0.0|         5.0|
|     TH|   20.0|          A|Generic|        chrome|Android|              41.6|  0.0|              0.0|          1.0|          0.0|     0.0|         0.0|
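The index values in the table above follow StringIndexer's default ordering, in which the most frequent label gets index 0.0, the next most frequent gets 1.0, and so on. The plain-Python sketch below imitates that rule to show where numbers like these come from. It is not Spark's implementation: `string_index` is a hypothetical helper, the `devices` list is invented, and the alphabetical tie-break is an assumption made only to keep the result deterministic.

```python
from collections import Counter


def string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Invented Device values, for illustration only.
devices = ["Generic", "Generic", "Generic", "LG", "LG", "Lava"]
mapping = string_index(devices)
indexed = [mapping[d] for d in devices]
print(mapping)
print(indexed)
```

The indexes carry no numeric meaning: a Device_index of 7.0 is not "larger" than 0.0 in any useful sense, it only marks a rarer label.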
Next, I think I have to do the OneHotEncoding of the string indexes:
from pyspark.ml.feature import OneHotEncoder, StringIndexer
indexers_ON = [OneHotEncoder(inputCol=column, outputCol=column+"_Vec") for column in filter(lambda x: x.endswith('_index'), train_catind.columns) ]
pipeline = Pipeline(stages=indexers_ON)
train_OHE = pipeline.fit(train_catind).transform(train_catind)
train_OHE.show()
+-------+-------+-----------+-------+--------------+-------+------------------+-----+-----------------+-------------+-------------+--------+------------+---------------------+-----------------+-----------------+-------------+----------------+
|Country|Carrier|TrafficType| Device|       Browser|     OS|  ConversionPayOut|Fraud|TrafficType_index|Country_index|Browser_index|OS_index|Device_index|TrafficType_index_Vec|Country_index_Vec|Browser_index_Vec| OS_index_Vec|Device_index_Vec|
+-------+-------+-----------+-------+--------------+-------+------------------+-----+-----------------+-------------+-------------+--------+------------+---------------------+-----------------+-----------------+-------------+----------------+
|     TH|   20.0|          A|   Lava|        chrome|Android|              41.6|  0.0|              0.0|          1.0|          0.0|     0.0|         7.0|        (1,[0],[1.0])|    (9,[1],[1.0])|    (5,[0],[1.0])|(1,[0],[1.0])|  (15,[7],[1.0])|
|     BR|  217.0|          A|     LG|        chrome|Android|        26.2680574|  0.0|              0.0|          2.0|          0.0|     0.0|         5.0|        (1,[0],[1.0])|    (9,[2],[1.0])|    (5,[0],[1.0])|(1,[0],[1.0])|  (15,[5],[1.0])|
|     TH|   20.0|          A|Generic|        chrome|Android|              41.6|  0.0|              0.0|          1.0|          0.0|     0.0|         0.0|        (1,[0],[1.0])|    (9,[1],[1.0])|    (5,[0],[1.0])|(1,[0],[1.0])|  (15,[0],[1.0])|
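The `_Vec` columns above are sparse vectors printed as (size, [active indices], [values]). For example, (9,[1],[1.0]) is a vector of length 9 with a single 1.0 at position 1. With OneHotEncoder's default `dropLast=True`, a column with n categories becomes a vector of length n - 1, and the last category is encoded as all zeros. The plain-Python sketch below reproduces that notation. `one_hot_sparse` is a hypothetical helper, not part of Spark, and the category counts (10 countries, 16 devices) are inferred from the vector sizes in the table under the assumption that `dropLast` was left at its default.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, active_indices, values) for a one-hot encoded category index."""
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError("index out of range for the given number of categories")
    # With drop_last, the final category has no slot of its own.
    size = num_categories - 1 if drop_last else num_categories
    if i >= size:
        return (size, [], [])  # the dropped category is the all-zero vector
    return (size, [i], [1.0])


# Country_index 1.0 with 10 categories, as in the first row of the table.
country_vec = one_hot_sparse(1.0, 10)
# Device_index 7.0 with 16 categories.
device_vec = one_hot_sparse(7.0, 16)
# The last category (index 9 of 10) is encoded as all zeros.
dropped_vec = one_hot_sparse(9.0, 10)
print(country_vec, device_vec, dropped_vec)
```

Dropping the last category avoids a redundant column, which matters for linear models; tree models do not depend on it.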
package com.nik.spark.ml.examples.regression.randomForest
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession
import scala.Range
import org.apache.spark.ml.classification.RandomForestClassifier
object RandomForestDemo {
def main(args: Array[String]): Unit = {
// Optional: Use the following code below to set the Error reporting
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)
// Spark Session
val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Use Spark to read in the adult-training.csv file.
val data = spark.read.option("header", "true").option("inferSchema", "true").format("csv").load("adult-training.csv")
// Print the Schema of the DataFrame
data.printSchema()
///////////////////////
/// Display Data /////
/////////////////////
val colnames = data.columns
val firstrow = data.head(1)(0)
println("\n")
println("Example Data Row")
for (ind <- Range(1, colnames.length)) {
println(colnames(ind))
println(firstrow(ind))
println("\n")
}
////////////////////////////////////////////////////
//// Setting Up DataFrame for Machine Learning ////
//////////////////////////////////////////////////
import spark.implicits._
// Grab only the columns we want
val logregdataall = data.select($"income", $"workclass", $"fnlwgt", $"education", $"education-num", $"marital-status", $"occupation", $"relationship", $"race", $"sex", $"capital-gain", $"capital-loss", $"hours-per-week", $"native-country")
val logregdata = logregdataall.na.drop()
// A few things we need to do before Spark can accept the data!
// Convert categorical columns into a binary vector using one hot encoder
// We need to deal with the Categorical columns
// Import VectorAssembler and Vectors
import org.apache.spark.ml.feature.{ VectorAssembler, StringIndexer, VectorIndexer, OneHotEncoder }
import org.apache.spark.ml.linalg.Vectors
// Deal with Categorical Columns
// Transform string type columns to string indexer
val workclassIndexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIndex")
val educationIndexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
val maritalStatusIndexer = new StringIndexer().setInputCol("marital-status").setOutputCol("maritalStatusIndex")
val occupationIndexer = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIndex")
val relationshipIndexer = new StringIndexer().setInputCol("relationship").setOutputCol("relationshipIndex")
val raceIndexer = new StringIndexer().setInputCol("race").setOutputCol("raceIndex")
val sexIndexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
val nativeCountryIndexer = new StringIndexer().setInputCol("native-country").setOutputCol("nativeCountryIndex")
val incomeIndexer = new StringIndexer().setInputCol("income").setOutputCol("incomeIndex")
// One-hot encode the indexed columns into binary vectors
val workclassEncoder = new OneHotEncoder().setInputCol("workclassIndex").setOutputCol("workclassVec")
val educationEncoder = new OneHotEncoder().setInputCol("educationIndex").setOutputCol("educationVec")
val maritalStatusEncoder = new OneHotEncoder().setInputCol("maritalStatusIndex").setOutputCol("maritalVec")
val occupationEncoder = new OneHotEncoder().setInputCol("occupationIndex").setOutputCol("occupationVec")
val relationshipEncoder = new OneHotEncoder().setInputCol("relationshipIndex").setOutputCol("relationshipVec")
val raceEncoder = new OneHotEncoder().setInputCol("raceIndex").setOutputCol("raceVec")
val sexEncoder = new OneHotEncoder().setInputCol("sexIndex").setOutputCol("sexVec")
val nativeCountryEncoder = new OneHotEncoder().setInputCol("nativeCountryIndex").setOutputCol("nativeCountryVec")
val incomeEncoder = new StringIndexer().setInputCol("incomeIndex").setOutputCol("label")
// Assemble everything together to be ("label","features") format
// VectorAssembler accepts only numeric, boolean and vector columns, so it gets the one-hot
// vectors and the numeric columns, not the raw string columns (and never the target "income")
val assembler = (new VectorAssembler()
.setInputCols(Array("workclassVec", "fnlwgt", "educationVec", "education-num", "maritalVec", "occupationVec", "relationshipVec", "raceVec", "sexVec", "capital-gain", "capital-loss", "hours-per-week", "nativeCountryVec"))
.setOutputCol("features"))
////////////////////////////
/// Split the Data ////////
//////////////////////////
val Array(training, test) = logregdata.randomSplit(Array(0.7, 0.3), seed = 12345)
///////////////////////////////
// Set Up the Pipeline ///////
/////////////////////////////
import org.apache.spark.ml.Pipeline
val lr = new RandomForestClassifier().setNumTrees(10)
// Indexers run first, then the encoders (incomeEncoder produces "label"), then the assembler and the classifier
val pipeline = new Pipeline().setStages(Array(workclassIndexer, educationIndexer, maritalStatusIndexer, occupationIndexer, relationshipIndexer, raceIndexer, sexIndexer, nativeCountryIndexer, incomeIndexer, workclassEncoder, educationEncoder, maritalStatusEncoder, occupationEncoder, relationshipEncoder, raceEncoder, sexEncoder, nativeCountryEncoder, incomeEncoder, assembler, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
// Get Results on Test Set
val results = model.transform(test)
////////////////////////////////////
//// MODEL EVALUATION /////////////
//////////////////////////////////
println("schema")
// Show the distinct label values (the original wrapped a foreach in println, which only printed "()")
results.select($"label").distinct().show()
// For Metrics and Evaluation
import org.apache.spark.mllib.evaluation.MulticlassMetrics
// Need to convert to RDD to use this
val predictionAndLabels = results.select($"prediction", $"label").as[(Double, Double)].rdd
// Instantiate metrics object
val metrics = new MulticlassMetrics(predictionAndLabels)
// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)
println(metrics.accuracy)
}
}