Apache Spark: How to handle categorical features with spark-ml?


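How do I handle categorical data with spark-ml and not spark-mllib?

Though the documentation is not very clear, it seems that classifiers (e.g. RandomForestClassifier, LogisticRegression) have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.

Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.

However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

How should I proceed?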
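
ML pipelines have a component called StringIndexer that you can use to convert your strings to Doubles in a reasonable way. The Spark documentation describes it in more detail and shows how to construct pipelines.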
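
I just want to complete Holden's answer.

Since Spark 2.3.0, OneHotEncoder has been deprecated and will be removed in 3.0.0. Use OneHotEncoderEstimator instead. (From Spark 3.0.0 on, OneHotEncoderEstimator is in turn renamed to OneHotEncoder.)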

In Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

// toDF on a Seq requires spark.implicits._ (already in scope in spark-shell)
val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")

val indexer = new StringIndexer().setInputCol("category1").setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array(indexer.getOutputCol, "category2"))
  .setOutputCols(Array("category1Vec", "category2Vec"))

val pipeline = new Pipeline().setStages(Array(indexer, encoder))

pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
// |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
// |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+

In Python:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

# `spark` is the active SparkSession (predefined in the pyspark shell)
df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], ["id", "category1", "category2"])

indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
# |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
# |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+

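Since Spark 1.4.0, MLlib also provides the OneHotEncoder feature, which maps a column of label indices to a column of binary vectors with at most a single one-value.

This encoding allows algorithms that expect continuous features, such as logistic regression, to use categorical features.

Let's consider the following DataFrame:

val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
  .toDF("id", "category")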

The first step is to create the indexed DataFrame with the StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|       a|          0.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+
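
The index values above follow label frequency: "a" occurs three times and gets 0.0, "c" occurs twice and gets 1.0, and "b" occurs once and gets 2.0. The following is a minimal pure-Python sketch of that ordering rule, not Spark's implementation. The function name fit_string_index is hypothetical, and breaking ties alphabetically is an assumption of this sketch (the sample data contains no ties).

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically,
    # which is an assumption of this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


categories = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(categories)
indexed = [mapping[c] for c in categories]

print(mapping)  # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(indexed)  # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
```

The printed list matches the categoryIndex column shown above, row by row.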

You can then encode the categoryIndex with the OneHotEncoder:

import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)

encoded.select("id", "categoryVec").show
// +---+-------------+
// | id|  categoryVec|
// +---+-------------+
// |  0|(2,[0],[1.0])|
// |  1|    (2,[],[])|
// |  2|(2,[1],[1.0])|
// |  3|(2,[0],[1.0])|
// |  4|(2,[0],[1.0])|
// |  5|(2,[1],[1.0])|
// +---+-------------+

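
The vectors in the output are printed in sparse form as (size, [indices], [values]). With three categories the vector has only two slots, because the last category is dropped by default and is represented by the all-zero vector, which is why row 1 (index 2.0) shows (2,[],[]). Here is a small pure-Python sketch of that encoding rule. The name one_hot_sparse and its parameters are hypothetical and only mimic the printed format; this is not Spark's code.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values) for a single category index."""
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError(f"index {index} is outside [0, {num_categories})")
    # Dropping the last category shortens the vector by one slot;
    # that category is then encoded as the all-zero vector.
    size = num_categories - 1 if drop_last else num_categories
    if i < size:
        return (size, [i], [1.0])
    return (size, [], [])


category_index = [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
encoded = [one_hot_sparse(i, num_categories=3) for i in category_index]
for row_id, vec in enumerate(encoded):
    print(row_id, vec)
```

The same rule accounts for the category2 columns in the earlier pipeline output: the integer values 1 to 4 are used directly as indices, so there are five possible categories (0 to 4), the vector size is 4, and the value 4 becomes (4,[],[]).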
I am going to provide an answer from another perspective, since I was wondering about this as well.

def ohcOneColumn(df, colName, debug=False):

  colsToFillNa = []

  if debug: print("Entering method ohcOneColumn")
  countUnique = df.groupBy(colName).count().count()
  if debug: print(countUnique)

  collectOnce = df.select(colName).distinct().collect()
  for uniqueValIndex in range(countUnique):
    uniqueVal = collectOnce[uniqueValIndex][0]
    if debug: print(uniqueVal)
    newColName = str(colName) + '_' + str(uniqueVal) + '_TF'
    df = df.withColumn(newColName, df[colName]==uniqueVal)
    colsToFillNa.append(newColName)
  df = df.drop(colName)
  df = df.na.fill(False, subset=colsToFillNa)
  return df

# Neither helper needs any pyspark imports: both rely only on DataFrame methods.

def detectAndLabelCat(sparkDf, minValCount=5, debug=False, excludeCols=['Target']):
  if debug: print("Entering method detectAndLabelCat")
  newDf = sparkDf
  colList = sparkDf.columns

  for colName in sparkDf.columns:
    uniqueVals = sparkDf.groupBy(colName).count()
    if debug: print(uniqueVals)
    countUnique = uniqueVals.count()
    dtype = str(sparkDf.schema[colName].dataType)
    #dtype = str(df.schema[nc].dataType)
    if (colName in excludeCols):
      if debug: print(str(colName) + ' is in the excluded columns list.')

    elif countUnique == 1:
      newDf = newDf.drop(colName)
      if debug:
        print('dropping column ' + str(colName) + ' because it only contains one unique value.')
      #end if debug
    #elif (1==2):
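    # Treat low-cardinality and string columns as categorical. startswith covers
    # "String", "StringType" and "StringType()" (the repr used by newer PySpark).
    elif countUnique < minValCount or dtype.startswith("String"):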
      if debug: 
        print(len(newDf.columns))
        oldColumns = newDf.columns
      newDf = ohcOneColumn(newDf, colName, debug=debug)
      if debug: 
        print(len(newDf.columns))
        newColumns = set(newDf.columns) - set(oldColumns)
        print('Adding:')
        print(newColumns)
        for newColumn in newColumns:
          if newColumn in newDf.columns:
            try:
              newUniqueValCount = newDf.groupBy(newColumn).count().count()
              print("There are " + str(newUniqueValCount) + " unique values in " + str(newColumn))
            except Exception:
              print('Uncaught error discussing ' + str(newColumn))
          #else:
          #  newColumns.remove(newColumn)

        print('Dropping:')
        print(set(oldColumns) - set(newDf.columns))

    else:
      if debug: print('Nothing done for column ' + str(colName))

      #end if countUnique == 1, elif countUnique other condition
    #end outer for
  return newDf
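
To make the effect of ohcOneColumn concrete without a Spark session, here is a pure-Python sketch that applies the same idea to a list of dicts standing in for a DataFrame: one boolean column per distinct value, named colName_value_TF, with the original column dropped. The function expand_one_column and the sample rows are hypothetical. Unlike the Spark version, this sketch creates no column for None values, so a row with a missing value simply gets False everywhere.

```python
def expand_one_column(rows, col_name):
    """Replace col_name with one boolean column per distinct non-null value."""
    distinct = sorted(
        {row[col_name] for row in rows if row[col_name] is not None}, key=str
    )
    expanded = []
    for row in rows:
        # Copy every column except the one being expanded.
        new_row = {k: v for k, v in row.items() if k != col_name}
        for value in distinct:
            new_row[f"{col_name}_{value}_TF"] = row[col_name] == value
        expanded.append(new_row)
    return expanded


rows = [
    {"id": 0, "category": "a"},
    {"id": 1, "category": "b"},
    {"id": 2, "category": None},
]
for row in expand_one_column(rows, "category"):
    print(row)
```

Because the resulting columns are booleans, they are among the types that VectorAssembler accepts, which is what makes this approach usable for the original question.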