Scala Spark OneHotEncoder: how to exclude a user-defined category?


Consider the following Spark DataFrame:

df.printSchema()

     |-- predictor: double (nullable = true)
     |-- label: double (nullable = true)
     |-- date: string (nullable = true)

df.show(6)

    predictor      label              date    
    4.23           6.33               20160510
    4.77           7.18               20160510
    4.09           5.94               20160511
    4.23           6.33               20160511
    4.77           7.18               20160512
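    4.09           5.94               20160512

Basically, my DataFrame consists of data with a daily frequency. I need to map the date column to a column of binary vectors. This is easy to do with StringIndexer and OneHotEncoder: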

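import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map each distinct date string to a numeric index
val dateIndexer = new StringIndexer()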
  .setInputCol("date")
  .setOutputCol("dateIndex")
  .fit(df)
val indexed = dateIndexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("dateIndex")
  .setOutputCol("date_codeVec")

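// Spark 1.x/2.x API; on Spark 3.x OneHotEncoder is an Estimator, so use encoder.fit(indexed).transform(indexed)
val encoded = encoder.transform(indexed)

My problem, however, is that I need to drop the category associated with the first date in the DataFrame (20160510 in the example above), because I need to compute a time trend relative to the first date.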


For the example above (note that my DataFrame contains more than 100 dates), how can I do this?

You can try setting setDropLast to false, so that every date keeps its own position in the encoded vector:

val encoder = new OneHotEncoder()
  .setInputCol("dateIndex")
  .setOutputCol("date_codeVec")
  .setDropLast(false)

val encoded = encoder.transform(indexed)

and then use VectorSlicer to drop the unwanted level manually:

import org.apache.spark.ml.feature.VectorSlicer

val slicer = new VectorSlicer()
  .setInputCol("date_codeVec")
  .setOutputCol("data_codeVec_selected")
  .setNames(dateIndexer.labels.diff(Seq(dateIndexer.labels.min)))

slicer.transform(encoded)
+---------+-----+--------+---------+-------------+---------------------+
|predictor|label|    date|dateIndex| date_codeVec|data_codeVec_selected|
+---------+-----+--------+---------+-------------+---------------------+
|     4.23| 6.33|20160510|      0.0|(3,[0],[1.0])|            (2,[],[])|
|     4.77| 7.18|20160510|      0.0|(3,[0],[1.0])|            (2,[],[])|
|     4.09| 5.94|20160511|      2.0|(3,[2],[1.0])|        (2,[1],[1.0])|
|     4.23| 6.33|20160511|      2.0|(3,[2],[1.0])|        (2,[1],[1.0])|
|     4.77| 7.18|20160512|      1.0|(3,[1],[1.0])|        (2,[0],[1.0])|
|     4.09| 5.94|20160512|      1.0|(3,[1],[1.0])|        (2,[0],[1.0])|
+---------+-----+--------+---------+-------------+---------------------+
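
The selection passed to setNames can be checked without a Spark session. Below is a minimal sketch in plain Scala. The hard-coded labels array is an assumption that stands in for dateIndexer.labels, ordered to match the dateIndex column in the output above, and the object name DropFirstDate is hypothetical. It also derives the equivalent positions, which could be passed to VectorSlicer's setIndices instead of setNames.

```scala
object DropFirstDate {
  // Assumed stand-in for dateIndexer.labels; position i is the label with dateIndex i
  val labels: Array[String] = Array("20160510", "20160512", "20160511")

  // yyyyMMdd strings sort lexicographically in date order, so min is the earliest date
  val firstDate: String = labels.min

  // Names to keep, i.e. the argument given to VectorSlicer.setNames
  val keptNames: Array[String] = labels.diff(Seq(firstDate))

  // The same selection as positions in the full one-hot vector (for VectorSlicer.setIndices)
  val keptIndices: Array[Int] =
    labels.indices.filter(i => labels(i) != firstDate).toArray

  def main(args: Array[String]): Unit = {
    println(keptNames.mkString(", "))   // prints "20160512, 20160511"
    println(keptIndices.mkString(", ")) // prints "1, 2"
  }
}
```

The kept names come out in index order, which is why 20160512 lands at position 0 and 20160511 at position 1 of data_codeVec_selected, while rows for 20160510 become the all-zero vector and act as the baseline.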