Scala: Handling empty strings in OneHotEncoder with a Spark DataFrame
Tags: scala, apache-spark, apache-spark-mllib, apache-spark-ml, spark-csv


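I am importing a CSV file (using spark-csv) into a DataFrame that contains empty String values. When I apply the OneHotEncoder, the application crashes with the error "requirement failed: Cannot have an empty string for name." Is there a way to get around this?

I can reproduce the error with the example from the page: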

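val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""), // [...]

Yes, it is a bit tricky, but you could replace the empty string with a value that is distinct from all the other values. Note that I am using the pyspark DataFrameNaFunctions API here, but the Scala API should be similar: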
df = sqlContext.createDataFrame([(0,"a"), (1,'b'), (2, 'c'), (3,''), (4,'a'), (5, 'c')], ['id', 'category'])
df = df.na.replace('', 'EMPTY', 'category')
df.show()

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  3|   EMPTY|
|  4|       a|
|  5|       c|
+---+--------+

OneHotEncoder / OneHotEncoderEstimator does not accept an empty string as a name, so you get the following error:
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
    at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
[...]
This is what I would do (there are other ways; see @Anthony's answer).
I would create a UDF to process the empty category:
import org.apache.spark.sql.functions._

def processMissingCategory = udf[String, String] { s => if (s == "") "NA"  else s }

Then I would apply the UDF to the column:
val df = sqlContext.createDataFrame(Seq(
   (0, "a"),
   (1, "b"),
   (2, "c"),
   (3, ""),         //<- original example has "a" here
   (4, "a"),
   (5, "c")
)).toDF("id", "category")
  .withColumn("category", processMissingCategory(col("category")))

df.show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+

I hope this helps!

If the column contains null, the OneHotEncoder fails with a NullPointerException.
So I extended the UDF to translate null values as well:
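import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SQLContext

object OneHotEncoderExample {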
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OneHotEncoderExample Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // $example on$
    val df1 = sqlContext.createDataFrame(Seq(
      (0.0, "a"),
      (1.0, "b"),
      (2.0, "c"),
      (3.0, ""),
      (4.0, null),
      (5.0, "c")
    )).toDF("id", "category")


    import org.apache.spark.sql.functions.udf
    def emptyValueSubstitution = udf[String, String] {
      case "" => "NA"
      case null => "null"
      case value => value
    }
    val df = df1.withColumn("category", emptyValueSubstitution( df1("category")) )


    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)
    indexed.show()

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)
    val encoded = encoder.transform(indexed)
    encoded.show()
    // $example off$
    sc.stop()
  }
}
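
As a possible alternative to the UDF above, the same two substitutions can be sketched with the built-in DataFrameNaFunctions: `na.fill` for nulls and `na.replace` for empty strings. This is only a sketch under assumptions: the helper name `cleanCategory` is hypothetical, the replacement labels are simply the ones used in the answer, and `SparkSession` assumes Spark 2.x or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CleanCategorySketch {
  // Hypothetical helper: substitutes nulls and empty strings in one string column.
  def cleanCategory(df: DataFrame, column: String): DataFrame =
    df.na.fill("null", Seq(column))          // null -> "null"
      .na.replace(column, Map("" -> "NA"))   // ""   -> "NA"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CleanCategorySketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Small sample with one empty string and one null category
    val raw = Seq((0.0, "a"), (3.0, ""), (4.0, null: String)).toDF("id", "category")
    cleanCategory(raw, "category").show()

    spark.stop()
  }
}
```

The cleaned DataFrame can then be passed to StringIndexer and the encoder exactly as in the example above.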

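Is .na really needed here? Wouldn't df.replace('', 'EMPTY', 'category') work just as well?
Thanks for the question, David. DataFrameNaFunctions.replace and DataFrame.replace are aliases of each other.
Here we know the exact column name. What if we don't know the column names? That is, if I am loading a list of tab files into tables, how do I replace empty values with something else?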
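
For the case raised in the last comment, where the column names are not known in advance, one possible approach is to read the string columns from the schema and pass them all to `na.replace`. A minimal sketch, assuming Spark 2.x or later; `replaceEmptyStrings` and the sample column names are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StringType

object ReplaceEmptyEverywhere {
  // Hypothetical helper: finds the string columns at runtime and replaces "" in each.
  def replaceEmptyStrings(df: DataFrame, placeholder: String): DataFrame = {
    val stringCols = df.schema.fields.collect {
      case f if f.dataType == StringType => f.name
    }.toSeq
    if (stringCols.isEmpty) df
    else df.na.replace(stringCols, Map("" -> placeholder))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReplaceEmptyEverywhere")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Column names here are illustrative only
    val raw = Seq((0, "a", ""), (1, "", "x")).toDF("id", "c1", "c2")
    replaceEmptyStrings(raw, "NA").show()

    spark.stop()
  }
}
```

Spark's Scala API documentation also describes `"*"` as a wildcard column name for `na.replace`, which may be a shorter route to the same result; the schema-based version above just makes the column selection explicit.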
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|      NA|          3.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+

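// Spark < 2.3
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// val encoded = encoder.transform(indexed)
// Spark 2.3+ (OneHotEncoderEstimator is an Estimator, so it has to be fit before transforming;
// it was renamed to OneHotEncoder in Spark 3.0)
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("categoryVec"))
val encoded = encoder.fit(indexed).transform(indexed)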

encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex|  categoryVec|
// +---+--------+-------------+-------------+
// |  0|       a|          0.0|(3,[0],[1.0])|
// |  1|       b|          2.0|(3,[2],[1.0])|
// |  2|       c|          1.0|(3,[1],[1.0])|
// |  3|      NA|          3.0|    (3,[],[])|
// |  4|       a|          0.0|(3,[0],[1.0])|
// |  5|       c|          1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
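
The `(3,[],[])` row for `NA` in the table above follows from the encoder's default `dropLast = true`: the highest index is encoded as an all-zero vector, so four categories yield vectors of size 3. The mapping can be sketched in plain Scala as follows. This is an illustration only, not Spark's implementation; `Sparse` and `oneHot` are hypothetical names.

```scala
object OneHotSketch {
  // Sparse layout in the order printed by show: (size, active indices, active values)
  final case class Sparse(size: Int, indices: Seq[Int], values: Seq[Double])

  // Hypothetical helper that mimics one-hot encoding of a category index.
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Sparse = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    // With dropLast, the last category gets no slot of its own
    val size = if (dropLast) numCategories - 1 else numCategories
    if (index < size) Sparse(size, Seq(index), Seq(1.0))
    else Sparse(size, Seq.empty, Seq.empty)
  }

  def main(args: Array[String]): Unit = {
    // Four category indices 0..3, as in the categoryIndex column above
    (0 to 3).foreach(i => println(s"$i -> ${oneHot(i, 4)}"))
  }
}
```

Calling `setDropLast(false)` on the encoder, as the other example in this thread does, corresponds to `dropLast = false` here: every category then keeps its own slot.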

df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+
