One-hot encoding of an RDD in Scala


I have user data from the MovieLens ml-100k dataset.

Sample rows look like this -

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
I have read the data into an RDD as follows -

scala> val user_data =  sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29

scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician,    43537), Array(5, 33, F, other, 15213))
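The `split('|')` call above works the same on a plain String, so it can be checked without Spark. A quick sketch (note that passing a `Char` avoids the regex pitfall of `split("|")`, where `"|"` is regex alternation):

```scala
// One raw line from u.user, split on the pipe delimiter.
// split('|') takes a Char, so '|' is treated literally, not as a regex.
val line = "1|24|M|technician|85711"
val fields = line.split('|')

println(fields.mkString(", ")) // 1, 24, M, technician, 85711
println(fields(3))             // technician
```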


# encode distinct profession with zipWithIndex -
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31

scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7),  (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))
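For reference, the distinct/sort/`zipWithIndex` indexing above can be reproduced on a plain Scala collection; a minimal sketch with no Spark, using a few professions from the sample rows:

```scala
// Build a profession -> index map the same way the RDD pipeline does:
// distinct values, sorted, then paired with their positions.
val professions = Seq("technician", "other", "writer", "technician", "other")

val indexed = professions.distinct.sorted.zipWithIndex.toMap

println(indexed) // Map(other -> 0, technician -> 1, writer -> 2)
```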
How do I achieve one-hot encoding on an RDD in Scala? I want to perform the operation on the RDD without converting it to a DataFrame.

Any help is appreciated.


Thanks.

This is how I did it -

1) Read the user data -

scala> val user_data =  sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
2) Display 5 rows of data -

scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician,    43537), Array(5, 33, F, other, 15213))
3) Create a map from profession to index -

scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()

scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)
4) Create an encoding function that one-hot encodes a profession -

scala> def encode(x: String) = {
     |   val encodeArray = Array.fill(21)(0)
     |   encodeArray(indexed_profession(x).toInt) = 1
     |   encodeArray
     | }
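Since `encode` depends only on the collected lookup map, it can be checked without Spark. A sketch using a hand-built subset of the `indexed_profession` map (the 21-slot array matches the 21 distinct professions found above):

```scala
// Stand-in for the collected indexed_profession map (subset, for illustration)
val indexedProfession: Map[String, Long] =
  Map("other" -> 13L, "technician" -> 19L, "writer" -> 20L)

def encode(x: String): Array[Int] = {
  val encodeArray = Array.fill(21)(0)          // one slot per distinct profession
  encodeArray(indexedProfession(x).toInt) = 1  // mark this profession's slot
  encodeArray
}

println(encode("technician").indexOf(1)) // 19
```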
5) Apply the encoding function to the user data -

scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}
6) Display the encoded data -

scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
(1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
(2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
(3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
(4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
(5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
(6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))


[My solution is for a DataFrame] The following should help convert a categorical map into one-hot columns. You must create a map object, `catMap`, whose keys are column names and whose values are lists of categories.

    var OutputDf = df
    for (cat <- catMap.keys) {
      val categories = catMap(cat)
      for (oneHotVal <- categories) {
        OutputDf = OutputDf.withColumn(oneHotVal,
          when(lower(OutputDf(cat)) === oneHotVal, 1).otherwise(0))
      }
    }
    OutputDf
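The loop above can be mimicked without Spark to see the intended expansion. A sketch where each row is a plain `Map` standing in for a DataFrame row, and the case-insensitive match mirrors `when(lower(...) === oneHotVal, 1)`; all column and category names here are illustrative:

```scala
// catMap: column name -> list of categories, as the answer describes
val catMap: Map[String, Seq[String]] =
  Map("profession" -> Seq("technician", "other", "writer"))

// A single row, standing in for a DataFrame row
val row: Map[String, String] = Map("user_id" -> "1", "profession" -> "Technician")

// For each category, add a 0/1 column, matching case-insensitively
val oneHotRow = catMap.foldLeft(row) { case (r, (cat, categories)) =>
  categories.foldLeft(r) { (acc, oneHotVal) =>
    acc + (oneHotVal -> (if (r(cat).toLowerCase == oneHotVal) "1" else "0"))
  }
}

println(oneHotRow("technician")) // "1"
```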


Comments:

Before downvoting, please let users post the complete question. The incomplete question was posted accidentally after an internet disconnection.

Any reason you don't want to use Spark's built-in one-hot encoder? See the Spark 2 DataFrame API.

Somehow I skipped this thread... I will try this; it is a better solution.