Apache Spark: how to convert a mix of text and numeric data into feature vectors

Tags: apache-spark, apache-spark-mllib, feature-selection

I have a CSV containing a mix of text and numeric data. I need to convert it into feature vector data (double values) in Spark. Is there a way to do this?

I have seen some examples where each keyword is mapped to a double value and the data is converted using that mapping. However, this becomes hard to do when there are many keywords.

Is there another way? I see that Spark provides feature extractors that convert data into feature vectors. Can someone give an example?

48, Private, 105808, 9th, 5, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K
42, Private, 169995, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K

In the end I did it like this: I iterate over each text column's distinct values, build a map with each value as a key, and assign an incrementing double counter as the mapped value.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assign an incrementing Double index (starting at 1.0) to every
// distinct string in the input RDD and return the lookup table.
def createMap(data: RDD[String]): Map[String, Double] = {
  var mapData: Map[String, Double] = Map()
  var counter = 0.0
  data.collect().foreach { item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}
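
For illustration only (this example is not part of the original code), given a small hand-made RDD, createMap would assign indices like this; the exact numbers depend on the order in which collect() returns the items:

// hypothetical input, just to show how the indices are assigned
val demoMap = createMap(sc.parallelize(Seq("Private", "Self-emp", "Federal-gov")))
// e.g. Map("Private" -> 1.0, "Self-emp" -> 2.0, "Federal-gov" -> 3.0)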

// Map the salary-range string to a binary class label.
def getLabelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K"  => 1
}


val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd  = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct

val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)


val featureVector = census.map { line =>
  val fields = line.split(", ")
  // fields(14) (the salary range) is used both as the label and, via salaryRangeMap, as the last feature.
  LabeledPoint(getLabelValue(fields(14)),
    Vectors.dense(
      fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble,
      gradeTypeMap(fields(3)), fields(4).toDouble, marStatusMap(fields(5)),
      jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
      genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
      fields(12).toDouble, countryMap(fields(13)), salaryRangeMap(fields(14))))
}
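
As a rough sketch of how the result could be used (this is an addition, not part of the original answer), the resulting RDD[LabeledPoint] can be handed straight to an MLlib classifier:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Split into training and test sets and train a binary classifier
// on the labeled points built above.
val Array(training, test) = featureVector.randomSplit(Array(0.8, 0.2), seed = 42L)
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()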

Have you checked StringIndexer? (Are you allowed to use ML, or is it strictly MLlib?) I would prefer to use the MLlib API. This data is a mix of text and numeric values; is there a way to convert it into feature vectors?
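
For reference, here is a minimal sketch of the StringIndexer route mentioned in the comment, using the spark.ml DataFrame API. It assumes Spark 2.x and an existing SparkSession named spark; the column names are made up, since the original CSV has no header:

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Read the CSV; ignoreLeadingWhiteSpace handles the ", " separators.
val df = spark.read
  .option("inferSchema", "true")
  .option("ignoreLeadingWhiteSpace", "true")
  .csv("/user/cloudera/census_data.txt")
  .toDF("age", "workclass", "fnlwgt", "education", "eduNum", "maritalStatus",
        "occupation", "relationship", "race", "sex", "capitalGain", "capitalLoss",
        "hoursPerWeek", "nativeCountry", "salary")

// Turn one text column into a numeric index column (repeat per text column,
// or chain several indexers in a Pipeline).
val indexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIdx")
val indexed = indexer.fit(df).transform(df)

// Combine numeric columns and the indexed column into one feature vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "workclassIdx", "fnlwgt", "eduNum", "hoursPerWeek"))
  .setOutputCol("features")
val withFeatures = assembler.transform(indexed)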