Scala: How do I parse a CSV into a Dataset based on a case class?
I'm trying to parse a CSV with the new Spark 1.6.0 Dataset API, but I'm having some trouble with it. I want to build a case class instance for each CSV line. Here is the code:
case class MyData (forename:String, surname:String, age:Integer)
def toMyData(text: String): Dataset[MyData] = {
  val splits: Array[String] = text.split("\t")
  Seq(MyData(
    forename = splits(0),
    surname = splits(1),
    age = splits(2).asInstanceOf[Integer]
  )).toDS()
}
val lines:Dataset[MyData] = sqlContext.read.text("/data/mydata.csv").as[MyData]
lines.map(r => toMyData(r)).foreach(println)
My toMyData is just meant to act as a kind of encoder, but I don't know how to do this correctly with the API.
Any ideas?
EDIT:
I changed the code this way, but I can't even get it to compile:
val lines:Dataset[MyData] = sqlContext.read.text("/data/mydata.csv").as[MyData]
lines.map(r => toMyData(r)).foreach(println)
def toMyData(text: String): Dataset[MyData] = {
  val df = sc.parallelize(Seq(text)).toDF("value")
  df.map(_.getString(0).split("\t") match {
    case Array(fn, sn, age) =>
      MyData(fn, sn, age.asInstanceOf[Integer])
  }).toDS
}
sqlContext.read.text("/data/mydata.csv").as[String].map(r => toMyData(r)).collect().foreach(println)
This is what I get:
Error:(50, 10) value toDS is not a member of org.apache.spark.rdd.RDD[MyData]
possible cause: maybe a semicolon is missing before `value toDS'?
}).toDS
^
Error:(54, 133) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
sqlContext.read.text("/data/mydata.csv").as[String].map(r => toMyData(r)).collect().foreach(println)
Ignoring format validation and exception handling:
// Simulate sqlContext.read.text("/data/mydata.csv")
val df = sc.parallelize(Seq("John\tDoe\t22")).toDF("value")
df.rdd.map(_.getString(0).split("\t") match {
  case Array(fn, sn, age) => MyData(fn, sn, age.toInt)
}).toDS
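For reference, a rough sketch of the same approach applied to the real file instead of the simulated DataFrame (assuming Spark 1.6, import sqlContext.implicits._ in scope, and exactly three tab-separated fields per line):
import sqlContext.implicits._

// Read the file, drop to an RDD[Row], parse each line into MyData, then convert back to a Dataset.
// The parsing happens inside a plain map, so no Dataset is created inside a transformation.
sqlContext.read.text("/data/mydata.csv").rdd
  .map(_.getString(0).split("\t") match {
    case Array(fn, sn, age) => MyData(fn, sn, age.toInt)
  }).toDS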
Or without converting to an RDD:
import org.apache.spark.sql.functions.regexp_extract
val pattern = "^(.*?)\t(.*?)\t(.*)$"
val exprs = Seq(
  (1, "forename", "string"), (2, "surname", "string"), (3, "age", "integer")
).map { case (i, n, t) => regexp_extract($"value", pattern, i).alias(n).cast(t) }
df
.select(exprs: _*) // Convert to (StringType, StringType, IntegerType)
.as[MyData] // cast
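As a rough end-to-end sketch (same assumptions: Spark 1.6 with sqlContext.implicits._ imported, and the exprs defined above), the simulated df can simply be replaced by the real file read:
// Hypothetical end-to-end version: read the file, extract the three typed columns,
// and view the result as a Dataset[MyData].
val parsed = sqlContext.read.text("/data/mydata.csv")
  .select(exprs: _*)
  .as[MyData]

parsed.collect().foreach(println)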
To summarize:
Don't nest actions, transformations, or distributed data structures (Datasets/RDDs).
Read up on how asInstanceOf actually works before using it; it does not apply here (see the sketch below).
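To illustrate the asInstanceOf point, a tiny sketch in plain Scala (no Spark involved):
val age = "22"
// age.asInstanceOf[Integer]   // compiles, but throws ClassCastException at runtime:
//                             // asInstanceOf only casts, it does not parse the String
age.toInt                      // 22 -- an actual String-to-Int conversion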
Do you think an approach like .as[String].transform(r => toMyData(r)) makes any sense? Anyway, I'm going to try your solution as well. Thanks.
toMyData placed inside a transformation cannot work at all. Datasets are distributed structures and cannot be nested.
No, it isn't. Check the types carefully. The map function has type Row => MyData, not Row => Dataset[MyData].
@zero323 The answer works perfectly, you just need to change val df = ... to sqlContext.read.text to get a DF with a String per line?