Retrieving data from HBase and formatting it as a Scala DataFrame
I am trying to import data from an HBase table into the Apache Spark environment, but I don't know how to format it. Can anyone help me?
case class systems( rowkey: String, iacp: Option[String], temp: Option[String])
type Record = (String, Option[String], Option[String])
val hBaseRDD_iacp = sc.hbaseTable[Record]("test_table").select("iacp","temp").inColumnFamily("test_fam")
scala> hBaseRDD_iacp.map(x => systems(x._1,x._2,x._3)).toDF().show()
+--------------+-----------------+--------------------+
| rowkey| iacp| temp|
+--------------+-----------------+--------------------+
| ab7|0.051,0.052,0.055| 17.326,17.344,17.21|
| k6c| 0.056,NA,0.054|17.277,17.283,17.256|
| ad| NA,23.0| 24.0,23.6|
+--------------+-----------------+--------------------+
However, I actually want it in the format shown below: each comma-separated value on its own row, and each NA replaced with a null value. The values in the iacp and temp columns should be of float type. Note that each row can have a different number of comma-separated values.
Thanks in advance.
+--------------+-----------------+--------------------+
| rowkey| iacp| temp|
+--------------+-----------------+--------------------+
| ab7| 0.051| 17.326|
| ab7| 0.052| 17.344|
| ab7| 0.055| 17.21|
| k6c| 0.056| 17.277|
| k6c| null| 17.283|
| k6c| 0.054| 17.256|
| ad| null| 24.0|
|            ad|             23.0|                23.6|
+--------------+-----------------+--------------------+
Your line of code

hBaseRDD_iacp.map(x => systems(x._1, x._2, x._3)).toDF

should generate a DataFrame equivalent to the following:
val df = Seq(
("ab7", Some("0.051,0.052,0.055"), Some("17.326,17.344,17.21")),
("k6c", Some("0.056,NA,0.054"), Some("17.277,17.283,17.256")),
("ad", Some("NA,23.0"), Some("24.0,23.6"))
).toDF("rowkey", "iacp", "temp")
To transform the dataset into the desired result, you can apply a UDF that pairs up the elements of the iacp and temp CSV strings to produce an array of (Option[Double], Option[Double]), which is then exploded, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
def pairUpCSV = udf { (s1: String, s2: String) =>
  import scala.util.Try
  // Split a CSV string into numeric options; non-numeric tokens such as
  // "NA" fall through to the default case and become None (null in Spark)
  def toNumericArr(csv: String) = csv.split(",").map {
    case s if Try(s.toDouble).isSuccess => Some(s.toDouble)
    case _ => None
  }
  // zipAll pads the shorter array with None so no value is dropped
  toNumericArr(s1).zipAll(toNumericArr(s2), None, None)
}
df.
withColumn("csv_pairs", pairUpCSV($"iacp", $"temp")).
withColumn("csv_pair", explode($"csv_pairs")).
select($"rowkey", $"csv_pair._1".as("iacp"), $"csv_pair._2".as("temp")).
show(false)
// +------+-----+------+
// |rowkey|iacp |temp |
// +------+-----+------+
// |ab7 |0.051|17.326|
// |ab7 |0.052|17.344|
// |ab7 |0.055|17.21 |
// |k6c |0.056|17.277|
// |k6c |null |17.283|
// |k6c |0.054|17.256|
// |ad |null |24.0 |
// |ad |23.0 |23.6 |
// +------+-----+------+
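As a standalone illustration (plain Scala, no Spark required), the parsing step inside the UDF behaves like this:

```scala
import scala.util.Try

// Same parsing logic as inside the UDF: split a CSV string and keep
// each token as Some(double) when it parses, None otherwise
def toNumericArr(csv: String): Array[Option[Double]] = csv.split(",").map {
  case s if Try(s.toDouble).isSuccess => Some(s.toDouble)
  case _ => None
}

println(toNumericArr("0.056,NA,0.054").mkString(", "))
// Some(0.056), None, Some(0.054)
```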
Note that the value NA falls under the default case in method toNumericArr and hence is not listed as a separate case. Also, zipAll (rather than zip) is used in the UDF to cover cases where the iacp and temp CSV strings have different numbers of elements.
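To see why zipAll matters here, a small plain-Scala sketch comparing it with zip on arrays of unequal length:

```scala
// zip truncates to the shorter collection, while zipAll pads the
// shorter one with a default element (here None), so no reading is
// lost when iacp and temp have different numbers of values
val a = Array(Some(0.051), Some(0.052), Some(0.055))
val b = Array(Some(17.326), Some(17.344))

println(a.zip(b).length)                // 2: the third iacp value is dropped
println(a.zipAll(b, None, None).length) // 3: the missing temp becomes None
```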