Apache Spark: get all non-null columns of a Spark DataFrame into a single column
I need to select all NOT NULL columns from a Hive table and insert them into HBase. For example, consider the table below:
Name     | Place       | Department | Experience
==============================================
Ram      | Ramgarh     | Sales      | 14
Lakshman | Lakshmanpur | Operations |
Sita     | Sitapur     |            | 14
Ravan    |             |            | 25
I have to write all the non-null columns from the table above to HBase. So I built logic to collect the NOT NULL column names into one column of the DataFrame, as shown below. The Name column is always required.
Name     Place       Department Experience Not_null_columns
================================================================================
Ram      Ramgarh     Sales      14         Name, Place, Department, Experience
Lakshman Lakshmanpur Operations            Name, Place, Department
Sita     Sitapur                14         Name, Place, Experience
Ravan                           25         Name, Experience
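One way to build such a Not_null_columns column (a sketch, not necessarily the exact logic I used) is `when` plus `concat_ws`, since `concat_ws` silently drops nulls:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

val spark = SparkSession.builder().master("local[*]").appName("not-null-cols").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Ram",      Some("Ramgarh"),     Some("Sales"),      Some(14)),
  ("Lakshman", Some("Lakshmanpur"), Some("Operations"), None),
  ("Sita",     Some("Sitapur"),     None,               Some(14)),
  ("Ravan",    None,                None,               Some(25))
).toDF("Name", "Place", "Department", "Experience")

// when(col.isNotNull, lit(name)) yields the column name or null;
// concat_ws skips nulls, leaving only the non-null column names.
val withNames = df.withColumn("Not_null_columns",
  concat_ws(", ", df.columns.map(c => when(col(c).isNotNull, lit(c))): _*))
withNames.show(false)
```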
Now my requirement is to add a column to the DataFrame that holds the names and values of all non-null columns in a single column, like this:
Name     Place       Department Experience Not_null_columns_values
Ram      Ramgarh     Sales      14         Name: Ram, Place: Ramgarh, Department: Sales, Experience: 14
Lakshman Lakshmanpur Operations            Name: Lakshman, Place: Lakshmanpur, Department: Operations
Sita     Sitapur                14         Name: Sita, Place: Sitapur, Experience: 14
Ravan                           25         Name: Ravan, Experience: 25
Once I have this DataFrame, I will write it to HBase with Name as the key and the last column as the value.
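The write I have in mind could be sketched like this, assuming `finalDf` holds the Name column plus the combined-value column, and using the standard HBase client API (the table name `employee`, column family `cf`, and qualifier `json` are made up for the sketch):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

finalDf.rdd.foreachPartition { rows =>
  // One connection per partition: HBase connections are not serializable,
  // so they must be created on the executor side.
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("employee"))
  try {
    rows.foreach { row =>
      val put = new Put(Bytes.toBytes(row.getAs[String]("Name")))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("json"),
        Bytes.toBytes(row.getAs[String]("Not_null_columns_values")))
      table.put(put)
    }
  } finally {
    table.close()
    conn.close()
  }
}
```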
Please let me know if there is a better way to do this. Try this -
Load the provided test data
val data =
  """
    |Name     |  Place      |  Department |  Experience
    |
    |Ram      |  Ramgarh    |  Sales      |  14
    |
    |Lakshman |  Lakshmanpur|  Operations |
    |
    |Sita     |  Sitapur    |             |  14
    |
    |Ravan    |             |             |  25
  """.stripMargin

val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("^[ \t]+|[ \t]+$", "")).mkString(","))
  .toSeq.toDS()
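To see what the split/trim/mkString pipeline does, here is one sample raw line (after stripMargin) normalized into a CSV record; the missing trailing field is what the CSV reader later fills in as null:

```scala
// One raw data line after stripMargin, run through the same
// split / trim / join logic as above.
val line = "Lakshman |  Lakshmanpur|  Operations |"
val csv = line.split("\\|")
  .map(_.replaceAll("^[ \t]+|[ \t]+$", ""))  // strip leading/trailing spaces and tabs
  .mkString(",")
println(csv)  // Lakshman,Lakshmanpur,Operations
```

Note that `String.split` drops trailing empty fields, so the absent Experience value simply disappears from the record and becomes null when parsed against the 4-column header.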
val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  //.option("nullValue", "null")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
  * +--------+-----------+----------+----------+
  * |Name    |Place      |Department|Experience|
  * +--------+-----------+----------+----------+
  * |Ram     |Ramgarh    |Sales     |14        |
  * |Lakshman|Lakshmanpur|Operations|null      |
  * |Sita    |Sitapur    |null      |14        |
  * |Ravan   |null       |null      |25        |
  * +--------+-----------+----------+----------+
  *
  * root
  *  |-- Name: string (nullable = true)
  *  |-- Place: string (nullable = true)
  *  |-- Department: string (nullable = true)
  *  |-- Experience: integer (nullable = true)
  */
Convert the columns to a struct first, then to JSON

val x = df.withColumn("Not_null_columns_values",
  to_json(struct(df.columns.map(col): _*)))
x.show(false)
x.printSchema()
/**
  * +--------+-----------+----------+----------+----------------------------------------------------------------------+
  * |Name    |Place      |Department|Experience|Not_null_columns_values                                               |
  * +--------+-----------+----------+----------+----------------------------------------------------------------------+
  * |Ram     |Ramgarh    |Sales     |14        |{"Name":"Ram","Place":"Ramgarh","Department":"Sales","Experience":14} |
  * |Lakshman|Lakshmanpur|Operations|null      |{"Name":"Lakshman","Place":"Lakshmanpur","Department":"Operations"}   |
  * |Sita    |Sitapur    |null      |14        |{"Name":"Sita","Place":"Sitapur","Experience":14}                     |
  * |Ravan   |null       |null      |25        |{"Name":"Ravan","Experience":25}                                      |
  * +--------+-----------+----------+----------+----------------------------------------------------------------------+
  */
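Why this works: struct packs all columns of a row into a single struct column, and to_json serializes that struct to a JSON string, omitting any field whose value is null. A minimal standalone check of the null-dropping behavior (the sample row here is made up for the demo):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder().master("local[*]").appName("to_json-demo").getOrCreate()
import spark.implicits._

// One row with a null Department.
val demo = Seq(("Sita", Option.empty[String], Some(14)))
  .toDF("Name", "Department", "Experience")

// to_json skips null struct fields by default, so only the
// non-null columns appear in the resulting JSON string.
val json = demo.select(
  to_json(struct(demo.columns.map(col): _*)).as("j")
).as[String].head()

println(json)  // {"Name":"Sita","Experience":14}
```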
Thank you, this works like a charm. Still trying to understand the logic, as I'm new to Spark.