Dataframe Spark: iterate over the columns in each row to create a new dataframe


Suppose I have a dataframe like this:

+-----------+-----------+-----------+-----------+------------+--+
|   ColA    |   ColB    |   ColC    |   ColD    |    ColE    |  |
+-----------+-----------+-----------+-----------+------------+--+
| ''        | sample_1x | sample_1y | ''        | sample_1z  |  |
| sample2_x | sample2_y | ''        | ''        | ''         |  |
| sample3_x | ''        | ''        | ''        | sample3_y  |  |
| sample4_x | sample4_y | ''        | sample4_z | sample4_zz |  |
| sample5_x | ''        | ''        | ''        | ''         |  |
+-----------+-----------+-----------+-----------+------------+--+
I want to create another dataframe that shows, for each row, the relationships from left to right while skipping columns that hold empty values. Rows with only one valid column value should also be excluded. For example:

+-----------+------------+-----------+
|   From    |     To     |   Label   |
+-----------+------------+-----------+
| sample_1x | sample_1y  | ColB_ColC |
| sample_1y | sample_1z  | ColC_ColE |
| sample2_x | sample2_y  | ColA_ColB |
| sample3_x | sample3_y  | ColA_ColE |
| sample4_x | sample4_y  | ColA_ColB |
| sample4_y | sample4_z  | ColB_ColD |
| sample4_z | sample4_zz | ColD_ColE |
+-----------+------------+-----------+

I think the approach would be to write a UDF that holds this logic, but I'm not entirely sure how to return a completely new DF, since I'm used to a UDF just creating another column within the same DF. Or is there another Spark function that can handle this situation more easily than writing a UDF? Please use pyspark if needed.

Mainly using Spark SQL:

df.createOrReplaceTempView("df")
cols_df = df.columns
qry = " union ".join([f"""select {enum_cols[1]} as From,
    {cols_df[enum_cols[0]+1]} as To,
    "{enum_cols[1]}_{cols_df[enum_cols[0]+1]}" as Label
    from df where {enum_cols[1]} ... and {cols_df[enum_cols[0]+1]} ...""" for enum_cols in enumerate(cols_df) if enum_cols[0] ...
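A runnable reconstruction of what that snippet appears to build is sketched below. The "is not null" checks, the loop bound and the underscore in the label are assumptions filled in where the original is cut off, and spark is the active SparkSession. Note that this approach only ever pairs physically adjacent columns, so it cannot produce rows such as the ColA_ColE one in the expected output.

df.createOrReplaceTempView("df")
cols_df = df.columns

# one small select per pair of adjacent columns, glued together with union
# (the "is not null" predicates and the loop bound are assumed, not taken from the original)
qry = " union ".join([
    f"""select {enum_cols[1]} as From,
               {cols_df[enum_cols[0] + 1]} as To,
               '{enum_cols[1]}_{cols_df[enum_cols[0] + 1]}' as Label
        from df
        where {enum_cols[1]} is not null and {cols_df[enum_cols[0] + 1]} is not null"""
    for enum_cols in enumerate(cols_df) if enum_cols[0] < len(cols_df) - 1
])

spark.sql(qry).show()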

You can use a udf that takes an array argument and returns an array of structs, for example:
from pyspark.sql import functions as F

df.show()
+---------+---------+---------+---------+----------+
|     ColA|     ColB|     ColC|     ColD|      ColE|
+---------+---------+---------+---------+----------+
|     null|sample_1x|sample_1y|     null| sample_1z|
|sample2_x|sample2_y|     null|     null|      null|
|sample3_x|     null|     null|     null| sample3_y|
|sample4_x|sample4_y|     null|sample4_z|sample4_zz|
|sample5_x|     null|     null|     null|      null|
+---------+---------+---------+---------+----------+

# the columns involved; they will be grouped into a single array with F.array(cols)
cols = df.columns

# define a function that converts the array of row values into an array of structs
def find_route(arr, cols):
    d = [ (cols[i],e) for i,e in enumerate(arr) if e is not None ]
    return [ {'From':d[i][1], 'To':d[i+1][1], 'Label':d[i][0]+'_'+d[i+1][0]} for i in range(len(d)-1) ]

# set up the UDF and pass cols as an extra argument
udf_find_route = F.udf(lambda a: find_route(a, cols), 'array<struct<From:string,To:string,Label:string>>')

# retrieve the fields from the array of structs after the array-explode
df.select(F.explode(udf_find_route(F.array(cols))).alias('c1')).select('c1.*').show()
+---------+----------+---------+
|     From|        To|    Label|
+---------+----------+---------+
|sample_1x| sample_1y|ColB_ColC|
|sample_1y| sample_1z|ColC_ColE|
|sample2_x| sample2_y|ColA_ColB|
|sample3_x| sample3_y|ColA_ColE|
|sample4_x| sample4_y|ColA_ColB|
|sample4_y| sample4_z|ColB_ColD|
|sample4_z|sample4_zz|ColD_ColE|
+---------+----------+---------+
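Since the question also asks whether another Spark function could handle this more easily than a UDF: on Spark 2.4+ the same pairing can be expressed with the built-in higher-order functions filter, transform and sequence, keeping all the work inside the JVM. A minimal sketch under that assumption, which also assumes the empty cells are genuine nulls as in the df.show() above (for empty strings, change the predicate to x.value != ''):

from pyspark.sql import functions as F

cols = df.columns

# array<struct<name,value>> holding only the non-null cells of each row, in column order
entry_exprs = ", ".join("named_struct('name', '{0}', 'value', {0})".format(c) for c in cols)
entries = F.expr("filter(array({}), x -> x.value is not null)".format(entry_exprs))

# turn the N surviving cells of a row into N-1 (From, To, Label) structs
pairs = F.expr("""
    transform(
        sequence(1, size(entries) - 1),
        i -> named_struct(
            'From',  entries[i-1].value,
            'To',    entries[i].value,
            'Label', concat(entries[i-1].name, '_', entries[i].name)))
""")

(df.withColumn("entries", entries)
   .filter(F.size("entries") >= 2)     # drop rows with fewer than 2 valid columns
   .select(F.explode(pairs).alias("c1"))
   .select("c1.*")
   .show())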

Shouldn't there be an f before the triple-quoted query? I got an error with it, and after removing it and running again I got a ParseException instead, pointing at input 'select {' (line 1, pos 7) and showing the un-interpolated query (select {enum_cols[1]} as From, ...).

Oh, I see, the f makes it a formatted string literal. Hmm, I'm still getting a syntax error, though.

The f is for string interpolation. I just edited the answer to wrap the two column values. Does it work now, or do you still get an error? If so, which one?

Yes, a syntax error pointing at the end of the triple-quoted string.

Then just try building the equivalent SQL with the pyspark functions instead.
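For anyone hitting the same ParseException: without the f prefix Python leaves the braces untouched, so Spark receives the literal text select {enum_cols[1]} as From, ... and cannot parse it. A minimal illustration, using placeholder column names:

col_from, col_to = "ColA", "ColB"   # placeholder values for the illustration

plain        = """select {col_from} as From, {col_to} as To from df"""
interpolated = f"""select {col_from} as From, {col_to} as To from df"""

print(plain)         # select {col_from} as From, {col_to} as To from df   <- Spark cannot parse the braces
print(interpolated)  # select ColA as From, ColB as To from df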