Apache Spark: converting struct column names to rows in a Parquet file
I have a sample JSON data file as shown below:
{"data_id":"1234","risk_characteristics":{"indicators":{"alcohol":true,"house":true,"business":true,"familyname":true,"swimming_pool":true}}}
{"data_id":"6789","risk_characteristics":{"indicators":{"alcohol":true,"house":true,"business":false,"familyname":true}}}
{"data_id":"5678","risk_characteristics":{"indicators":{"alcohol":false}}}
I use the code below to convert the JSON file to Parquet and load it into Hive:
dataDF = spark.read.json("path/Datasmall.json")
dataDF.write.parquet("data.parquet")
parqFile = spark.read.parquet("data.parquet")
parqFile.write.saveAsTable("indicators_table", format='parquet', mode='append', path='/externalpath/indicators_table/')
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
fromHiveDF = hive_context.table("default.indicators_table")
fromHiveDF.show()
indicatorsDF = fromHiveDF.select('data_id', 'risk_characteristics.indicators')
indicatorsDF.printSchema()
root
|-- data_id: string (nullable = true)
|-- indicators: struct (nullable = true)
| |-- alcohol: boolean (nullable = true)
| |-- house: boolean (nullable = true)
| |-- business: boolean (nullable = true)
| |-- familyname: boolean (nullable = true)
indicatorsDF.show()
+-------+--------------------+
|data_id| indicators|
+-------+--------------------+
| 1234|[true, true, true...|
| 6789|[true, false, tru...|
| 5678| [false,,,,]|
+-------+--------------------+
Instead of retrieving the data as select data_id, indicators.alcohol, indicators.house, and so on,
I just want a Parquet data file with only the 3 columns below. That is, the struct field names are converted to rows under a "type" column:
data_id indicators_type indicators_value
1234 alcohol T
1234 house T
1234 business T
1234 familyname T
1234 swimming_pool T
6789 alcohol T
6789 house T
6789 business F
6789 familyname T
5678 alcohol F
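The desired reshape is essentially an unpivot (melt): each struct field becomes a (name, value) row. On the raw JSON records it can be sketched in plain Python; the helper name `melt_indicators` is mine, not part of Spark:

```python
import json

def melt_indicators(line):
    # Flatten one JSON record's indicators struct into (data_id, type, value) rows.
    rec = json.loads(line)
    ind = rec.get("risk_characteristics", {}).get("indicators", {})
    return [(rec["data_id"], k, v) for k, v in ind.items()]

rows = melt_indicators('{"data_id":"5678","risk_characteristics":{"indicators":{"alcohol":false}}}')
# rows == [("5678", "alcohol", False)]
```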
How can I do this? I am trying to accomplish it with PySpark. Also, is there a way to do this without hardcoding the literal field names? In my actual data, the struct can extend well beyond familyname, to as many as 100 fields. Many thanks.

Use stack to stack the columns:
df.show()
+-------+--------------------------+
|data_id|indicators |
+-------+--------------------------+
|1234 |[true, true, false, true] |
|6789 |[true, false, true, false]|
+-------+--------------------------+
cols = df.select('indicators.*').columns
stack_expr = 'stack(' + str(len(cols)) + ', ' + \
    ', '.join("'%s', indicators.%s" % (c, c) for c in cols) + \
    ') as (indicators_type, indicators_value)'
df2 = df.selectExpr(
'data_id',
stack_expr
)
df2.show()
+-------+---------------+----------------+
|data_id|indicators_type|indicators_value|
+-------+---------------+----------------+
| 1234| alcohol| true|
| 1234| house| true|
| 1234| business| false|
| 1234| familyname| true|
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
+-------+---------------+----------------+
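One caveat: stack requires all value expressions to share a common type, so if the struct fields are not all boolean (or Spark infers differing types from the Parquet schema), each value may need a cast. A sketch of the expression builder as a plain function (the name `build_stack_expr` and the `cast_to_string` flag are mine, not from the original answer):

```python
def build_stack_expr(columns, cast_to_string=False):
    # Build the Spark SQL stack() expression that unpivots the struct fields
    # into (indicators_type, indicators_value) rows; optionally cast values to
    # string so heterogeneous field types share one common type.
    value_tmpl = "cast(indicators.{0} as string)" if cast_to_string else "indicators.{0}"
    pairs = ", ".join("'{0}', {1}".format(c, value_tmpl.format(c)) for c in columns)
    return "stack({0}, {1}) as (indicators_type, indicators_value)".format(len(columns), pairs)
```

The resulting string is passed to `df.selectExpr('data_id', build_stack_expr(df.select('indicators.*').columns))` exactly as in the answer above.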
Another solution, using explode:
val df = spark.sql(""" with t1 as (
select 1234 data_id, named_struct('alcohol',true, 'house',false, 'business', true, 'familyname', false) as indicators
union
select 6789 data_id, named_struct('alcohol',true, 'house',false, 'business', true, 'familyname', false) as indicators
)
select * from t1
""")
df.show(false)
df.printSchema
+-------+--------------------------+
|data_id|indicators |
+-------+--------------------------+
|6789 |[true, false, true, false]|
|1234 |[true, false, true, false]|
+-------+--------------------------+
root
|-- data_id: integer (nullable = false)
|-- indicators: struct (nullable = false)
| |-- alcohol: boolean (nullable = false)
| |-- house: boolean (nullable = false)
| |-- business: boolean (nullable = false)
| |-- familyname: boolean (nullable = false)
val df2 = df.withColumn("x", explode(array(
map(lit("alcohol") ,col("indicators.alcohol")),
map(lit("house"), col("indicators.house")),
map(lit("business"), col("indicators.business")),
map(lit("familyname"), col("indicators.familyname"))
)))
df2.select(col("data_id"),map_keys(col("x"))(0), map_values(col("x"))(0)).show
+-------+--------------+----------------+
|data_id|map_keys(x)[0]|map_values(x)[0]|
+-------+--------------+----------------+
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
| 1234| alcohol| true|
| 1234| house| false|
| 1234| business| true|
| 1234| familyname| false|
+-------+--------------+----------------+
Update 1:
To get the indicators struct columns dynamically, use the approach below:
val colsx = df.select("indicators.*").columns
colsx: Array[String] = Array(alcohol, house, business, familyname)
val exp1 = colsx.map( x => s""" map("${x}", indicators.${x}) """ ).mkString(",")
val exp2 = " explode(array( " + exp1 + " )) "
val df2 = df.withColumn("x",expr(exp2))
df2.select(col("data_id"),map_keys(col("x"))(0).as("indicator_key"), map_values(col("x"))(0).as("indicator_value")).show
+-------+-------------+---------------+
|data_id|indicator_key|indicator_value|
+-------+-------------+---------------+
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
| 1234| alcohol| true|
| 1234| house| false|
| 1234| business| true|
| 1234| familyname| false|
+-------+-------------+---------------+
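For PySpark users, the same dynamic expression built in the Scala exp1/exp2 above can be assembled as a plain string and passed to F.expr; a sketch, where the helper name `build_explode_expr` is mine:

```python
def build_explode_expr(columns, struct_col="indicators"):
    # Mirror the Scala exp1/exp2 above: an array of single-entry maps, exploded
    # so each struct field becomes its own row.
    maps = ", ".join('map("{0}", {1}.{0})'.format(c, struct_col) for c in columns)
    return "explode(array({0}))".format(maps)
```

Usage would follow the Scala version, e.g. `df.withColumn("x", F.expr(build_explode_expr(df.select("indicators.*").columns)))`, then selecting `F.map_keys("x")[0]` and `F.map_values("x")[0]`.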
Comments:
Can you post sample data? @Srinivas I have added the sample data. This is helpful, but is there a way to achieve this without hardcoding the literal field names? In my actual data, the struct can extend beyond familyname, even up to 100 fields; I have added this point to the main question as well. Can you check the update in the answer? @sabra2121 can you check the update? @sabra2121 see the edited answer.
The solution works perfectly with the sample input data. However, my underlying file is in Parquet format (I converted the original JSON data to Parquet). When I apply the same code on the underlying Parquet data, I get this error: raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u"cannot resolve 'stack(39, 'alcohol', indicators.alcohol, …]' parquet". I have updated the description with more details of the input data.
Thanks a lot @mck! Posting the updated expression: stack_expr = 'stack(' + str(len(indDF.select('indicators.*').columns)) + ', ' + ', '.join(["'%s', cast(indicators.%s as string)" % (col, col) for col in indDF.select('indicators.*').columns]) + ') as (indicators_type, indicators_value)'