PySpark: flattening a nested Avro schema with struct and array types
Below is the code I use to flatten a nested Avro schema containing struct and array types. The problem is that when I call get_flatten_col(new_schema), it does not give me the output as a list of column-name strings; instead it comes out like [Column<...>, Column<...>, ..., Column<...>]. I am working from the .avsc file because we need to do a schema comparison between the avsc and the data files. Please let me know if any more information is needed.
df = spark.read.format("avro").load("s3://tsb-datalake-dev-artifacts/COP_FEED_20200518.avro")
df.schema.json()
df.schema
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.functions import col

def normalise_field(raw):
    return raw.strip().lower() \
        .replace('`', '') \
        .replace('-', '_') \
        .replace(' ', '_') \
        .replace('.', '_') \
        .strip('_')
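For instance, normalise_field collapses a backtick-quoted nested path into a single snake_case name (a quick standalone check, restating the function above with a made-up field path):

```python
def normalise_field(raw):
    # Same normalisation as above: lowercase, strip backticks,
    # and turn '-', ' ' and '.' separators into underscores
    return raw.strip().lower() \
        .replace('`', '') \
        .replace('-', '_') \
        .replace(' ', '_') \
        .replace('.', '_') \
        .strip('_')

print(normalise_field("`Order`.`Item-Name`"))  # order_item_name
```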
def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = "%s.`%s`" % (prefix, field.name) if prefix else "`%s`" % field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(col(name).alias(normalise_field(name)))
    return fields
sch = df.select(flatten(df.schema))
sch.columns
avsc_json = df.schema.json()
import json
from pyspark.sql.types import StructType

new_schema = StructType.fromJson(json.loads(avsc_json))
def get_flatten_col(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = "%s.`%s`" % (prefix, field.name) if prefix else "`%s`" % field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            print("name : " + str(name))
            print(type(name))
            fields.append(str(normalise_field(name)))
    return fields