Pyspark: Flattening a nested AVRO schema with struct and array types

Below is my code for flattening a nested avro schema with struct and array types. The problem is that when I call the function get_flatten_col(new_schema), it does not give me the output as a list of column-name strings, but instead as a mix of Column objects, like:

[Column, Column, Column, ... Column]

I am using an avsc file because we need to do a schema comparison between the avsc and the data file. Please let me know if any more information is needed.

df = spark.read.format("avro").load("s3://tsb-datalake-dev-artifacts/COP_FEED_20200518.avro")
df.schema.json()
df.schema

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.functions import col

def normalise_field(raw):
    return raw.strip().lower() \
        .replace('`', '') \
        .replace('-', '_') \
        .replace(' ', '_') \
        .replace('.', '_') \
        .strip('_')

def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = "%s.`%s`" % (prefix, field.name) if prefix else "`%s`" % field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(col(name).alias(normalise_field(name)))
    return fields


sch = df.select(flatten(df.schema))
sch.columns

import json
avsc_json = df.schema.json()
from pyspark.sql.types import StructType
new_schema = StructType.fromJson(json.loads(avsc_json))

def get_flatten_col(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = "%s.`%s`" % (prefix, field.name) if prefix else "`%s`" % field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            print("name : " + str(name))
            print(type(name))
            fields.append(str(normalise_field(name)))
    return fields