Apache Spark: how to use different schemas in Parquet partitions

Tags: apache-spark, apache-spark-sql, parquet


I read JSON files into a DataFrame. The JSON can contain a struct field, messages, whose contents are specific to the name, as shown below.

Json1
{
   "ts":"2020-05-17T00:00:03Z",
   "name":"foo",
   "messages":[
      {
         "a":1810,
         "b":"hello",
         "c":390
      }
   ]
}

Json2
{
   "ts":"2020-05-17T00:00:03Z",
   "name":"bar",
   "messages":[
      {
         "b":"my",
         "d":"world"
      }
   ]
}

When I read the JSON data into a DataFrame, I get the following schema:

root
 |-- ts: string (nullable = true)
 |-- name: string (nullable = true)
 |-- messages: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: long (nullable = true)
 |    |    |-- d: string (nullable = true)

This is fine. Now, when I save to Parquet files partitioned by name, how can I have different schemas in the foo and bar partitions?

path/name=foo
root
 |-- ts: string (nullable = true)
 |-- name: string (nullable = true)
 |-- messages: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: long (nullable = true)

path/name=bar
root
 |-- ts: string (nullable = true)
 |-- name: string (nullable = true)
 |-- messages: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- d: string (nullable = true)

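If reading from the root path gives me a schema containing all the fields of foo and bar, that is fine. But when I read from path/name=foo, I expect to see only the foo schema.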

1. Partition and store as Parquet files:

If you save in Parquet format, then when reading path/name=foo, specify a schema that contains only the required fields (a, b, c). Spark then loads only those fields.

If you don't specify a schema, all fields (a, b, c, d) are included in the DataFrame, because the files written from the merged DataFrame still carry d as an all-null column.

EX:

schema = ...  # define a StructType containing only the required fields
spark.read.schema(schema).parquet("path/name=foo").printSchema()

2. Partition and store as JSON files:

Spark's JSON writer omits null fields, so the d field is never written to the files under path/name=foo. When you read only the name=foo directory, d is therefore not part of the inferred schema. (CSV is not an option for this data, because Spark's CSV source does not support an array-of-struct column such as messages.)

EX:

spark.read.json("path/name=foo").printSchema()
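
The explicit-schema read from point 1 could be sketched as follows. This is a minimal illustration, not the answerer's code: the helper names partition_schema_ddl and read_partition and the FOO_FIELDS mapping are hypothetical, and it assumes the data was written with partitionBy("name"), so the name column lives in the directory name rather than in the files. It relies on DataFrameReader.schema accepting a DDL-formatted string.

```python
def partition_schema_ddl(message_fields):
    """Build a DDL schema string for one partition.

    message_fields maps each field inside `messages` to its Spark SQL type.
    """
    inner = ", ".join(f"{name}: {sql_type}" for name, sql_type in message_fields.items())
    return f"ts STRING, messages ARRAY<STRUCT<{inner}>>"


# Hypothetical mapping: the fields the foo partition is expected to use.
FOO_FIELDS = {"a": "BIGINT", "b": "STRING", "c": "BIGINT"}


def read_partition(spark, partition_path, message_fields):
    """Read one partition directory, loading only the listed nested fields."""
    return spark.read.schema(partition_schema_ddl(message_fields)).parquet(partition_path)
```

With an active SparkSession, read_partition(spark, "path/name=foo", FOO_FIELDS).printSchema() should then show messages with only a, b and c. To also get the name column back when reading a single partition directory, the basePath read option can be set to the root path.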

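You can change the schema before saving the DataFrame to a partition. To do that, filter the records that belong to one partition and save them to the corresponding folder.

import pyspark.sql.functions as f

# Keep only the rows of the foo partition.
foo = df.filter(f.col('name') == 'foo')
# Select only the columns that contain at least one non-null value in this partition.
# The check covers top-level columns; null fields nested inside messages
# (d for foo; a and c for bar) are omitted by the JSON writer when saving.
foo = foo.select(*[c for c in foo.columns if foo.filter(f.col(c).isNotNull()).count() > 0])
# Then save the DataFrame.
foo.write.json('path/name=foo')

Now each partition will have a different schema.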
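
For Parquet output and many names, the same filter-then-save idea might be extended to the nested fields roughly as below. This is a sketch under assumptions, not a tested recipe: the function names non_null_fields and write_pruned_partitions are made up for illustration, it assumes Spark 3.1 or later (for pyspark.sql.functions.transform), plain field names without dots, and that every name has at least one non-empty messages array with at least one non-null field.

```python
def non_null_fields(counts):
    """Return the field names whose non-null count is positive, in input order."""
    return [field for field, n in counts.items() if n > 0]


def write_pruned_partitions(df, out_path, array_col="messages", key_col="name"):
    """Write one Parquet directory per key, keeping only the nested fields that key uses."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import functions as f

    field_names = df.schema[array_col].dataType.elementType.fieldNames()
    exploded = df.select(key_col, f.explode(array_col).alias("m"))
    # f.count(column) counts only non-null values, giving one row of counts per key.
    rows = (exploded.groupBy(key_col)
            .agg(*[f.count(f.col("m." + k)).alias(k) for k in field_names])
            .collect())
    for row in rows:
        counts = row.asDict()
        key = counts.pop(key_col)
        keep = non_null_fields(counts)
        # Rebuild each array element as a struct holding only the kept fields.
        pruned = (df.filter(f.col(key_col) == key)
                  .withColumn(array_col, f.transform(
                      array_col,
                      lambda m: f.struct(*[m[k].alias(k) for k in keep]))))
        pruned.write.mode("overwrite").parquet(f"{out_path}/{key_col}={key}")
```

One aggregation finds the used fields for all names at once, so the per-name loop only filters and writes. Since the partition directories then hold different Parquet schemas, reading from the root path would likely need the mergeSchema read option to see the union of all fields.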
Comments:

But if you save the data using the partitionBy clause, I think the partitions will still contain the unneeded columns, just filled with nulls.

@ShubhamJain, that applies only to columnar formats; if you store JSON, null values are not stored.

What is the best way to dynamically build a DataFrame/Dataset for each name partition and then save them to path/name=? In the example above I have only 2 names, but in my input I can have 100 names.