Scala: Combining Spark schemas without duplicates?

Tags: scala, apache-spark, schema, apache-spark-1.6


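To process the data I have, I extract the schema beforehand, so that when I read the dataset I supply the schema instead of going through the expensive schema-inference step.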

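To build the schema, I need to merge several different schemas into one final schema, so I have been using the union (`++`) and `distinct` methods, but I keep getting an `org.apache.spark.sql.AnalysisException: Duplicate column(s)` exception.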

For example, say we have the following schemas, two of them identical and a third with an extra nested field:

val schema1 = StructType(StructField("A", StructType(
    StructField("i", StringType, true) :: Nil
    ), true) :: Nil)

val schema2 = StructType(StructField("A", StructType(
    StructField("i", StringType, true) :: Nil
    ), true) :: Nil)

val schema3 = StructType(StructField("A", StructType(
    StructField("i", StringType, true) ::
    StructField("ii", StringType, true) :: Nil
    ), true) :: Nil)

val final_schema = (schema1 ++ schema2 ++ schema3).distinct

println(final_schema)

Which outputs:

StructType(
    StructField(A,StructType(
         StructField(i,StringType,true)),true), 
    StructField(A,StructType(
        StructField(i,StringType,true),    
        StructField(ii,StringType,true)),true))

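I understand that `distinct` only filters out a schema structure that exactly matches another one. However, I would like the result to look like this: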

StructType(
    StructField(A,StructType(
        StructField(i,StringType,true),    
        StructField(ii,StringType,true)),true))


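That is, everything gets "combined" into a single schema. I have sifted through all the available methods, but I can't seem to find the right one to solve this. Any ideas?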

Edit:

The end goal is to pass `final_schema` to `sqlContext.read.schema` and read an RDD of JSON strings using the `read` method.

Try something like this:


(schema1 ++ schema2 ++ schema3).groupBy(getKey).map(_._2.head)

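Here `getKey` is a function that maps a schema to the property you want to merge on (for example, the column name or the names of its subfields). In the `map` function you can take the head, or use a more elaborate function to keep a specific schema.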
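Note that taking the head of each group keeps only the first schema seen for a key, so the nested field `ii` from the question would be dropped. Below is a minimal sketch of the "more elaborate function" idea, assuming the merge key is the field name and that nested structs should be merged recursively. The names `SchemaMerge`, `mergeTypes` and `mergeSchemas` are hypothetical, not part of Spark. When two non-struct types conflict, this sketch simply keeps the first one seen, and field metadata is not carried over.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not a Spark API: merges schemas by field name.
object SchemaMerge {

  // Structs are merged field by field; for any other pair of types the
  // left-hand (first seen) type is kept. This is an assumption of the
  // sketch: type conflicts are not detected or reported.
  def mergeTypes(left: DataType, right: DataType): DataType =
    (left, right) match {
      case (l: StructType, r: StructType) => mergeSchemas(Seq(l, r))
      case _                              => left
    }

  // Merges several schemas into one. Fields are matched by name and
  // returned in first-seen order; a field is nullable if any of its
  // occurrences is nullable. Field metadata is dropped.
  def mergeSchemas(schemas: Seq[StructType]): StructType = {
    val allFields = schemas.flatMap(_.fields.toSeq)
    val names = allFields.map(_.name).distinct
    val merged = names.map { name =>
      allFields.filter(_.name == name).reduce { (a, b) =>
        StructField(name, mergeTypes(a.dataType, b.dataType), a.nullable || b.nullable)
      }
    }
    StructType(merged)
  }
}
```

With the schemas from the question, `SchemaMerge.mergeSchemas(Seq(schema1, schema2, schema3))` should yield a single field `A` whose struct contains both `i` and `ii`. Collecting the names with `distinct` instead of `groupBy` is a deliberate choice here, because `groupBy` returns a `Map` and does not guarantee the original field order.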

Spark with Scala:

val consolidatedSchema = test1Df.schema.++:(test2Df.schema).toSet
val uniqueConsolidatedSchemas = StructType(consolidatedSchema.toSeq)


Spark with Java:


StructType consolidatedSchema = test1Df.schema().merge(test2Df.schema());

Do you know what a similar solution using PySpark would look like?