Flattening a struct type in Scala

scala apache-spark dataframe

I am trying to create a list from a struct type in a Spark DataFrame. The schema looks like this:

root
|-- plotList: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- plot: struct (nullable = true)
|    |-- test: struct (nullable = true)
|    |    |-- body: string (nullable = true)
|    |    |-- colorPair: struct (nullable = true)
|    |    |    |-- background: string (nullable = true)
|    |    |    |-- foreground: string (nullable = true)
|    |    |-- eta: struct (nullable = true)
|    |    |    |-- etaText: string (nullable = true)
|    |    |    |-- etaType: string (nullable = true)
|    |    |    |-- etaValue: string (nullable = true)
|    |    |-- headline: string (nullable = true)
|    |    |-- plotType: string (nullable = true)
|    |    |-- priority: long (nullable = true)
|    |    |-- plotCategory: string (nullable = true)
|    |    |-- productType: string (nullable = true)
|    |    |-- theme: string (nullable = true)
|    |-- temp: struct (nullable = true)
|    |    |-- body: string (nullable = true)
|    |    |-- colorPair: struct (nullable = true)
|    |    |    |-- background: string (nullable = true)
|    |    |    |-- foreground: string (nullable = true)
|    |    |-- eta: struct (nullable = true)
|    |    |    |-- etaText: string (nullable = true)
|    |    |    |-- etaType: string (nullable = true)
|    |    |    |-- etaValue: string (nullable = true)
|    |    |-- headline: string (nullable = true)
|    |    |-- logo: string (nullable = true)
|    |    |-- plotType: string (nullable = true)
|    |    |-- priority: long (nullable = true)
|    |    |-- plotCategory: string (nullable = true)
|    |    |-- productType: string (nullable = true)
|    |    |-- theme: string (nullable = true)

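I am trying to write a UDF that converts the plot column into a list of elements so that it can be exploded in the next step, something along the lines of plot --> [test, temp], where I can pick specific columns from test and temp. Any pointer in the right direction would be much appreciated. I have tried several UDF variants, but none of them seems to work.

Edit:

I want to create flattened structs from the sub-columns of the plot column. I am thinking of using case classes for this. Something like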

case class ColorPair(back: String, fore: String)
case class Eta(EtaText: String, EtaType: String, EtaValue: String)
// priority is a long in the schema; the second plotType was a duplicate parameter name (productType in the schema)
case class Plot(body: String, colorPair: ColorPair, eta: Eta, headline: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)

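So basically, at the end of this I expect something like

List(Plot)

which I can then explode in the following steps. Because explode does not work directly on struct types, I have to go through this conversion. In the Python world I could easily read this column as a dictionary, but there is nothing similar in Scala (as far as I know).
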
If I understand correctly, you are looking for a way to iterate over the schema and, whenever colorPair or eta is found, return the following fields:
plot.test.colorPair
plot.test.eta
plot.temp.colorPair
plot.temp.eta

To generate the data (schema) for your case, I wrote the following code:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

case class Eta(etaText: String, etaType: String, etaValue: String)
case class ColorPair(background: String, foreground: String)
case class Test(body: String, colorPair: ColorPair, eta: Eta, headline: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)
case class Temp(body: String, colorPair: ColorPair, eta: Eta, headline: String, logo: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)
case class Plot(test: Test, temp: Temp)
case class Root(plotList: Array[String], plot: Plot)

// Derive the Spark schema from the Root case class and print it as a tree.
def getSchema(): StructType = {
  val schema = ScalaReflection.schemaFor[Root].dataType.asInstanceOf[StructType]

  schema.printTreeString()
  schema
}

This produces the following output:
root
 |-- plotList: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- plot: struct (nullable = true)
 |    |-- test: struct (nullable = true)
 |    |    |-- body: string (nullable = true)
 |    |    |-- colorPair: struct (nullable = true)
 |    |    |    |-- background: string (nullable = true)
 |    |    |    |-- foreground: string (nullable = true)
 |    |    |-- eta: struct (nullable = true)
 |    |    |    |-- etaText: string (nullable = true)
 |    |    |    |-- etaType: string (nullable = true)
 |    |    |    |-- etaValue: string (nullable = true)
 |    |    |-- headline: string (nullable = true)
 |    |    |-- plotType: string (nullable = true)
 |    |    |-- priority: long (nullable = false)
 |    |    |-- plotCategory: string (nullable = true)
 |    |    |-- productType: string (nullable = true)
 |    |    |-- theme: string (nullable = true)
 |    |-- temp: struct (nullable = true)
 |    |    |-- body: string (nullable = true)
 |    |    |-- colorPair: struct (nullable = true)
 |    |    |    |-- background: string (nullable = true)
 |    |    |    |-- foreground: string (nullable = true)
 |    |    |-- eta: struct (nullable = true)
 |    |    |    |-- etaText: string (nullable = true)
 |    |    |    |-- etaType: string (nullable = true)
 |    |    |    |-- etaValue: string (nullable = true)
 |    |    |-- headline: string (nullable = true)
 |    |    |-- logo: string (nullable = true)
 |    |    |-- plotType: string (nullable = true)
 |    |    |-- priority: long (nullable = false)
 |    |    |-- plotCategory: string (nullable = true)
 |    |    |-- productType: string (nullable = true)
 |    |    |-- theme: string (nullable = true)
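
As a side note, ScalaReflection lives in the catalyst package, which Spark treats as internal. A minimal sketch of the same idea using the public Encoders.product API is shown below. The case classes RootLite, PlotLite, ChildLite and EtaLite, and the object name SchemaFromEncoder, are trimmed, hypothetical stand-ins for the ones in the answer, not part of the original code.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Trimmed, hypothetical stand-ins for the case classes used in the answer.
case class EtaLite(etaText: String, etaType: String)
case class ChildLite(body: String, eta: EtaLite, priority: Long)
case class PlotLite(test: ChildLite, temp: ChildLite)
case class RootLite(plotList: Array[String], plot: PlotLite)

object SchemaFromEncoder {
  // Encoders.product derives the StructType of a case class without touching catalyst internals.
  val schema: StructType = Encoders.product[RootLite].schema

  def main(args: Array[String]): Unit = schema.printTreeString()
}
```

No SparkSession is needed for this step, because only the schema is derived and no data is read.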

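Finally, the following code should flatten the required fields:
import org.apache.spark.sql.types.StructType

// Walk the schema and return the fully qualified paths of the child fields
// of every struct field whose name appears in targetFields.
def flattenSchema(schema: StructType, targetFields: List[String], prefix: String = null): Array[String] =
  schema.fields.flatMap { f =>
    val colName = if (prefix == null) f.name else s"$prefix.${f.name}"

    f.dataType match {
      case st: StructType =>
        val found = st.fields.filter(s => targetFields.contains(s.name))

        if (found.isEmpty) {
          // No target field at this level: descend one level deeper.
          flattenSchema(st, targetFields, colName)
        } else {
          // Expand each target field into the paths of its children.
          found.flatMap { sf =>
            sf.dataType match {
              case inner: StructType => inner.fields.map(c => s"$colName.${sf.name}.${c.name}")
              // A target that is not a struct has no children: keep its own path.
              case _ => Array(s"$colName.${sf.name}")
            }
          }
        }

      case _ => Array.empty[String]
    }
  }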

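The code above scans the schema for fields whose names appear in the targetFields list and then uses flatMap to collect the paths of their child fields. With colorPair and eta as the target fields, this should be the output: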
plot.test.colorPair.background
plot.test.colorPair.foreground
plot.test.eta.etaText
plot.test.eta.etaType
plot.test.eta.etaValue
plot.temp.colorPair.background
plot.temp.colorPair.foreground
plot.temp.eta.etaText
plot.temp.eta.etaType
plot.temp.eta.etaValue
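
To show how such a list of paths might be used, here is a minimal sketch that turns the paths into aliased columns ready for a select. It assumes a compact variant of flattenSchema (using Option instead of null for the prefix) and a hypothetical miniature schema built by hand. The object name FlattenUsage and the underscore-based alias scheme are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object FlattenUsage {
  // Compact variant of the flattenSchema function from the answer.
  def flattenSchema(schema: StructType, targets: List[String], prefix: Option[String] = None): Array[String] =
    schema.fields.flatMap { f =>
      val path = prefix.fold(f.name)(p => s"$p.${f.name}")
      f.dataType match {
        case st: StructType =>
          val found = st.fields.filter(c => targets.contains(c.name))
          if (found.isEmpty) flattenSchema(st, targets, Some(path))
          else found.flatMap { t =>
            t.dataType match {
              case inner: StructType => inner.fieldNames.map(n => s"$path.${t.name}.$n")
              case _                 => Array(s"$path.${t.name}")
            }
          }
        case _ => Array.empty[String]
      }
    }

  // Hypothetical miniature of the schema in the question.
  private val eta = StructType(Seq(StructField("etaText", StringType), StructField("etaType", StringType)))
  private val child = StructType(Seq(StructField("body", StringType), StructField("eta", eta)))
  val schema: StructType = StructType(Seq(
    StructField("plotList", ArrayType(StringType)),
    StructField("plot", StructType(Seq(StructField("test", child), StructField("temp", child))))
  ))

  val paths: Array[String] = flattenSchema(schema, List("eta"))

  // Leaf names repeat across test and temp, so each column gets a unique alias,
  // e.g. plot.test.eta.etaText becomes plot_test_eta_etaText.
  val columns: Array[Column] = paths.map(p => col(p).as(p.replace('.', '_')))

  def main(args: Array[String]): Unit = paths.foreach(println)
}
```

Given a DataFrame df with a matching schema (df is assumed here, it is not defined above), the flattened view would be df.select(FlattenUsage.columns: _*).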

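Could you tell us what you have tried and why it is not satisfactory? That would really help us understand what you are trying to do. From what I can see, df.select($"plotList", array($"plot.test", $"plot.temp") as "plot") would work, but I am not sure I understand what you need.

@Oli I am trying to generalize the select operation you mentioned. Essentially, I am trying to select certain sub-columns from test and temp (say eta and plotType), like [(eta, plotType), (eta, plotType)], and I will explode this structure at a later stage. I also tried the select operation, but that is not what I want. Also, in the actual dataset I have many more elements than temp and test, so I cannot go through them all by hand.

I think it would help if you could 1. reduce your schema to a minimal instance of the problem (we probably do not need all these fields to solve it) and 2. provide an example of what you are trying to achieve.

@Oli The example shown is the minimal instance needed for this problem: a mix of nested struct types and string types. I have updated the question with what I hope to achieve here. Thank you very much for your attention.

Thanks for the input, @Alexandros. I want to use this in the next stage to select columns, so I updated the code to return Array[String](f.name) in case the type is not a struct. I tried it, but for some reason the code does not get as far as the fourth level (plot.test.eta.etaText or similar). Am I missing something?

Hi @prateek, you have to use the result in a select statement, i.e. df.select("data.plot.test.eta.etaText"). In your case, do you have JSON data?

Yes, I got it working. Although the explode of the nested struct works as expected, it is not what I was looking for. I will probably post another question related to this. I am accepting your answer, as it is very helpful for many other use cases. Thank you very much!!

Great, this will traverse any schema and extract the paths for colorPair and eta. Also, a cleaner implementation of DataFrameFlatter can be found at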
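
The goal described in the comments, picking the same sub-fields from every child of plot and exploding the result without listing the children by hand, could be sketched as follows. This is only one possible approach under stated assumptions: every child struct contains the requested sub-fields with identical types, and field names contain no dots. The helper explodeChildren, the object ExplodeChildren and the extra source field are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
import org.apache.spark.sql.types.StructType

object ExplodeChildren {
  // Replace the struct column `parent` by one row per child struct,
  // keeping only `subFields` plus the child's name in a "source" field.
  def explodeChildren(df: DataFrame, parent: String, subFields: Seq[String]): DataFrame = {
    // Read the child names (test, temp, ...) from the schema instead of hard-coding them.
    val children: Seq[String] = df.schema(parent).dataType match {
      case st: StructType => st.fieldNames.toSeq
      case other          => throw new IllegalArgumentException(s"$parent is not a struct: $other")
    }

    // Build one struct of identical shape per child so that they fit into a single array.
    val elements = children.map { child =>
      val picked = subFields.map(f => col(s"$parent.$child.$f").as(f))
      struct((lit(child).as("source") +: picked): _*)
    }

    // explode works on arrays, so the structs are wrapped in an array first.
    df.withColumn(parent, explode(array(elements: _*)))
  }
}
```

A call such as ExplodeChildren.explodeChildren(df, "plot", Seq("eta", "plotType")) would then yield one row per child of plot. The structs are reshaped to a common set of fields because array requires all of its elements to have the same type, which test and temp do not have as they stand (temp has an extra logo field).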