Apache Spark: an efficient way to partially update records in a DataFrame


apache-spark dataframe hive hbase


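I have a system that accumulates batch data in a snapshot.

Each record in a batch contains a unique_id, a version, and several other columns.

Previously, whenever a unique_id arrived in a new batch with a version greater than the one in the snapshot, the system replaced the entire record and rewrote it as a new record. This is essentially a version-based merge of two DataFrames.

For example: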

 Snapshot: <Uid>   <Version> <col1> <col2>
           -----------------
              A1  | 1     |  ab | cd
              A2  | 1     |  ef | gh

 New Batch: <Uid>  <Version> <col1> 
           ------------------
              A3  | 1     |  gh
              A1  | 2     |  hh

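Note that for unique id A1 the col2 value stays intact, since this batch carries no col2.

That is unlike the case where the batch carries the full record for A1, as in: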

New Batch: <Uid>  <Version> <col1> <col2>
           ------------------
            A1  | 2     |  hh  | uu

Here, the entire A1 record is replaced.
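The whole-record replacement described above could be sketched roughly as follows. This is only an illustration, not the system's actual code: the column names `uid` and `version`, the helper name `mergeWholeRecords`, and the local `SparkSession` are all assumptions, and the snippet is written script-style (for example for pasting into spark-shell).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper: for every uid keep only the row with the highest version.
// Assumes snapshot and batch have the same columns; equal versions are broken arbitrarily.
def mergeWholeRecords(snapshot: DataFrame, batch: DataFrame): DataFrame = {
  val latestFirst = Window.partitionBy("uid").orderBy(col("version").desc)
  snapshot
    .unionByName(batch)
    .withColumn("rn", row_number().over(latestFirst))
    .where(col("rn") === 1)
    .drop("rn")
}

val spark = SparkSession.builder().master("local[1]").appName("whole-record-merge").getOrCreate()
import spark.implicits._

// Sample data taken from the tables in the question.
val snapshot = Seq(("A1", 1, "ab", "cd"), ("A2", 1, "ef", "gh")).toDF("uid", "version", "col1", "col2")
val batch    = Seq(("A1", 2, "hh", "uu")).toDF("uid", "version", "col1", "col2")

val merged = mergeWholeRecords(snapshot, batch)
merged.orderBy("uid").show()
```

With this approach A1 is taken entirely from the batch, which is why it does not fit a batch that supplies only some of the columns.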

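In the current system I use Spark and store the data as Parquet. I can tweak the merge process to incorporate this change.

However, I would like to know whether this is the best way to store data for these use cases.

I am evaluating HBase and Hive ORC, along with the changes that might be needed in the merge process.

Any suggestions would be much appreciated.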

As far as I understand, you need a full outer join between the snapshot and the journal (delta), followed by coalesce, for example:

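  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{coalesce, when}

  // Assumes the journal holds at most one row per key (already deduplicated to the latest version).
  def applyDeduplicatedJournal(snapshot: DataFrame, journal: DataFrame, joinColumnNames: Seq[String]): DataFrame = {

    // Match snapshot and journal rows on every key column.
    val joinExpr = joinColumnNames
      .map(column => snapshot(column) === journal(column))
      .reduceLeft(_ && _)

    // True when the joined row has no journal side (all journal keys are null).
    val isThereNoJournalRecord = joinColumnNames
      .map(jCol => journal(jCol).isNull)
      .reduceLeft(_ && _)

    // Prefer the journal value and fall back to the snapshot value when it is null.
    // Columns the journal does not have at all (a partial batch) come from the snapshot.
    val selectClause = snapshot.columns.map { name =>
      if (journal.columns.contains(name))
        when(isThereNoJournalRecord, snapshot(name)).otherwise(coalesce(journal(name), snapshot(name))) as name
      else
        snapshot(name) as name
    }

    snapshot
      .join(journal, joinExpr, "full_outer")
      .select(selectClause: _*)
  }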

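In this case you merge the snapshot with the journal, falling back to the snapshot value whenever the journal value is null.

Hope that helps.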
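The question also requires that a batch row only wins when its version is greater than the snapshot's. One way that guard might be combined with the coalesce fallback is sketched below. It is a sketch under assumptions, not a definitive implementation: the column names `uid` and `version`, the helper name `applyPartialUpdates`, and the local `SparkSession` are hypothetical, and each side is assumed to hold one row per uid.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

// Hypothetical helper: version-aware partial update of a snapshot from a batch.
def applyPartialUpdates(snapshot: DataFrame, batch: DataFrame): DataFrame = {
  // Add the columns the batch lacks as typed nulls so both sides share the snapshot's columns.
  val aligned = snapshot.schema.fields.foldLeft(batch) { (df, f) =>
    if (df.columns.contains(f.name)) df else df.withColumn(f.name, lit(null).cast(f.dataType))
  }
  val s = snapshot.alias("s")
  val b = aligned.alias("b")

  // The batch wins when the uid is new or its version is strictly greater.
  val batchWins = col("s.uid").isNull || col("b.version") > col("s.version")

  // Where the batch wins, take its value and fall back to the snapshot for nulls.
  val merged = snapshot.columns.map { name =>
    when(batchWins, coalesce(col(s"b.$name"), col(s"s.$name")))
      .otherwise(col(s"s.$name"))
      .as(name)
  }
  s.join(b, col("s.uid") === col("b.uid"), "full_outer").select(merged: _*)
}

val spark = SparkSession.builder().master("local[1]").appName("partial-update").getOrCreate()
import spark.implicits._

// Sample data taken from the tables in the question; the batch has no col2.
val snapshot = Seq(("A1", 1, "ab", "cd"), ("A2", 1, "ef", "gh")).toDF("uid", "version", "col1", "col2")
val batch    = Seq(("A3", 1, "gh"), ("A1", 2, "hh")).toDF("uid", "version", "col1")

val updated = applyPartialUpdates(snapshot, batch)
updated.orderBy("uid").show()
```

On the sample data, A1 picks up version 2 and col1 `hh` while keeping col2 `cd`, A2 is unchanged, and A3 is added with a null col2. A limitation of any coalesce-based merge is that a null in the batch cannot be told apart from a column that was not supplied, so a batch cannot reset a value to null this way.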

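So what if there is a new A1? You need to add that to the example.
Thanks for pointing that out. I have updated the question.
Two different batch formats, right?
Not sure I got the question here.
Different schemas in the batches.
OK, harder but not obvious.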

            <Uid>  <Version> <col1> <col2>
           ------------------
              A1  | 2     |  hh  | uu
              A2  | 1     |  ef  | gh
