
Scala Apache Spark groupByKey alternative


My table has the following columns: [col1, col2, key1, col3, txn_id, dw_last_updated]. Of these, txn_id and key1 are the primary key columns. In my dataset I can have multiple records for a given (txn_id, key1) combination, and from those records I need to pick the most recent one based on dw_last_updated.

The logic I'm using is below. I keep running into memory issues, and I believe that's partly because of groupByKey()... Is there a better alternative?

case class Fact(col1: Int,
  col2: Int,
  key1: String,
  col3: Int,
  txn_id: Double,
  dw_last_updated: Long)

sc.textFile(s3path).map { row =>
    val parts = row.split("\t")
    Fact(parts(0).toInt,
      parts(1).toInt,
      parts(2),
      parts(3).toInt,
      parts(4).toDouble,
      parts(5).toLong)
  }.map { t => ((t.txn_id, t.key1), t) }.groupByKey(512).map {
    case ((txn_id, key1), sequence) =>
      val newrecord = sequence.maxBy {
        case Fact(col1, col2, key1, col3, txn_id, dw_last_updated) => dw_last_updated
      }
      (newrecord.col1 + "\t" + newrecord.col2 + "\t" + newrecord.key1 +
        "\t" + newrecord.col3 + "\t" + newrecord.txn_id + "\t" + newrecord.dw_last_updated)
  }

Appreciate your thoughts / suggestions...

rdd.groupByKey collects all the values for each key and needs enough memory to hold the entire sequence of values for a key on a single node. Its use is discouraged.

Given that we are only interested in one value per key, namely the one with max(dw_last_updated), a more memory-efficient approach is to use rdd.reduceByKey, where the reduce function keeps the maximum of two records for the same key, using that timestamp as the discriminator:

rdd.reduceByKey { case (record1, record2) => max(record1, record2) }
Applied to your case, it would look something like this:

case class Fact(...)
object Fact {
  def parse(s: String): Fact = ???
  def maxByTs(f1: Fact, f2: Fact): Fact =
    if (f1.dw_last_updated.toLong > f2.dw_last_updated.toLong) f1 else f2
}

val factById = sc.textFile(s3path).map { row =>
  val fact = Fact.parse(row)
  ((fact.txn_id, fact.key1), fact)
}
val maxFactById = factById.reduceByKey(Fact.maxByTs)

Note that I've defined the utility operations on the Fact companion object to keep the code clean. I'd also recommend giving a named variable to each transformation step, or to each logical group of steps; it makes the program more readable.
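
For completeness, here is a possible way to fill in the ??? left in Fact.parse (mirroring the question's own tab-separated parsing) and to write the deduplicated records back out as TSV lines. The field order and the s3OutputPath variable are assumptions for illustration, not part of the original post:

// Hypothetical body for Fact.parse, assuming the question's field order:
// col1, col2, key1, col3, txn_id, dw_last_updated, separated by tabs.
def parse(s: String): Fact = {
  val p = s.split("\t")
  Fact(p(0).toInt, p(1).toInt, p(2), p(3).toInt, p(4).toDouble, p(5).toLong)
}

// Drop the keys and serialize each surviving record back to a TSV line.
// s3OutputPath is a placeholder output location.
maxFactById
  .values
  .map(f => Seq(f.col1, f.col2, f.key1, f.col3, f.txn_id, f.dw_last_updated).mkString("\t"))
  .saveAsTextFile(s3OutputPath)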

Looks fine to me, I'm afraid. You could try more partitions, but maybe you just need more machines. Was this ever resolved? Can you elaborate on your answer? How do I get at (record1, record2)? Take a look at the example I added - but don't get used to it :-)
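
As a follow-up to the "try more partitions" suggestion in the comments, here is a minimal sketch of two ways to raise the partition count of the reduce step (the value 2048 is just an illustrative number, not something from the thread):

// reduceByKey also takes an explicit partition count, analogous to the
// groupByKey(512) call in the question; more partitions mean less state per task.
val maxFactById = factById.reduceByKey(Fact.maxByTs, 2048)

// Alternatively, repartition the keyed RDD before reducing.
val maxFactByIdAlt = factById.repartition(2048).reduceByKey(Fact.maxByTs)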