Java: Generating the diff of List[String] in Scalding


I have a
records: TypedPipe[(String, util.List[String])]
in my job, where the first value is an id and the second value is a list of contents. Imagine:

("1", ["a","b","c"])
("1", ["a","b","c"])
("1", ["a","b","c"])
("2", ["a","b"])
("2", ["a","b","c"])
("3", ["a","b","c"])

After records.groupBy(_._1), I only want to output the records that differ from each other for a given id. For the input above, the output should be:

("2", ["a","b"])
("2", ["a","b","c"])

I am new to Scalding. What is an elegant way to achieve this?

I don't know whether this matters to you (is your collection particularly large?), but in plain old Scala I would do:

// Given:
val records = Seq(
  "1" -> List("a", "b", "c"), "1" -> List("a", "b", "c"), "1" -> List("a", "b", "c"),
  "2" -> List("a", "b"), "2" -> List("a", "b", "c"),
  "3" -> List("a", "b", "c"), "3" -> List("d")
)

val distinctValues = records.groupBy(_._1).map { case (k, v) => k -> v.toSet }
// => Map(2 -> Set((2,List(a, b)), (2,List(a, b, c))), 1 -> Set((1,List(a, b, c))), 3 -> Set((3,List(a, b, c)), (3,List(d))))

// Keep only the keys that have more than one distinct record
val havingMultipleDistinct = distinctValues.filter { case (k, v) => v.size > 1 }
// => Map(2 -> Set((2,List(a, b)), (2,List(a, b, c))), 3 -> Set((3,List(a, b, c)), (3,List(d))))

val asRecords = havingMultipleDistinct.values.flatten
// => List((2,List(a, b)), (2,List(a, b, c)), (3,List(a, b, c)), (3,List(d)))
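
Since the question uses util.List[String] rather than a Scala List, here is a minimal self-contained sketch of the same idea in plain Scala with Java lists. The object name DistinctPerKey and the helper differingRecords are made up for illustration. It relies on java.util.List implementations defining equals and hashCode by content, which holds for the lists returned by Arrays.asList.

```scala
import java.util.{Arrays, List => JList}

object DistinctPerKey {
  // Hypothetical helper: keeps only the records whose key has
  // more than one distinct value, with duplicates removed.
  def differingRecords[K, V](records: Seq[(K, V)]): Seq[(K, V)] =
    records
      .groupBy(_._1)          // Map[K, Seq[(K, V)]]
      .values
      .map(_.distinct)        // drop repeated records within each key
      .filter(_.size > 1)     // keep keys with at least two distinct records
      .flatten
      .toSeq

  // Sample input from the question, using java.util.List values.
  val sample: Seq[(String, JList[String])] = Seq(
    "1" -> Arrays.asList("a", "b", "c"),
    "1" -> Arrays.asList("a", "b", "c"),
    "1" -> Arrays.asList("a", "b", "c"),
    "2" -> Arrays.asList("a", "b"),
    "2" -> Arrays.asList("a", "b", "c"),
    "3" -> Arrays.asList("a", "b", "c")
  )

  val result: Seq[(String, JList[String])] = differingRecords(sample)
}
```

For the question's input this leaves only the two records with id "2". The order of keys in the result is not guaranteed, because groupBy returns an unordered Map.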

If the values for each key are small enough to fit in memory, something like this will do:

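records
  .group
  .toSet                                    // distinct values per key
  .filter { case (_, vs) => vs.size > 1 }   // keep keys with more than one distinct value
  .flattenValues                            // back to one (key, value) pair per value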

If it is too large for that, you can join the pipe with itself:

val grouped = records.group
grouped
 .join(grouped)
 .collect { case(k, (a, b)) if a != b => k -> a }
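
To see what the self-join computes, here is a small model of it in plain Scala collections. It does not use the Scalding API: the for-comprehension stands in for the per-key cross product that an inner join of a grouped pipe with itself produces, and the object name SelfJoinModel is made up for illustration. It suggests that the joined output can contain the same record more than once, so a final de-duplication step may be needed.

```scala
object SelfJoinModel {
  val records: Seq[(String, List[String])] = Seq(
    "1" -> List("a", "b", "c"),
    "1" -> List("a", "b", "c"),
    "2" -> List("a", "b"),
    "2" -> List("a", "b", "c"),
    "2" -> List("a", "b", "c"),
    "3" -> List("a", "b", "c")
  )

  // Local stand-in for the self-join: every pair of values sharing a key.
  val joined: Seq[(String, (List[String], List[String]))] =
    for {
      (k1, a) <- records
      (k2, b) <- records
      if k1 == k2
    } yield (k1, (a, b))

  // Same collect as in the answer: keep the left side of each differing pair.
  val differing: Seq[(String, List[String])] =
    joined.collect { case (k, (a, b)) if a != b => k -> a }

  // A record shows up once per differing partner, so de-duplicate at the end.
  val result: Seq[(String, List[String])] = differing.distinct
}
```

In this model, key "2" yields four differing pairs but only two distinct records, which is why the last step removes duplicates.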

Yes, it has to run on a cluster. Scalding is essential