Scala: process a list of string lists and generate a Map["combination", count of lists that use that combination]

list scala dictionary


I have a Seq[List[String]]. For example:

Vector(
["B","D","A","P","F"], 
["B","A","F"], 
["B","D","A","F"], 
["B","D","A","T","F"], 
["B","A","P","F"], 
["B","D","A","P","F"], 
["B","A","F"], 
["B","A","F"], 
["B","A","F"], 
["B","A","F"]
)

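I want to get the counts of the different combinations (such as "A","B") in a Map[String, Int], where the key (String) is the combination of elements and the value (Int) is the number of lists that contain that combination.

If "A", "B" and "F" all appear in all 10 records, then instead of "A" -> 10, "B" -> 10 and "F" -> 10, I would like them consolidated as "A","B","F" -> 10.

Sample result (not all combinations included) for the above Seq[List[String]]: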

Map(
""A","B","F"" -> 10,
""A","B","D"" -> 4,
""A","B","P"" -> 2,
...
...
..
)

I would appreciate any Scala code or solution that produces this output.

The format of your Vector is not valid Scala syntax. I think you meant:

val items = Seq(
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "D", "A", "F"),
  Seq("B", "D", "A", "T", "F"),
  Seq("B", "A", "P", "F"),
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F")
)


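It sounds like what you are trying to accomplish is two groupBy clauses. First, you want to get all combinations from each list and count how often each combination occurs across the collection. Then, for the combinations that occur with the same frequency, you perform another groupBy and merge them together.

To do this, you need a function that performs the double reduction after the double groupBy.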

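Steps:

Collect all the groups from every sequence. Within items, we compute all combinations of the elements in each list, which yields a Seq[Seq[String]] of groups, where each Seq[String] is a unique combination. This is flattened because the (1 to group.length) operation produces a Seq of Seq[Seq[String]]. We then flatten the results for all the lists in the vector together, which gives a single Seq[Seq[String]].

The groupMapReduce function counts how often a combination occurs: each occurrence of a combination is mapped to the value 1, and those values are summed. This gives the frequency of each combination.

These groups are grouped again, this time by the number of occurrences. So if "A" and "B" both occur 10 times, they are grouped together.

The final map reduces the accumulated groups.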
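The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the answerer's original function: the names `CombinationCounts`, `combinationCounts`, `groupedByCount` and the `sizes` parameter are hypothetical, each list is de-duplicated and sorted first so that order does not matter, and Scala 2.13 or later is assumed for `groupMapReduce` and `groupMap`.

```scala
object CombinationCounts {
  // Sample data from the question, written as valid Scala.
  val items: Seq[Seq[String]] = Seq(
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "D", "A", "F"),
    Seq("B", "D", "A", "T", "F"),
    Seq("B", "A", "P", "F"),
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F")
  )

  // Steps 1 and 2: generate every combination of the requested sizes for each
  // list, then count in how many lists each combination occurs.
  // `sizes` is an assumed parameter standing in for (1 to group.length).
  def combinationCounts(data: Seq[Seq[String]], sizes: Range): Map[Seq[String], Int] =
    data
      .flatMap { group =>
        val elems = group.distinct.sorted // order-insensitive, no repeats
        sizes.flatMap(n => elems.combinations(n))
      }
      .groupMapReduce(identity)(_ => 1)(_ + _)

  // Step 3: group the combinations again, this time by their count.
  def groupedByCount(counts: Map[Seq[String], Int]): Map[Int, Seq[Seq[String]]] =
    counts.toSeq.groupMap(_._2)(_._1)
}
```

For example, `CombinationCounts.groupedByCount(CombinationCounts.combinationCounts(CombinationCounts.items, 3 to 3))` groups only the three-element combinations by how many lists contain them.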

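I defined this double-reduce function as follows. It converts a group such as Seq("A", "B") into "A","B", and then, if Seq("A", "B") has the same count as another group Seq("C"), the groups are joined together as "A","B","C".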
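A minimal sketch of such a consolidation step is shown here. It is an illustration under assumptions rather than the original function: the names `Consolidation`, `quote` and `consolidate` are hypothetical, the input is taken to be a map from combination to count, and groups that share a count are merged by taking the sorted union of their elements. Scala 2.13 or later is assumed for `groupMap`.

```scala
object Consolidation {
  // Render Seq("A", "B") as the string "A","B" (with the quotes and comma).
  def quote(group: Seq[String]): String =
    group.map(e => "\"" + e + "\"").mkString(",")

  // Merge all groups that share the same count into a single key.
  def consolidate(counts: Map[Seq[String], Int]): Map[String, Int] =
    counts.toSeq
      .groupMap(_._2)(_._1) // count -> all groups with that count
      .map { case (count, groups) =>
        quote(groups.flatten.distinct.sorted) -> count
      }
}
```

With the input `Map(Seq("A", "B") -> 10, Seq("C") -> 10, Seq("D") -> 4)`, the two groups with count 10 are joined under the key `"A","B","C"`, and `"D"` keeps its own entry with count 4.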
This filter can be adjusted for specific groups of interest in the (1 to group.length) clause. If you restrict it to (3 to 3), the groups will be:
List(List(B, D, P), List(A, D, P), List(D, F, P)): 2
List(List(A, B, F)): 10
List(List(B, D, F), List(A, D, F), List(A, B, D)): 4
List(List(A, F, P), List(B, F, P), List(A, B, P)): 3
List(List(B, D, T), List(A, F, T), List(B, F, T), List(A, D, T), List(A, B, T), List(D, F, T)): 1

As you can see in your example, `List(B, D, F)` and `List(A, D, F)` are also associated with your second line "A,B,D".

Assuming that lists containing the same elements in a different order count as one group (for example, BAF and ABF fall into the same group), the solution is:

// Define the data
val a = Seq(
  List("B","D","A","P","F"),
  List("B","A","F"),
  List("B","D","A","F"),
  List("B","D","A","T","F"),
  List("B","A","P","F"),
  List("B","D","A","P","F"),
  List("B","A","F"),
  List("B","A","F"),
  List("B","A","F"),
  List("A","B","F")
)

// Sort each list so that B,A,F is counted as the same combination as A,B,F
// (and likewise for any other lists that differ only in element order)
val b = a.map(_.sorted)

// Group by identity, then take the size of each group as its count
b.groupBy(identity).collect { case (x, y) => (x, y.length) }

The output is as follows:
res1: scala.collection.immutable.Map[List[String],Int] = HashMap(List(A, B, F, P) -> 1, List(A, B, D, F) -> 1, List(A, B, D, F, T) -> 1, List(A, B, D, F, P) -> 2, List(A, B, F) -> 5)
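As a small illustration of what `groupBy(identity)` does, the sketch below buckets equal lists together and then turns each bucket into a count. The object name `GroupByIdentityDemo` and the three sample lists are made up for this example; the `groupMapReduce` variant assumes Scala 2.13 or later.

```scala
object GroupByIdentityDemo {
  // Hypothetical, already-sorted sample lists.
  val lists: Seq[List[String]] = Seq(
    List("A", "B", "F"),
    List("A", "B", "P"),
    List("A", "B", "F")
  )

  // groupBy(identity) uses each element as its own key, so equal lists
  // end up in the same bucket.
  val grouped: Map[List[String], Seq[List[String]]] = lists.groupBy(identity)

  // The size of each bucket is the number of occurrences of that list.
  val counts: Map[List[String], Int] =
    grouped.map { case (key, bucket) => key -> bucket.length }

  // The same counts in one pass: map every element to 1 and sum per key.
  val countsOnePass: Map[List[String], Int] =
    lists.groupMapReduce(identity)(_ => 1)(_ + _)
}
```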


To learn more about how Scala's groupBy(identity) works, you can go here:
scala> def count(seq: Seq[Seq[String]]): Map[Seq[String], Int] =
     |   seq.flatMap(_.toSet.subsets.filter(_.nonEmpty)).groupMapReduce(identity)(_ => 1)(_ + _)
     |      .toSeq.sortBy(-_._1.size).foldLeft(Map.empty[Set[String], Int]){ case (r, (p, i)) =>
     |        if(r.exists{ (q, j) => i == j && p.subsetOf(q)}) r else r.updated(p, i)
     |      }.map{ case(k, v) => (k.toSeq, v) }
def count(seq: Seq[Seq[String]]): Map[Seq[String], Int]

scala> count(Seq(
     |   Seq("B", "D", "A", "P", "F"),
     |   Seq("B", "A", "F"),
     |   Seq("B", "D", "A", "F"),
     |   Seq("B", "D", "A", "T", "F"),
     |   Seq("B", "A", "P", "F"),
     |   Seq("B", "D", "A", "P", "F"),
     |   Seq("B", "A", "F"),
     |   Seq("B", "A", "F"),
     |   Seq("B", "A", "F"),
     |   Seq("B", "A", "F")
     | ))
val res1: Map[Seq[String], Int] = 
  HashMap(List(F, A, B) -> 10, 
          List(F, A, B, P, D) -> 2, 
          List(T, F, A, B, D) -> 1, 
          List(F, A, B, D) -> 4, 
          List(F, A, B, P) -> 3)

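As you can see, "A,B,D" and "A,B,P" are dropped from the result, because they are subsets of "ABDF" and "ABPF" respectively, which have the same counts...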
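The subset pruning can be read more easily in a commented form. The following is a sketch of the same idea written with explicit pattern matches so that it also compiles on Scala 2.13; the names `MaximalCombinations`, `items` and `count` are chosen for illustration, and the result keeps `Set[String]` keys instead of converting them back to `Seq`.

```scala
object MaximalCombinations {
  // Sample data from the question.
  val items: Seq[Seq[String]] = Seq(
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "D", "A", "F"),
    Seq("B", "D", "A", "T", "F"),
    Seq("B", "A", "P", "F"),
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F")
  )

  def count(data: Seq[Seq[String]]): Map[Set[String], Int] = {
    // For every non-empty subset of each list, count how many lists contain it.
    val counts: Map[Set[String], Int] =
      data
        .flatMap(group => group.toSet.subsets().filter(_.nonEmpty))
        .groupMapReduce(identity)(_ => 1)(_ + _)

    // Visit larger subsets first. A subset is kept only if no already-kept
    // superset has exactly the same count.
    counts.toSeq
      .sortBy { case (subset, _) => -subset.size }
      .foldLeft(Map.empty[Set[String], Int]) { case (kept, (subset, n)) =>
        val covered = kept.exists { case (bigger, m) => m == n && subset.subsetOf(bigger) }
        if (covered) kept else kept.updated(subset, n)
      }
  }
}
```

Here {A, B, D} has count 4 and is contained in {A, B, D, F}, which also has count 4, so only the larger set is kept.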
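Must every key be a concatenation of 3 strings? Are you not interested in combinations of 2 or 4 strings? Why include all the quotation marks and commas in each key? Why not "ABF" or "A,B,F"?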
If you say Seq[List[String]], the example data should be: Seq(List("B","D","A","P","F"), List("B","A","F"), List("B","D","A","F"), List("B","A","F"), List("B","D","A","F"), List("B","D","A","F"), List("B","A","P","F"), List("B","D","A","P","F"), List("B","D","A","P","F"), List("B","A","F"), List("B","A","F"), List("B","A","F"), List("B","A","F"))