Scala: process a list of string lists and produce a Map["combination", count of lists using that combination]
I have a Seq[List[String]]. For example:
Vector(
["B","D","A","P","F"],
["B","A","F"],
["B","D","A","F"],
["B","D","A","T","F"],
["B","A","P","F"],
["B","D","A","P","F"],
["B","A","F"],
["B","A","F"],
["B","A","F"],
["B","A","F"]
)
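I want to get the counts of the different combinations (such as "A","B") in a Map[String,Int], where the key (String) is the combination of elements and the value (Int) is the number of lists that contain that combination. If "A" and "B" and "F" appear in all 10 records, then instead of "A",10 and "B",10 and "C",10 I want them consolidated into "A","B","F" 10.
Sample result (not all combinations are included) for the above Seq[List[String]]: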
Map(
""A","B","F"" -> 10,
""A","B","D"" -> 4,
""A","B","P"" -> 2,
...
...
..
)
I would appreciate any Scala code/solution that produces this output.

The format of the Vector is not valid Scala syntax. I think you mean:
val items = Seq(
Seq("B", "D", "A", "P", "F"),
Seq("B", "A", "F"),
Seq("B", "D", "A", "F"),
Seq("B", "D", "A", "T", "F"),
Seq("B", "A", "P", "F"),
Seq("B", "D", "A", "P", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F")
)
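It sounds like what you are trying to accomplish is two groupBy clauses. First, you want to get all the combinations from each list and find how often each combination occurs across the collection. Then, for the groups that occur with the same frequency, you perform another group by and merge them together.
To do this, you need a function that performs the two reductions after the two groupBy passes.
Steps:
1. The combinations are collected into a Seq[Seq[String]], where each Seq[String] is a unique combination. The result is flattened because the (1 to group.length) operation produces a Seq of Seq[Seq[String]]. Flattening the results for all lists in the vector together must give you a single Seq[Seq[String]].
2. The groupMapReduce function is used to count how often a combination occurs: each combination is given the value 1, and the values are summed. This gives the frequency with which a combination occurs.
3. The combinations are grouped again by that frequency. A group such as Seq("A", "B") is converted to "A","B", and if Seq("A", "B") has the same count as another group Seq("C"), the groups are joined together as "A","B","C".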
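The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the answerer's original function: the names `CombinationCounts`, `combinationCounts` and `mergedByCount` are hypothetical, each list is de-duplicated and sorted so that element order does not matter, and `groupMapReduce`/`groupMap` require Scala 2.13 or later.

```scala
object CombinationCounts {
  // Sample data from the question, written in valid Scala syntax.
  val items: Seq[Seq[String]] = Seq(
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "D", "A", "F"),
    Seq("B", "D", "A", "T", "F"),
    Seq("B", "A", "P", "F"),
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F")
  )

  // Steps 1 and 2: every combination of every size from each list,
  // flattened into one Seq[Seq[String]] and counted with groupMapReduce.
  def combinationCounts(items: Seq[Seq[String]]): Map[Seq[String], Int] =
    items
      .flatMap { group =>
        val sorted: Seq[String] = group.distinct.sorted
        (1 to sorted.length).flatMap(n => sorted.combinations(n))
      }
      .groupMapReduce(identity)(_ => 1)(_ + _)

  // Step 3: group the combinations by their count and join the elements
  // of all combinations sharing a count into one comma-separated key.
  // Assumption: combinations with equal counts belong together.
  def mergedByCount(items: Seq[Seq[String]]): Map[String, Int] =
    combinationCounts(items).toSeq
      .groupMap(_._2)(_._1) // count -> combinations with that count
      .map { case (count, combos) =>
        combos.flatten.distinct.sorted.mkString(",") -> count
      }
}
```

Note that merging purely by count would also join unrelated combinations that merely happen to share a frequency, so this fits the sample data but may need refinement for other inputs.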
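This filter can be adjusted in the (1 to group.length) clause to target specific groups of interest. If it is restricted to 3 to 3 (combinations of exactly three elements), the groups are: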
List(List(B, D, P), List(A, D, P), List(D, F, P)): 2
List(List(A, B, F)): 10
List(List(B, D, F), List(A, D, F), List(A, B, D)): 4
List(List(A, F, P), List(B, F, P), List(A, B, P)): 3
List(List(B, D, T), List(A, F, T), List(B, F, T), List(A, D, T), List(A, B, T), List(D, F, T)): 1
As you can see in your example, `List(B, D, F)` and `List(A, D, F)` are also associated with your second line "A,B,D".
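A grouping like the one listed above, restricted to combinations of a fixed size, could be computed along these lines. This is only a sketch: `FixedSizeGroups` and `groupsOfSize` are hypothetical names, and Scala 2.13 or later is assumed for `groupMapReduce` and `groupMap`.

```scala
object FixedSizeGroups {
  // Sample data from the question, written in valid Scala syntax.
  val items: Seq[Seq[String]] = Seq(
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "D", "A", "F"),
    Seq("B", "D", "A", "T", "F"),
    Seq("B", "A", "P", "F"),
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F")
  )

  // Maps each count to all n-element combinations that occur in
  // exactly that many lists.
  def groupsOfSize(items: Seq[Seq[String]], n: Int): Map[Int, Seq[Seq[String]]] =
    items
      .flatMap(_.distinct.sorted.combinations(n)) // only combinations of size n
      .groupMapReduce(identity)(_ => 1)(_ + _)    // combination -> count
      .toSeq
      .groupMap(_._2)(_._1)                       // count -> combinations
}
```

Calling `FixedSizeGroups.groupsOfSize(FixedSizeGroups.items, 3)` yields one entry per count; the order of the combinations inside each entry is not guaranteed.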
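Assuming that lists containing the same data in a different order count as one group (for example, BAF and ABF fall into the same group), the solution is: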
//define the data
val a = Seq(
List("B","D","A","P","F"),
List("B","A","F"),
List("B","D","A","F"),
List("B","D","A","T","F"),
List("B","A","P","F"),
List("B","D","A","P","F"),
List("B","A","F"),
List("B","A","F"),
List("B","A","F"),
List("A","B","F")
)
//sort each list so that B,A,F is counted the same as A,B,F,
//and likewise for any other lists that differ only in element order
val b = a.map(_.sorted)
//group by identity, and then count the length
b.groupBy(identity).collect{case (x, y) => (x, y.length)}
The output is as follows:
res1: scala.collection.immutable.Map[List[String],Int] = HashMap(List(A, B, F, P) -> 1, List(A, B, D, F) -> 1, List(A, B, D, F, T) -> 1, List(A, B, D, F, P) -> 2, List(A, B, F) -> 5)
To learn more about how Scala's groupBy with identity works, you can go here:
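As a small illustration of what groupBy(identity) does, here is a sketch on a toy list. The object name `GroupByIdentityDemo` and the sample values are made up for this example; `groupMapReduce` assumes Scala 2.13 or later.

```scala
object GroupByIdentityDemo {
  val letters: Seq[String] = Seq("a", "b", "a", "c", "a")

  // groupBy(identity) uses each element as its own key, so equal
  // elements end up in the same bucket.
  val grouped: Map[String, Seq[String]] = letters.groupBy(identity)

  // Counting occurrences is then just the size of each bucket.
  val counts: Map[String, Int] =
    grouped.map { case (key, bucket) => key -> bucket.size }

  // The same counts in a single pass, without building the buckets.
  val countsOnePass: Map[String, Int] =
    letters.groupMapReduce(identity)(_ => 1)(_ + _)
}
```

Both `counts` and `countsOnePass` map "a" to 3 and "b" and "c" to 1.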
scala> def count(seq: Seq[Seq[String]]): Map[Seq[String], Int] =
| seq.flatMap(_.toSet.subsets.filter(_.nonEmpty)).groupMapReduce(identity)(_ => 1)(_ + _)
| .toSeq.sortBy(-_._1.size).foldLeft(Map.empty[Set[String], Int]){ case (r, (p, i)) =>
| if(r.exists{ (q, j) => i == j && p.subsetOf(q)}) r else r.updated(p, i)
| }.map{ case(k, v) => (k.toSeq, v) }
def count(seq: Seq[Seq[String]]): Map[Seq[String], Int]
scala> count(Seq(
| Seq("B", "D", "A", "P", "F"),
| Seq("B", "A", "F"),
| Seq("B", "D", "A", "F"),
| Seq("B", "D", "A", "T", "F"),
| Seq("B", "A", "P", "F"),
| Seq("B", "D", "A", "P", "F"),
| Seq("B", "A", "F"),
| Seq("B", "A", "F"),
| Seq("B", "A", "F"),
| Seq("B", "A", "F")
| ))
val res1: Map[Seq[String], Int] =
HashMap(List(F, A, B) -> 10,
List(F, A, B, P, D) -> 2,
List(T, F, A, B, D) -> 1,
List(F, A, B, D) -> 4,
List(F, A, B, P) -> 3)
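As you can see, "A,B,D" and "A,B,P" are dropped from the result because they are subsets of "ABDF" and "ABPDF".

Does every key have to be a concatenation of 3 strings? Are you not interested in combinations of 2 or 4 strings?
Why include all the quotes and commas in each key? Why not "ABF" or "A,B,F"?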
If you say Seq[List[String]], the data example should be: Seq(List("B","D","A","P","F"), List("B","A","F"), List("B","D","A","F"), List("B","A","F"), List("B","D","A","F"), List("B","D","A","F"), List("B","A","P","F"), List("B","D","A","P","F"), List("B","D","A","P","F"), List("B","A","F"), List("B","A","F"), List("B","A","F"), List("B","A","F"))
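On the key format raised in the comments: the answers above return keys as sequences, while the question asks for a Map[String,Int]. A minimal sketch of the conversion follows, assuming the elements should be sorted and joined with commas. The names `KeyFormatting`, `toPlainKeys` and `toQuotedKeys` are hypothetical.

```scala
object KeyFormatting {
  // Seq("F", "A", "B") -> 10 becomes "A,B,F" -> 10.
  // Note: keys that contain the same elements in a different order
  // collapse into one entry, and only one of their values survives.
  def toPlainKeys(counts: Map[Seq[String], Int]): Map[String, Int] =
    counts.map { case (combo, n) => combo.sorted.mkString(",") -> n }

  // Variant that keeps the quotes shown in the question: "A","B","F".
  def toQuotedKeys(counts: Map[Seq[String], Int]): Map[String, Int] =
    counts.map { case (combo, n) =>
      combo.sorted.map(s => "\"" + s + "\"").mkString(",") -> n
    }
}
```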