Scala: process a list of string lists and generate a Map["combination", count of lists using that combination]

I have a Seq[List[String]]. For example:

Vector(
["B","D","A","P","F"], 
["B","A","F"], 
["B","D","A","F"], 
["B","D","A","T","F"], 
["B","A","P","F"], 
["B","D","A","P","F"], 
["B","A","F"], 
["B","A","F"], 
["B","A","F"], 
["B","A","F"]
)
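I want to get the counts of the different combinations (such as "A", "B") in a Map[String, Int], where the key (String) is the combination of elements and the value (Int) is the number of lists that contain that combination.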

If "A", "B" and "F" each appear in all 10 records, then instead of getting "A" -> 10, "B" -> 10 and "F" -> 10 separately, I want them consolidated into "A","B","F" -> 10.

Sample result (not all combinations are included) for the Seq[List[String]] above:

Map(
""A","B","F"" -> 10,
""A","B","D"" -> 4,
""A","B","P"" -> 2,
...
...
..
)

I would appreciate any Scala code or solution that produces this output.

The Vector is not written in valid Scala syntax. I think you mean:

val items = Seq(
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "D", "A", "F"),
  Seq("B", "D", "A", "T", "F"),
  Seq("B", "A", "P", "F"),
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F")
)

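It sounds like what you are trying to accomplish is two groupBy clauses. First, you want to take all combinations from each list and count how often each combination occurs across the collection. Then, for the groups that occur with the same frequency, you perform another group by and merge them together.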

To do this, you need a function that performs a double reduction after the double groupBy.

Steps:

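  • Collect all sequences of groups. For each list in items, we compute all combinations of its elements, which yields a Seq[Seq[String]] of groups, where each Seq[String] is one unique combination. This has to be flattened, because the (1 to group.length) operation produces a Seq of Seq[Seq[String]]. We then flatten the results for all the lists in the vector together, so you end up with a single Seq[Seq[String]].
  • The groupMapReduce function counts how often a combination occurs: every occurrence of a combination is assigned the value 1, and these values are summed. This gives the frequency of each combination.
  • These groups are grouped again, this time by number of occurrences. So if "A" and "B" both occur 10 times, they end up in the same group.
  • The final map reduces the groups accumulated for each count.
  • I defined the double-reduction function as follows. It converts a group such as Seq("A", "B") into "A","B", and if Seq("A", "B") has the same count as another group Seq("C"), the groups are joined together as "A","B","C".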
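The steps above can be sketched as follows. This is a minimal sketch, not the answer's original function: the names `CombinationCountSketch`, `combinationCounts`, `groupByCount` and `consolidate` are hypothetical, the `sizes` parameter stands in for the `(1 to group.length)` clause, and joining the merged group with plain commas is an assumption about the desired key format.

```scala
object CombinationCountSketch {
  // Same data as `items` in the answer.
  val items: Seq[Seq[String]] = Seq(
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "D", "A", "F"),
    Seq("B", "D", "A", "T", "F"),
    Seq("B", "A", "P", "F"),
    Seq("B", "D", "A", "P", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F"),
    Seq("B", "A", "F")
  )

  // Steps 1 and 2: build every combination of the requested sizes for each
  // list, flatten them into one sequence, and count the occurrences.
  def combinationCounts(lists: Seq[Seq[String]], sizes: Range): Map[Seq[String], Int] =
    lists
      .flatMap { group =>
        // Sorting gives a canonical order, so B,A,F and A,B,F are one combination.
        val elems = group.distinct.sorted
        sizes.flatMap(n => elems.combinations(n))
      }
      .groupMapReduce(identity)(_ => 1)(_ + _)

  // Step 3: regroup the combinations by how often they occur.
  def groupByCount(counts: Map[Seq[String], Int]): Map[Int, Seq[Seq[String]]] =
    counts.toSeq.groupMap(_._2)(_._1)

  // Steps 4 and 5: merge all combinations sharing a count into a single key.
  // Assumption: the key is the sorted union of the elements, comma-separated.
  // Two counts could in principle yield the same key; this sketch ignores that.
  def consolidate(byCount: Map[Int, Seq[Seq[String]]]): Map[String, Int] =
    byCount.map { case (n, combos) =>
      combos.flatten.distinct.sorted.mkString(",") -> n
    }
}
```

For example, `CombinationCountSketch.consolidate(CombinationCountSketch.groupByCount(CombinationCountSketch.combinationCounts(CombinationCountSketch.items, 1 to 3)))` runs the whole pipeline for combinations of one to three elements.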

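    This filter can be adjusted in the (1 to group.length) clause to target the specific group sizes you are interested in. If you restrict it to 3 to 3, the groups will be: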

    List(List(B, D, P), List(A, D, P), List(D, F, P)): 2
    List(List(A, B, F)): 10
    List(List(B, D, F), List(A, D, F), List(A, B, D)): 4
    List(List(A, F, P), List(B, F, P), List(A, B, P)): 3
    List(List(B, D, T), List(A, F, T), List(B, F, T), List(A, D, T), List(A, B, T), List(D, F, T)): 1
    
    As you can see in your example, `List(B, D, F)` and `List(A, D, F)` are also associated with your second line "A,B,D".
    

    Assuming that data in a different order counts as the same group (for example, BAF and ABF fall into one group), the solution is:

              //define the data
              val a = Seq(
                List("B","D","A","P","F"),
                List("B","A","F"),
                List("B","D","A","F"),
                List("B","D","A","T","F"),
                List("B","A","P","F"),
                List("B","D","A","P","F"),
                List("B","A","F"),
                List("B","A","F"),
                List("B","A","F"),
                List("A","B","F")
              )
    
              // sort each list so that B,A,F is counted the same as A,B,F,
              // and likewise for any other lists that differ only in order
              val b = a.map(_.sorted)
    
              // group identical lists together, then count the size of each group
              b.groupBy(identity).collect{case (x, y) => (x, y.length)}
    
    The output is as follows:

    res1: scala.collection.immutable.Map[List[String],Int] = HashMap(List(A, B, F, P) -> 1, List(A, B, D, F) -> 1, List(A, B, D, F, T) -> 1, List(A, B, D, F, P) -> 2, List(A, B, F) -> 5)
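
As a small illustration of why grouping by identity produces counts, here is a sketch on toy data. The object name `GroupByIdentitySketch` and the sample values are made up for this example. `groupBy(identity)` uses each element as its own key, so equal elements land in the same bucket, and the bucket size is the count. `groupMapReduce` (Scala 2.13 and later) does the same in one pass.

```scala
object GroupByIdentitySketch {
  // Toy data: two distinct lists, one of them repeated.
  val lists: Seq[List[String]] = Seq(
    List("A", "B", "F"),
    List("A", "B", "D", "F"),
    List("A", "B", "F")
  )

  // Each element is its own key, so equal lists share a bucket.
  val grouped: Map[List[String], Seq[List[String]]] = lists.groupBy(identity)

  // The size of each bucket is the number of occurrences of that list.
  val counts: Map[List[String], Int] =
    grouped.map { case (key, bucket) => key -> bucket.length }

  // Equivalent single-pass form: map every element to 1 and sum per key.
  val countsAlt: Map[List[String], Int] =
    lists.groupMapReduce(identity)(_ => 1)(_ + _)
}
```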
    
    
    To learn more about how groupBy with identity works in Scala, you can go here:

    scala> def count(seq: Seq[Seq[String]]): Map[Seq[String], Int] =
         |   seq.flatMap(_.toSet.subsets.filter(_.nonEmpty)).groupMapReduce(identity)(_ => 1)(_ + _)
         |      .toSeq.sortBy(-_._1.size).foldLeft(Map.empty[Set[String], Int]){ case (r, (p, i)) =>
         |        if(r.exists{ (q, j) => i == j && p.subsetOf(q)}) r else r.updated(p, i)
         |      }.map{ case(k, v) => (k.toSeq, v) }
    def count(seq: Seq[Seq[String]]): Map[Seq[String], Int]
    
    scala> count(Seq(
         |   Seq("B", "D", "A", "P", "F"),
         |   Seq("B", "A", "F"),
         |   Seq("B", "D", "A", "F"),
         |   Seq("B", "D", "A", "T", "F"),
         |   Seq("B", "A", "P", "F"),
         |   Seq("B", "D", "A", "P", "F"),
         |   Seq("B", "A", "F"),
         |   Seq("B", "A", "F"),
         |   Seq("B", "A", "F"),
         |   Seq("B", "A", "F")
         | ))
    val res1: Map[Seq[String], Int] = 
      HashMap(List(F, A, B) -> 10, 
              List(F, A, B, P, D) -> 2, 
              List(T, F, A, B, D) -> 1, 
              List(F, A, B, D) -> 4, 
              List(F, A, B, P) -> 3)
    

    As you can see, "A,B,D" and "A,B,P" are dropped from the result, because they are subsets of "ABDF" and "ABPDF"...
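
The same idea can be spelled out in separate steps, which may be easier to follow than the one-liner. This is a sketch under assumptions: the names `MaximalCombinationSketch` and `maximalCombinations` are hypothetical, and it keeps `Set[String]` keys instead of converting back to `Seq`. A combination is dropped when a strictly larger combination has exactly the same count. Note that enumerating all subsets grows exponentially with the list length, so this only suits short lists.

```scala
object MaximalCombinationSketch {
  def maximalCombinations(lists: Seq[Seq[String]]): Map[Set[String], Int] = {
    // Treat every list as a set, so element order does not matter.
    val asSets: Seq[Set[String]] = lists.map(_.toSet)

    // Count in how many lists each non-empty subset occurs.
    val counts: Map[Set[String], Int] =
      asSets
        .flatMap(_.subsets().filter(_.nonEmpty))
        .groupMapReduce(identity)(_ => 1)(_ + _)

    // Keep a combination only if no strictly larger one has the same count.
    counts.filter { case (combo, n) =>
      !counts.exists { case (other, m) =>
        m == n && other != combo && combo.subsetOf(other)
      }
    }
  }
}
```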

    Does every key have to be a concatenation of exactly 3 strings? Are combinations of 2 or 4 strings of no interest? Why include all the quotes and commas in each key? Why not "ABF" or "A,B,F"?

    If you say Seq[List[String]], the example data should be: Seq(List("B","D","A","P","F"), List("B","A","F"), List("B","D","A","F"), List("B","A","F"), List("B","D","A","F"), List("B","D","A","F"), List("B","A","P","F"), List("B","D","A","P","F"), List("B","D","A","P","F"), List("B","A","F"), List("B","A","F"), List("B","A","F"), List("B","A","F"))