Java: how to find the most frequent word among a large number of words (e.g. 900000)


My task is to generate 900000 random words and then print the most frequent one. Here is my algorithm:

1. Move all the words into a collection instead of printing them out.
2. for (900000...) { store the frequency of Collection[i] in another collection B }
** 900000 * 900000 operations is too much for my computer (far too inefficient)
3. Find the largest number in collection B and its index.
4. The word at that index in the first collection is the output.
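The problem is that my computer cannot get through step 2. I searched this site and found some answers about finding word frequencies in a collection of words. I read the code in those answers, but I have not found a way to apply it to such a large number of words.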

Here is my code:

/** Funny Words Generator
  * Tony
  */

import java.util.*;

public class WordsGenerator {

  //data field (can be accessed in whole class):
  private static int xC; // define a xCurrent so we can access it all over the class
  private static int n;
  private static String[] consonants = {"b","c","d","f","g","h","j","k","l","m","n","p","r","s","t","v","w","x","z"};
  private static String[] vowels = {"a", "e", "i", "o", "u"};
  private static String funnyWords = "";



  public static void main(String[] args) {

    Scanner sc = new Scanner(System.in);
    int times = 900000; // words number
    xC = sc.nextInt(); // seeds (only input)

    /* Funny word list */
    ArrayList<String> wordsList = new ArrayList<String>();
    ArrayList<Integer> frequencies = new ArrayList<Integer>();
    int maxFreq;
    for (int i = 0; i < times; i++) {
      n = 6; // each words are 6 characters long
      funnyWords = ""; // reset the funnyWords each new time
      for (int d = 0; d < n; d ++) {

        int letterNum = randomGenerator(); /* random generator will generate numbers based on current x */
        int letterIndex = 0; /* letterNum % 19 or % 5 based on condition */

        if ((d + 1) % 2 == 0) {
          letterIndex = letterNum % 5;
          funnyWords += vowels[letterIndex];
        }

        else if ((d + 1) % 2 != 0) {
          letterIndex = letterNum % 19;
          funnyWords += consonants[letterIndex];
        }
      }
      wordsList.add(funnyWords);
    }


    /* put all frequencies of each words into an array called frequencies */
    for (int i = 0; i < 900000; i++) {
      frequencies.add(Collections.frequency(wordsList, wordsList.get(i)));
    }



    maxFreq = Collections.max(frequencies);
    int index = frequencies.indexOf(maxFreq); // get the index of the most frequent word
    System.out.print(wordsList.get(index));


    sc.close();
  }

  /** randomGenerator
    * param: N(generate times), seeds
    * return: update the xC and return it */
  private static int randomGenerator() {
    int a = 445;
    int c = 700001;
    int m = 2097152;
    xC = (a * xC + c) % m; // update
    return xC; // return
  }

}
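So I realized there may be a way to skip step 2. Could anyone give me a hint? Just a hint rather than code, so that I can try it myself, would be great. Thanks!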

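Edit:
I see that many of your answers contain words.stream(). I searched on Google but could not find it. Could you tell me where I can learn about this? Which class does this stream method belong to? Thank you all!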

You can use a HashMap that stores each word as a key, with the value being the number of times that word has occurred.

A sketch, where strs holds the generated words:

String demo(String[] strs) {
   int maxFrequency = 0;          // highest count seen so far
   String maxFrequencyStr = "";   // word that has that count
   Map<String, Integer> map = new HashMap<String, Integer>();
   for (int i = 0; i < strs.length; i++) {
      if (map.containsKey(strs[i])) {
          // word seen before: increment its count
          int times = map.get(strs[i]);
          map.put(strs[i], times + 1);
          if (maxFrequency < times + 1) {
              maxFrequency = times + 1;
              maxFrequencyStr = strs[i];
          }
      }
      else {
          // first occurrence of this word
          map.put(strs[i], 1);
          if (maxFrequency < 1) {
              maxFrequency = 1;
              maxFrequencyStr = strs[i];
          }
      }
   }
   return maxFrequencyStr;
 }
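
The counting loop above can also be written more compactly with Map.merge (Java 8 or later), which inserts 1 for a new key or adds 1 to the existing value, and returns the updated count. A minimal sketch, assuming the words are already in a list; the class name MergeCount and the method name mostFrequent are made up for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeCount {

    // Returns the most frequent word, or null for an empty list.
    // On ties, the word that first reaches the highest count wins.
    static String mostFrequent(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String word : words) {
            // merge() stores 1 for a new key, otherwise adds 1 to the old
            // value; either way it returns the updated count.
            int count = counts.merge(word, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("bab", "cec", "bab", "dod", "bab");
        System.out.println(mostFrequent(words)); // prints "bab"
    }
}
```

Each word costs one hash lookup, so 900000 words need about 900000 map operations rather than 900000 * 900000 comparisons.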
This can basically be broken down into two steps:

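  • Compute the word frequencies as a Map from each word to its count. There are several ways to do this.
  • Compute the maximum entry of this map, where "maximum" means the entry with the highest value.

If you are really up for it, you can write this very compactly: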

    private static <T> T maxCountElement(List<? extends T> list)
    {
        return Collections.max(list.stream().collect(Collectors.groupingBy(
            Function.identity(), Collectors.counting())).entrySet(), 
                (e0, e1) -> Long.compare(e0.getValue(), e1.getValue())).getKey();
    }
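
For readers not yet comfortable with streams, the same two steps can be written as plain loops. This is only a sketch: the class name TwoStepCount and the method names countWords and maxEntry are illustrative, not part of the answer above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStepCount {

    // Step 1: map each distinct word to the number of times it occurs.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.put(word, counts.getOrDefault(word, 0) + 1);
        }
        return counts;
    }

    // Step 2: scan the entries once and keep the one with the highest value.
    // Returns null if the map is empty.
    static Map.Entry<String, Integer> maxEntry(Map<String, Integer> counts) {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (best == null || entry.getValue() > best.getValue()) {
                best = entry;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("bab", "cec", "bab", "dod", "cec", "bab");
        Map.Entry<String, Integer> top = maxEntry(countWords(words));
        System.out.println(top.getKey() + " occurs " + top.getValue() + " times");
    }
}
```

Both steps are a single pass: one over the word list, one over the distinct words.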
    

You can do this with Java lambdas (requires JDK 8). Also note that several words in the list can share the same highest frequency.

    import java.util.*;
    import java.util.stream.Collectors;

    public class Main {
        public static void main(String[] args) {
    
            List<String> words = new ArrayList<>();
    
            words.add("World");
            words.add("Hello");
            words.add("World");
            words.add("Hello");
    
            // Imagine we have 900000 words in the word list
            Set<Map.Entry<String, Integer>> set = words.stream()
                    // Build a map from each unique word to its frequency
                    .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum)).entrySet();
    
            // Find the max frequency
            int max = Collections
                    .max(set, (a, b) -> Integer.compare(a.getValue(), b.getValue())).getValue();
    
            // We can have words with the same frequency like in my words list. Let's get them all
            List<String> list = set.stream()
                    .filter(entry -> entry.getValue() == max)
                    .map(Map.Entry::getKey).collect(Collectors.toList());
    
            System.out.println(list); // [Hello, World]
    
    
        }
    }
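
On where stream() comes from: it is a default method of java.util.Collection, added in Java 8, so every List and Set has it. The Stream type and the Collectors helper class both live in the java.util.stream package. A minimal sketch of the same frequency count using Collectors.groupingBy and Map.Entry.comparingByValue; the class name StreamCount and the method name mostFrequent are illustrative:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StreamCount {

    // Returns the most frequent word, or an empty Optional for an empty list.
    static Optional<String> mostFrequent(List<String> words) {
        // Group equal words together and count the size of each group.
        Map<String, Long> counts = words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Pick the entry with the largest count and keep only its key.
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Hello", "World", "World");
        System.out.println(mostFrequent(words).orElse("(no words)")); // prints "World"
    }
}
```

Unlike the example above, this sketch returns only one word when several share the highest count.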
    
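A HashMap is one of the fastest data structures for this. Just loop over every word and use it as a key in the HashMap; inside the loop, keep a counter as the value for that key.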

    HashMap<String, Integer> hashMapVariable = new HashMap<>();
    ...
    // inside the loop over the words
    if (hashMapVariable.containsKey(word)) {
       hashMapVariable.put(word, hashMapVariable.get(word) + 1);
    } else {
       hashMapVariable.put(word, 1);
    }
    ...
    
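For each key (word), simply increment the value associated with that key. You do have to check whether the key already exists (in Java, hashMapVariable.containsKey("key")). If it exists, just increment its value; otherwise add it to the HashMap. This way you do not store the whole data set again: each word appears as a key exactly once, and its value is the number of times it occurred.

At the end of the loop, the most frequent word is the one with the highest counter value.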
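
Applied to the generator from the question, the words can be counted as they are produced, so neither the 900000-element list nor the Collections.frequency loop is needed. The following is only a sketch under assumptions: it reuses the question's constants (a = 445, c = 700001, m = 2097152, six letters alternating consonant and vowel), but the class name, the method names, and passing the seed as a parameter instead of reading it from System.in are my own choices. On ties it returns the word that first reaches the highest count, which may differ from the word the question's code would print.

```java
import java.util.HashMap;
import java.util.Map;

public class FunnyWordCounter {

    private static final String[] CONSONANTS =
        {"b","c","d","f","g","h","j","k","l","m","n","p","r","s","t","v","w","x","z"};
    private static final String[] VOWELS = {"a", "e", "i", "o", "u"};

    private static int xC; // current generator state, as in the question

    // Same linear congruential step as the question's randomGenerator().
    private static int next() {
        xC = (445 * xC + 700001) % 2097152;
        return xC;
    }

    // Builds one six-letter word: consonant, vowel, consonant, vowel, ...
    private static String nextWord() {
        StringBuilder sb = new StringBuilder(6);
        for (int d = 0; d < 6; d++) {
            int r = next();
            sb.append(d % 2 == 0 ? CONSONANTS[r % 19] : VOWELS[r % 5]);
        }
        return sb.toString();
    }

    // Generates `times` words and returns the most frequent one.
    // Assumes 0 <= seed < 2097152, so that 445 * seed cannot overflow an int.
    static String mostFrequent(int seed, int times) {
        xC = seed;
        Map<String, Integer> counts = new HashMap<>();
        String best = "";
        int bestCount = 0;
        for (int i = 0; i < times; i++) {
            String word = nextWord();
            int count = counts.merge(word, 1, Integer::sum); // updated count
            if (count > bestCount) {
                bestCount = count;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(mostFrequent(1, 900000));
    }
}
```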

Using a list will be very slow. There are plenty of other collections to consider. Why not try one?