Java: how to find the most frequent word among a large number of words (e.g. 900,000)
My task is to generate 900,000 random words and then print the most frequent one. Here is my algorithm:
1. Put all the words into a collection instead of printing them out.
2. for (900000...) { put the frequency of Collection[i] into another collection B }
** 900,000 x 900,000 comparisons is too much for a computer (far too inefficient)
3. Find the largest number in collection B, and its index.
4. Then Collection[index] is the output.
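The problem is that my computer can't handle step 2. I searched this site and found some answers about finding word frequencies in a bunch of words. I looked at the code in those answers, but I haven't found a way to apply it to such a large number of words.
Here is my code: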
/** Funny Words Generator
 * Tony
 */
import java.util.*;

public class WordsGenerator {
    // data field (can be accessed in whole class):
    private static int xC; // define a xCurrent so we can access it all over the class
    private static int n;
    private static String[] consonants = {"b","c","d","f","g","h","j","k","l","m","n","p","r","s","t","v","w","x","z"};
    private static String[] vowels = {"a", "e", "i", "o", "u"};
    private static String funnyWords = "";

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        int times = 900000; // words number
        xC = sc.nextInt(); // seeds (only input)
        /* Funny word list */
        ArrayList<String> wordsList = new ArrayList<String>();
        ArrayList<Integer> frequencies = new ArrayList<Integer>();
        int maxFreq;
        for (int i = 0; i < times; i++) {
            n = 6; // each words are 6 characters long
            funnyWords = ""; // reset the funnyWords each new time
            for (int d = 0; d < n; d ++) {
                int letterNum = randomGenerator(); /* random generator will generate numbers based on current x */
                int letterIndex = 0; /* letterNum % 19 or % 5 based on condition */
                if ((d + 1) % 2 == 0) {
                    letterIndex = letterNum % 5;
                    funnyWords += vowels[letterIndex];
                }
                else if ((d + 1) % 2 != 0) {
                    letterIndex = letterNum % 19;
                    funnyWords += consonants[letterIndex];
                }
            }
            wordsList.add(funnyWords);
        }
        /* put all frequencies of each words into an array called frequencies */
        for (int i = 0; i < 900000; i++) {
            frequencies.add(Collections.frequency(wordsList, wordsList.get(i)));
        }
        maxFreq = Collections.max(frequencies);
        int index = frequencies.indexOf(maxFreq); // get the index of the most frequent word
        System.out.print(wordsList.get(index));
        sc.close();
    }

    /** randomGenerator
     * param: N(generate times), seeds
     * return: update the xC and return it */
    private static int randomGenerator() {
        int a = 445;
        int c = 700001;
        int m = 2097152;
        xC = (a * xC + c) % m; // update
        return xC; // return
    }
}
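So I realized there may be a way to skip step 2. Can anyone give me a hint? Just a hint rather than code, so that I can try it myself, would be great! Thanks.
Edit:
I see that a lot of your answers contain "words.stream()". I googled it but couldn't find it. Can you tell me where I can learn about this? Which class is this stream method in? Thank you all!
You can use a HashMap, with the word as the key and the number of times it occurs as the value.
The pseudocode is as follows: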
// needs: import java.util.HashMap; import java.util.Map;
// strs is passed in as a parameter here; the original sketch left it uninitialized
static String demo(String[] strs) {
    int maxFrequency = 0;
    String maxFrequencyStr = "";
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (int i = 0; i < strs.length; i++) {
        if (map.containsKey(strs[i])) {
            // word seen before: bump its count
            int times = map.get(strs[i]);
            map.put(strs[i], times + 1);
            if (maxFrequency < times + 1) {
                maxFrequency = times + 1;
                maxFrequencyStr = strs[i];
            }
        } else {
            // first occurrence of this word
            map.put(strs[i], 1);
            if (maxFrequency < 1) {
                maxFrequency = 1;
                maxFrequencyStr = strs[i];
            }
        }
    }
    return maxFrequencyStr;
}
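The two branches above can be folded into one call if you are on Java 8 or later, since Map.merge inserts the value for a new key and otherwise combines it with the old one. This is only a sketch of the same idea; the class name MergeCounter and the sample words are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class MergeCounter {
    /** Returns the most frequent word, or null if the array is empty. */
    static String mostFrequent(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String word : words) {
            // merge stores 1 for a new key, otherwise adds 1 to the old value,
            // and returns the updated count
            int count = counts.merge(word, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] sample = {"bab", "cec", "bab", "dod", "bab", "cec"};
        System.out.println(mostFrequent(sample)); // prints "bab"
    }
}
```

Each word costs one hash lookup on average, so the whole pass is roughly linear in the number of words instead of quadratic.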
This can basically be broken down into two steps:
1. Compute the word frequencies, as a map from each word to its count. There are several options for doing this.
2. Compute the maximum entry of this map, where "maximum" means the entry with the highest value.
So if you are really up for it, you can write it very compactly:
private static <T> T maxCountElement(List<? extends T> list)
{
    return Collections.max(list.stream().collect(Collectors.groupingBy(
        Function.identity(), Collectors.counting())).entrySet(),
        (e0, e1) -> Long.compare(e0.getValue(), e1.getValue())).getKey();
}
You can do this with Java lambdas (requires JDK 8). Also note that your word list may contain several words with the same frequency.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("World");
        words.add("Hello");
        words.add("World");
        words.add("Hello");
        // Imagine we have 900000 words in the word list
        Set<Map.Entry<String, Integer>> set = words.stream()
            // Here we create a map of the unique words and calculate their frequency
            .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum)).entrySet();
        // Find the max frequency
        int max = Collections
            .max(set, (a, b) -> Integer.compare(a.getValue(), b.getValue())).getValue();
        // We can have words with the same frequency, like in my word list. Let's get them all
        List<String> list = set.stream()
            .filter(entry -> entry.getValue() == max)
            .map(Map.Entry::getKey).collect(Collectors.toList());
        System.out.println(list); // [Hello, World]
    }
}
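On the edit in the question about where stream() lives: it is a default method of the java.util.Collection interface, added in Java 8, so every List and Set has it. The Stream type and the Collectors helpers are in the java.util.stream package, and the Javadoc of that package is a reasonable place to start reading. A minimal sketch, with the class name StreamBasics and the method distinctWords made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamBasics {
    /** Returns the distinct words of the list, in first-seen order. */
    static List<String> distinctWords(List<String> words) {
        // stream() is declared on java.util.Collection, which List extends
        Stream<String> stream = words.stream();
        return stream.distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("World", "Hello", "World", "Hello");
        System.out.println(distinctWords(words)); // [World, Hello]
        // streams are built from small chained steps such as filter and count
        System.out.println(words.stream().filter(w -> w.startsWith("H")).count()); // 2
    }
}
```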
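A HashMap is one of the fastest data structures for this. Just loop over every word, use it as the HashMap key, and inside the loop keep a counter as the HashMap value.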
HashMap<String, Integer> hashMapVariable = new HashMap<>();
...
// inside the loop over the words
if (hashMapVariable.containsKey(word)) {
    hashMapVariable.put(word, hashMapVariable.get(word) + 1);
} else {
    hashMapVariable.put(word, 1);
}
...
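For every key (word), just increment the value associated with that key. You do have to check whether the key already exists, though (in Java that is hashMapVariable.containsKey("key")). If it exists, just increment it; otherwise add it to the HashMap. This way you are not storing the whole data again: each word is kept only once as a key, with the number of times it occurs as that key's value.
At the end of the loop, the most frequent word will have the highest counter/value.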
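Since only the counts are needed, the 900,000-word list does not have to be kept at all: each word can be counted right after it is generated. The sketch below assumes the same generator constants and letter tables as the question, and assumes the seed is between 0 and 2097151 so that the int arithmetic neither overflows nor goes negative. The class name FunnyWordCounter and the method mostFrequentWord are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class FunnyWordCounter {
    private static final String CONSONANTS = "bcdfghjklmnprstvwxz"; // the question's 19 consonants
    private static final String VOWELS = "aeiou";

    /**
     * Generates `times` six-letter words from `seed` and returns the most
     * frequent one (null if times is 0). Assumes 0 <= seed < 2097152.
     */
    static String mostFrequentWord(int seed, int times) {
        Map<String, Integer> counts = new HashMap<>();
        int x = seed;
        String best = null;
        int bestCount = 0;
        for (int i = 0; i < times; i++) {
            StringBuilder word = new StringBuilder(6);
            for (int d = 0; d < 6; d++) {
                x = (445 * x + 700001) % 2097152; // same update rule as the question
                // positions 0, 2, 4 are consonants; positions 1, 3, 5 are vowels
                word.append(d % 2 == 0 ? CONSONANTS.charAt(x % 19) : VOWELS.charAt(x % 5));
            }
            // count the word immediately instead of storing it in a list
            int count = counts.merge(word.toString(), 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = word.toString();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(mostFrequentWord(1, 900000));
    }
}
```

If several words tie for the highest count, this returns the one that reached that count first, which may differ from the word the question's indexOf approach would pick.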
Using a list will be very slow. There are plenty of other collections to consider. Why not try one?