Java: How to extract frequently occurring words from text extracted by Tika

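I used the code below (with Tika) to extract text from several file formats (PDF, HTML, DOC).

Now I need to get the frequently occurring words from the extracted content. Can you suggest how to do that?

Thanks.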

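Here is a function that finds the most frequent word.

Pass the extracted content to the function and it returns the word that occurs most often.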

import java.util.HashMap;
import java.util.Map;

String getMostFrequentWord(String input) {
    // Split on any run of whitespace: extracted text contains newlines and tabs, not only single spaces
    String[] words = input.trim().split("\\s+");
    // Create a dictionary using word as key, and frequency as value
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (String word : words) {
        if (word.isEmpty()) {
            continue; // an empty input splits into one empty token; do not count it
        }
        // Insert 1 for a new word, otherwise add 1 to the stored count
        dictionary.merge(word, 1, Integer::sum);
    }

    // Scan the dictionary for the entry with the highest count
    int max = 0;
    String mostFrequentWord = "";
    for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
        if (entry.getValue() > max) {
            max = entry.getValue();
            mostFrequentWord = entry.getKey();
        }
    }

    return mostFrequentWord;
}

The algorithm is O(n) in the number of words, so performance should be fine.
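The question asks for frequent words in the plural, while the function above returns only one. A minimal sketch of a top-N variant follows. The class name `TopWords`, the method `topWords`, and the parameter `n` are hypothetical names chosen for illustration, and the lower-casing and punctuation stripping are assumptions about what counts as the same word, not something the original answer specifies. Sorting the distinct words makes this O(n + m log m) for m distinct words rather than strictly O(n).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TopWords {
    /** Returns up to n words with the highest counts, most frequent first. */
    static List<String> topWords(String input, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: case-insensitive matching; tokens are runs of letters or digits
        for (String word : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by count descending; break ties alphabetically so the order is deterministic
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(entry.getKey());
        }
        return result;
    }
}
```

Calling `TopWords.topWords(content, 10)` on the extracted text would return at most ten words. In practice very common words such as "the" or "and" tend to dominate the result, so filtering them out with a stop-word list before counting may be worthwhile.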

Yes, the content is stored in a JSON object.