Java: How to count the occurrences of words in a text


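I'm working on a project: write a program that finds the 10 most frequently used words in a text. I'm stuck and don't know what to do next. Can someone help me?

This is as far as I've got: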

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Lab4 {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner file = new Scanner(new File("text.txt")).useDelimiter("[^a-zA-Z]+");
        List<String> words = new ArrayList<String>();
        while (file.hasNext()){
            String tx = file.next();
            // String x = file.next().toLowerCase();
            words.add(tx);
        }
        Collections.sort(words);
        // System.out.println(words);
    }
}


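You can use a Guava Multiset; here is a word count example:

Here is how to find the words with the highest counts in a Multiset:

Update: I wrote this answer in 2012. Since then we have got Java 8, and it is now possible to find the 10 most used words in a few lines without external libraries: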

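// requires static imports of java.util.stream.Collectors.toMap,
// java.util.stream.Collectors.toList and java.util.Comparator.comparing
List<String> words = ...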

// map the words to their count
Map<String, Integer> frequencyMap = words.stream()
         .collect(toMap(
                s -> s, // key is the word
                s -> 1, // value is 1
                Integer::sum)); // merge function counts the identical words

// find the top 10
List<String> top10 = words.stream()
        .sorted(comparing(frequencyMap::get).reversed()) // sort by descending frequency
        .distinct() // take only unique values
        .limit(10)   // take only the first 10
        .collect(toList()); // put it in a returned list

System.out.println("top10 = " + top10);
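For anyone who wants to run the snippet above as-is, here is a minimal self-contained sketch of the same idea. The class name `TopWordsDemo`, the method `topWords` and the sample word list are made up for illustration; the comparator uses an explicitly typed lambda instead of a method reference, which is only a stylistic choice.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TopWordsDemo {
    // Returns up to `limit` distinct words, most frequent first.
    // Hypothetical helper, not part of the original answer.
    static List<String> topWords(List<String> words, int limit) {
        // map each word to the number of times it occurs
        Map<String, Integer> frequencyMap = words.stream()
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));

        return words.stream()
                // sort by descending frequency (stable, so ties keep input order)
                .sorted(Comparator.comparing((String w) -> frequencyMap.get(w)).reversed())
                .distinct()   // keep each word once
                .limit(limit) // keep only the first `limit` words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("b", "a", "b", "c", "a", "b");
        System.out.println(topWords(words, 2)); // prints [b, a]
    }
}
```

Note that this sorts the full word list, duplicates included, before de-duplicating it. For large inputs it is probably cheaper to sort the entries of the frequency map instead.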

Create a map to keep track of the occurrences, like so:

Scanner file = new Scanner(new File("text.txt")).useDelimiter("[^a-zA-Z]+");
HashMap<String, Integer> map = new HashMap<>();

while (file.hasNext()) {
    String word = file.next().toLowerCase();
    if (map.containsKey(word)) {
        map.put(word, map.get(word) + 1);
    } else {
        map.put(word, 1); // the first occurrence counts as 1, not 0
    }
}

// sort the entries by ascending count
ArrayList<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());
Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {

    @Override
    public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        return a.getValue().compareTo(b.getValue());
    }
});

// print up to 10 of the most frequent words, walking the list from the end
for (int i = 0; i < 10 && i < entries.size(); i++) {
    System.out.println(entries.get(entries.size() - i - 1).getKey());
}
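On Java 8 and later, the `containsKey`/`put` branch can be collapsed into a single `Map.merge` call. The sketch below assumes the text is already in a `String` rather than a file, and the names `MergeCountDemo` and `countWords` are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

class MergeCountDemo {
    // Counts case-insensitive occurrences of each run of letters in `text`.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("[^a-zA-Z]+")) {
            while (scanner.hasNext()) {
                // inserts 1 for a new word, otherwise adds 1 to the stored count
                counts.merge(scanner.next().toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("The fox and the Fox");
        System.out.println("the -> " + counts.get("the"));
        System.out.println("fox -> " + counts.get("fox"));
        System.out.println("and -> " + counts.get("and"));
    }
}
```

`merge` starts every word at 1, so there is no separate initial value to get wrong.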


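Read the input from a file or from the command line into a String and pass it to the method below. It returns a map with the words as keys and, as values, the number of times each word occurs in that sentence or paragraph.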

public Map<String, Integer> getWordsWithCount(String sentences)
{
    Map<String, Integer> wordsWithCount = new HashMap<String, Integer>();

    // split on runs of whitespace so repeated spaces do not produce empty "words"
    String[] words = sentences.trim().split("\\s+");
    for (String word : words)
    {
        if (word.isEmpty())
        {
            continue; // an empty input still yields one empty token
        }
        if (wordsWithCount.containsKey(word))
        {
            wordsWithCount.put(word, wordsWithCount.get(word) + 1);
        }
        else
        {
            wordsWithCount.put(word, 1);
        }
    }

    return wordsWithCount;
}
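The map on its own does not answer the original question yet, because its entries still have to be ordered by count. A minimal sketch of that last step follows; the class `TopFromMapDemo`, the helper `topN` and the sample counts are hypothetical, and ties are broken alphabetically only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopFromMapDemo {
    // Returns up to n keys of the map, highest count first.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // descending by count, then ascending by word for equal counts
        entries.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("fox", 2);
        counts.put("dog", 2);
        counts.put("lazy", 1);
        System.out.println(topN(counts, 3)); // prints [the, dog, fox]
    }
}
```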


Here is a shorter version than lbalazscs's, which also uses the Java 8 Stream API:

Arrays.stream(new String(Files.readAllBytes(PATH_TO_FILE), StandardCharsets.UTF_8).split("\\W+"))
            .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()))
            .entrySet()
            .stream()
            .sorted(((o1, o2) -> o2.getValue().compareTo(o1.getValue())))
            .limit(10)
            .forEach(System.out::println);


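This does everything in one go: it loads the file, splits it on non-word characters, groups everything by word, assigns each group its word count, and then prints the top ten words together with their counts.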
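To try the pipeline without a file on disk, the same steps can be applied to a `String`. This is only a sketch: the names `GroupingDemo` and `topWords` are made up, and the lower-casing, the empty-token filter and the alphabetical tie-break are additions that the one-liner above does not have.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class GroupingDemo {
    // Returns up to `limit` (word, count) entries, highest count first.
    static List<Map.Entry<String, Long>> topWords(String text, int limit) {
        return Arrays.stream(text.split("\\W+"))
                .filter(w -> !w.isEmpty())   // split yields "" if the text starts with a separator
                .map(String::toLowerCase)
                // group identical words and count the members of each group
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet()
                .stream()
                // descending by count, then ascending by word for equal counts
                .sorted((a, b) -> {
                    int byCount = b.getValue().compareTo(a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        topWords("The fox and the dog and the cat", 2).forEach(System.out::println);
    }
}
```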

For an in-depth discussion of a very similar setup, see:

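A List of words is not enough; you also need to count each occurrence of a word. What data structure would you use for such a task? (This is clearly homework, which is why I am asking the question.)

I think you have a bug when reading the file. file.next() will eventually be null, so you should check for that value.

Downvoted, because using a library for such a simple task is overkill.

Who says the OP should use "only" Guava for this task? For good Java programmers, Guava is like the standard collections. You just have to know it. Multimap will hopefully be added to Java 8.

Sorry sir, I'm not a Java developer (in fact I hate it), so I didn't know Guava was such a thing. The point is that the OP's wording and the specific problem made me believe he is probably just starting out, and introducing a third-party dependency at that stage is a bad idea.

You should downvote if an answer is "not useful"; you should not downvote because you think the answer is "too advanced". Stack Overflow is also there for future reference, and you don't know who will find this solution elegant and useful in the future…

While your comment is completely right and changed my view a bit, I still think my
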
package src;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.Map.Entry;

public class ScannerTest
{
    public static void main(String[] args) throws FileNotFoundException
        {
        Scanner scanner = new Scanner(new File("G:/Script_nt.txt")).useDelimiter("[^a-zA-Z]+");
        Map<String, Integer> map = new HashMap<String, Integer>();
        while (scanner.hasNext())
            {
            String word = scanner.next();
            if (map.containsKey(word))
                {
                map.put(word, map.get(word)+1);
                }
            else
                {
                map.put(word, 1);
                }
            }

        List<Map.Entry<String, Integer>> entries = new ArrayList<Entry<String,Integer>>( map.entrySet());

        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {

            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return a.getValue().compareTo(b.getValue());
            }
        });

        for(int i = 0; i < map.size(); i++){
            System.out.println(entries.get(entries.size() - i - 1).getKey()+" "+entries.get(entries.size() - i - 1).getValue());
        }
        }
}
