Java: counting occurrences of a specific string in a file


java regex string


Here is the code I wrote:

while ((lineContents = tempFileReader.readLine()) != null)
{
    String lineByLine = lineContents.replaceAll("/\\.", System.getProperty("line.separator")); //for matching /. and replacing it by new line
    changer.write(lineByLine);
    Pattern pattern = Pattern.compile("\\r?\\n"); //Find new line
    Matcher matcher = pattern.matcher(lineByLine);
    while(matcher.find())
    {
        Pattern tagFinder = Pattern.compile("word"); //Finding the word required
        Matcher tagMatcher = tagFinder.matcher(lineByLine);
        while(tagMatcher.find())
        {
            score++;
        }
        scoreTracker.add(score);
        score = 0;
    }
}

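My sample input contains 6 lines, and the number of occurrences of "word" on each line is [0,1,0,3,0,0]. So when I print scoreTracker (which is an ArrayList), I need the output above. Instead I get [4,4,4,4,4,4], which is the total number of occurrences of the word rather than the count for each line.

Please help.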

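lineByLine refers to the entire contents of the file. That is why you get [4,4,4,4,4,4]. You need to store each line in another variable, line, and then use tagFinder.matcher(line). The final code looks like this: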

while ((lineContents = tempFileReader.readLine()) != null)
{
    // Replace every "/." with the platform line separator
    String lineByLine = lineContents.replaceAll("/\\.", System.getProperty("line.separator"));
    changer.write(lineByLine);
    // Match one whole line including its terminator; text after the last newline is not matched
    Pattern pattern = Pattern.compile(".*\\r?\\n");
    Matcher matcher = pattern.matcher(lineByLine);
    while (matcher.find())
    {
        Pattern tagFinder = Pattern.compile("word"); // the word to count
        // matcher.group() returns the input subsequence matched by the previous match, i.e. the current line
        Matcher tagMatcher = tagFinder.matcher(matcher.group());
        while (tagMatcher.find())
        {
            score++;
        }
        scoreTracker.add(score);
        score = 0;
    }
}
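
To see the two loops side by side, here is a minimal, self-contained sketch. The class name LineCountDemo, the method names and the sample text are made up for illustration, and the sample assumes that every line ends with a newline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineCountDemo {
    private static final Pattern WORD = Pattern.compile("word");

    // Counts the matches of WORD in the given text.
    public static int count(CharSequence text) {
        Matcher m = WORD.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Original approach: one entry per newline, but every entry scans the whole text.
    public static List<Integer> countWholeTextPerNewline(String text) {
        List<Integer> scores = new ArrayList<>();
        Matcher newline = Pattern.compile("\\r?\\n").matcher(text);
        while (newline.find()) {
            scores.add(count(text));
        }
        return scores;
    }

    // Corrected approach: each match is one line, and only that line is scanned.
    public static List<Integer> countPerLine(String text) {
        List<Integer> scores = new ArrayList<>();
        Matcher line = Pattern.compile(".*\\r?\\n").matcher(text);
        while (line.find()) {
            scores.add(count(line.group()));
        }
        return scores;
    }

    public static void main(String[] args) {
        String text = "a\nword\nb\nword word word\nc\nd\n";
        System.out.println(countWholeTextPerNewline(text)); // [4, 4, 4, 4, 4, 4]
        System.out.println(countPerLine(text));             // [0, 1, 0, 3, 0, 0]
    }
}
```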

Maybe this code will help you:

    String str = "word word\n \n word word\n \n word\n";
    Pattern pattern = Pattern.compile("(.*)\\r?\\n"); //Find new line
    Matcher matcher = pattern.matcher(str);
    while(matcher.find())
    {
        Pattern tagFinder = Pattern.compile("word"); //Finding the word required
        Matcher tagMatcher = tagFinder.matcher(matcher.group());
        int score = 0;
        while(tagMatcher.find())
        {
            score++;
        }
        System.out.print(score + " ");
    }

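The output is

2 0 2 0 1

It is not highly optimized, but your problem is that you never restricted the inner match, so it always scanned the whole of lineByLine.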
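
One way to trim the repeated work is to compile both patterns once and reuse a single inner Matcher through reset(). The following is only a sketch under that assumption; the class name ReusedMatcherDemo and the method countPerLine are hypothetical, and Pattern.quote is added so that the search text is treated literally rather than as a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReusedMatcherDemo {
    // Returns one count per newline-terminated line of text.
    public static List<Integer> countPerLine(String text, String literal) {
        // Both patterns are compiled once, outside the loops.
        Pattern linePattern = Pattern.compile("(.*)\\r?\\n");
        Pattern wordPattern = Pattern.compile(Pattern.quote(literal));

        List<Integer> scores = new ArrayList<>();
        Matcher lineMatcher = linePattern.matcher(text);
        Matcher wordMatcher = wordPattern.matcher("");
        while (lineMatcher.find()) {
            // Point the existing matcher at the current line instead of creating a new one.
            wordMatcher.reset(lineMatcher.group(1));
            int score = 0;
            while (wordMatcher.find()) {
                score++;
            }
            scores.add(score);
        }
        return scores;
    }

    public static void main(String[] args) {
        String str = "word word\n \n word word\n \n word\n";
        System.out.println(countPerLine(str, "word")); // [2, 0, 2, 0, 1]
    }
}
```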

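You can use the Scanner class. Set the string you want to count as the scanner's delimiter, then count the tokens the scanner finds.

You can also initialize the scanner directly with a FileInputStream.

The resulting code is short: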

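File file = new File(fileName);
Scanner scanner = new Scanner(file);
// useDelimiter takes a regex, so quote the literal text being counted
scanner.useDelimiter(Pattern.quote("your text here"));
int occurrences = 0; // must be initialized before it is incremented
while (scanner.hasNext()) {
    // Each token is a stretch of text between delimiters, so this counts tokens,
    // not delimiters: "a word b" yields two tokens for one occurrence of "word".
    scanner.next();
    occurrences++;
}
scanner.close();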
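
If you want the scanner to count the matches themselves rather than the tokens between them, Scanner.findWithinHorizon is one option. This is a sketch, not the answer's original method: the class name ScannerCountDemo and the method countOccurrences are hypothetical, and a StringReader stands in for the file. Note that a horizon of 0 lets the scanner search without bound, so it may buffer a large part of the input.

```java
import java.io.StringReader;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerCountDemo {
    // Counts how many times the literal text occurs anywhere in the source.
    public static int countOccurrences(Readable source, String literal) {
        Pattern pattern = Pattern.compile(Pattern.quote(literal));
        int occurrences = 0;
        try (Scanner scanner = new Scanner(source)) {
            // findWithinHorizon ignores delimiters and returns null once no match is left.
            while (scanner.findWithinHorizon(pattern, 0) != null) {
                occurrences++;
            }
        }
        return occurrences;
    }

    public static void main(String[] args) {
        String text = "word a word\nb\nword word";
        System.out.println(countOccurrences(new StringReader(text), "word")); // 4
    }
}
```

For a real file, a FileReader or any other Readable could be passed in place of the StringReader.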

This happens because you search the same string (lineByLine) every time. You probably intended to search each line separately. I suggest:

    Pattern tagFinder = Pattern.compile("word"); // the word to count
    for (String line : lineByLine.split("\\r?\\n"))
    {
        Matcher tagMatcher = tagFinder.matcher(line);
        while (tagMatcher.find())
            score++;
        scoreTracker.add(score);
        score = 0;
    }

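The original code uses tempFileReader.readLine() to read the input one line at a time, and then uses matcher to look for line endings within each line. readLine() returns the line without its terminator, so lineContents holds a single line and, unless the replaceAll call inserts a separator, matcher never finds a newline and the rest of the code is skipped. Why do you need two different pieces of code to split the input into lines? You can remove one of them. For example: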

while ((lineContents = tempFileReader.readLine()) != null)
{
    Pattern tagFinder = Pattern.compile("word"); //Finding the word required
    Matcher tagMatcher = tagFinder.matcher(lineContents);
    while(tagMatcher.find())
    {
        score++;
    }
    scoreTracker.add(score);
    score = 0;
}
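
For completeness, here is a runnable sketch of the same loop with its surrounding setup. The class name PerLineReaderDemo and the method countPerLine are hypothetical, and a StringReader stands in for the file so the example runs without any file on disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerLineReaderDemo {
    // Reads the source line by line and returns one count per line.
    public static List<Integer> countPerLine(Reader source, String literal) throws IOException {
        Pattern tagFinder = Pattern.compile(Pattern.quote(literal));
        List<Integer> scoreTracker = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader reader = new BufferedReader(source)) {
            String lineContents;
            while ((lineContents = reader.readLine()) != null) {
                Matcher tagMatcher = tagFinder.matcher(lineContents);
                int score = 0;
                while (tagMatcher.find()) {
                    score++;
                }
                scoreTracker.add(score);
            }
        }
        return scoreTracker;
    }

    public static void main(String[] args) throws IOException {
        String sample = "a\nword\nb\nword word word\nc\nd";
        System.out.println(countPerLine(new StringReader(sample), "word")); // [0, 1, 0, 3, 0, 0]
    }
}
```

Because readLine() also returns a final line that has no terminator, the last line is counted even when the file does not end with a newline.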

I tried the code above on Windows with a file test.txt read through a BufferedReader. For example:

BufferedReader tempFileReader = new BufferedReader(new FileReader("c:\\test\\test.txt"));

scoreTracker contained [0,1,0,3,0,0] for a file with the contents you described.

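If the sample input is the actual file as described and tempFileReader is a BufferedReader, I don't understand how you got [4,4,4,4,4,4] from the original code. It would be useful to see the code that sets up tempFileReader.

But that is why I first find the new lines in the string and then do the scoring inside the matcher's while loop. Am I wrong?

@KazekageGaara There are two problems in your code. First, the first regex pattern only finds the newline separator; it does not capture the line itself, so you need to change the regex to (.*)\\r?\\n. Second, you call matcher.find() but never call matcher.group() to extract the match. Make these two changes and it should work. For details on the Matcher object, see its documentation.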
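
To illustrate what the two changes buy you, the sketch below shows the difference between matcher.group(), which returns the whole match including the line terminator, and matcher.group(1), which returns only the captured line. The class name GroupDemo and the method extractLines are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Collects the given group of every match of (.*)\r?\n in text.
    // Group 0 is the whole match; group 1 is the line without its terminator.
    public static List<String> extractLines(String text, int group) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = Pattern.compile("(.*)\\r?\\n").matcher(text);
        while (matcher.find()) {
            lines.add(matcher.group(group));
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "first word\r\nsecond\n";
        System.out.println(extractLines(text, 1)); // [first word, second]
        // The group 0 strings are longer because they keep "\r\n" and "\n".
        System.out.println(extractLines(text, 0).get(0).length()); // 12
    }
}
```

For counting a word either group works, since the terminator never contains the word; group(1) matters when the line text itself is needed.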