Removing stop words after tokenizing a string in Java


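I want to remove stop words after tokenizing a string. I have an external .txt file of stop words that I read and then compare against the tokens. If a token equals a stop word, it should be removed.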

Here is the tokenizing code:

try {
    while ((msg = readBufferData.readLine()) != null) {
        int numberOfTokens;

        System.out.println("Before: " + msg);
        StringTokenizer tokens = new StringTokenizer(msg);

        numberOfTokens = tokens.countTokens();
        System.out.println("Tokens: " + numberOfTokens);

        System.out.print("After : ");
        while (tokens.hasMoreTokens()) {
            msg = tokens.nextToken();
            String msgLower = msg.toLowerCase();
            String punctuationremove = punctuationRemover(msgLower);
            // buffWriter.write(punctuationremove + " "); --> write into file .txt
            System.out.print(punctuationremove + " ");
            removingStopWord(punctuationremove, readStopWordsFile());
            numberOfTotalTokens++;
        }
        // buffWriter.newLine(); make a new line after tokenizing a new message
        System.out.println("\n");
        numberOfMessages++;
    }
    // write close: buffWriter.close();
    System.out.println("Total Tokens: " + numberOfTotalTokens);
    System.out.println("Total Messages: " + numberOfMessages);
}
catch (Exception e) {
    System.out.println("Error Exception: " + e.getMessage());
}
Then I have code that reads the stop-word file:

public static Set<String> readStopWordsFile() throws FileNotFoundException, IOException{
    String fileStopWords = "\\stopWords.txt";

    Set<String> stopWords = new LinkedHashSet<String>();
    FileReader readFileStopWord = new FileReader(fileStopWords);
    BufferedReader stopWordsFile = new BufferedReader(readFileStopWord);

    String line;

    while((line = stopWordsFile.readLine())!=null){
        line = line.trim();
        stopWords.add(line);
    }
    stopWordsFile.close();
    return stopWords;
}
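As a hedged variation on the reader above, the sketch below closes the reader with try-with-resources, lower-cases every entry, and skips blank lines, so that later lookups against lower-cased tokens can match. The class name StopWordLoader, the method load, and the in-memory sample data are assumptions made for illustration; they are not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not from the original post.
class StopWordLoader {
    // Reads one stop word per line, trimming and lower-casing each entry
    // and skipping blank lines. The reader is closed when the method returns.
    static Set<String> load(Reader source) throws IOException {
        Set<String> stopWords = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        }
        return stopWords;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for stopWords.txt so the sketch runs without a file.
        Set<String> stopWords = load(new StringReader("a\n Is \n\nand\n"));
        System.out.println(stopWords); // prints [a, is, and]
    }
}
```

To read a real file, a FileReader for the stop-word file can be passed to load in place of the StringReader.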

How can I compare the tokens against the set of stop words and remove the tokens that match a stop word? Could you help me? Thank you.

You just need to read the stop words first, then check whether each token is a stop word:

Set<String> stopWords = readStopWordsFile();

// some file reading logic
while (tokens.hasMoreTokens()) {
    msg = tokens.nextToken();
    if (stopWords.contains(msg)) {
        continue; // skip over a stopword token
    }
    // msg is not a stop word here: print or write it
}
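The check above can be put together into a small runnable sketch. Note that in the question's loop each token is printed before removingStopWord is called, so the printed output cannot change whatever that method does; the stop-word file is also re-read for every token. The sketch below normalizes each token first and only keeps it if the set does not contain it. The class StopWordFilter, the normalize method (a stand-in for the question's punctuationRemover, whose body is not shown), and the sample stop words are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper class, not from the original post.
class StopWordFilter {
    // Lower-cases the token and strips every character that is not a
    // letter or digit (assumed stand-in for punctuationRemover).
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Returns the normalized tokens of the message that are not stop words.
    static List<String> removeStopWords(String message, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(message);
        while (tokens.hasMoreTokens()) {
            String token = normalize(tokens.nextToken());
            if (token.isEmpty() || stopWords.contains(token)) {
                continue; // skip empty tokens and stop words
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample stop words; in the question these come from readStopWordsFile(),
        // which should be called once, outside the token loop.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "although", "is", "and"));
        List<String> kept = removeStopWords("My name is Ricky.", stopWords);
        System.out.println(String.join(" ", kept)); // prints: my name ricky
    }
}
```

Collecting the kept tokens into a list, rather than printing inside the loop, keeps the filtering step separate from the output step, so the same result can be printed or written to a file.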

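It still cannot remove the stop words. Thanks, by the way.

@RickyKristianButar-butar What do you mean by that?

It gives the same output as before. It does not remove the stop words.

@RickyKristianButar-butar Can you tell me what msg looks like and what stopWords contains? Otherwise, the only advice I can give you is to use a debugger.

msg is the output after tokenizing the string. For example, for "my name is ricky" the output is: my | name | is | ricky, the same as splitting the string into words. The stop-word file contains words that carry no information, for example: a, although, is, and, and so on. So I need to remove all stop words from the tokenized result.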
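One way to answer the question about what msg and stopWords actually contain, short of a debugger, is to print each raw token next to the result of the contains check. The sketch below is only a diagnostic aid under assumed sample data; the class StopWordDebug and the method matches are hypothetical names. Set.contains on strings is an exact, case-sensitive comparison, so a token such as "Is" or "is," will not match the stop word "is".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical diagnostic class, not from the original post.
class StopWordDebug {
    // Maps each raw token to whether the stop-word set contains it exactly.
    // A repeated token simply overwrites its earlier entry.
    static Map<String, Boolean> matches(String message, Set<String> stopWords) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        StringTokenizer tokens = new StringTokenizer(message);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            result.put(token, stopWords.contains(token));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample stop words taken from the examples mentioned in the comment.
        Set<String> stopWords = new LinkedHashSet<>(Arrays.asList("a", "although", "is", "and"));
        // Brackets make stray whitespace or punctuation in a token visible.
        matches("my name Is ricky", stopWords)
            .forEach((token, isStop) -> System.out.println("[" + token + "] stop word? " + isStop));
    }
}
```

If a token that should be a stop word shows false here, the token and the stored stop word differ in case, punctuation, or surrounding whitespace, and both sides need the same normalization before the lookup.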