Java: removing stop words from a file - multiple checks duplicate the content instead of removing the words

java file-io

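I'm trying to go through a bunch of files, read each one, and remove every stop word that appears in a specified list. The result is a disaster: the entire content of the file gets copied over and over again.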

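What I have tried:
- Saving the file as a String and trying to find the words with a regular expression
- Saving the file as a String, going through it line by line, and comparing the tokens with the stop words stored in a LinkedHashSet (I could also keep them in a file)
- Twisting the logic below in various ways, with increasingly absurd results
- Using the .contains() method to look through the text/line, without success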

My general logic is as follows:

for every word in the stopwords set:
    while(file has more lines):
        save current line into String
        while (current line has more tokens):
            assign current token into String
            compare token with current stopword:
                if(token equals stopword):
                     write in the output file "" + " " 
                else: write in the output file the token as is

There are plenty of other questions on this topic, but none of them fits my needs.

The real code is below:

private static void removeStopWords(File fileIn) throws IOException {
        File stopWordsTXT = new File("stopwords.txt");
        System.out.println("[Removing StopWords...] FILE: " + fileIn.getName() + "\n");

        // create file reader and go over it to save the stopwords into the Set data structure
        BufferedReader readerSW = new BufferedReader(new FileReader(stopWordsTXT));
        Set<String> stopWords = new LinkedHashSet<String>();

        for (String line; (line = readerSW.readLine()) != null; readerSW.readLine()) {
            // trim() eliminates leading and trailing spaces
            stopWords.add(line.trim());
        }           

        File outp = new File(fileIn.getPath().substring(0, fileIn.getPath().lastIndexOf('.')) + "_NoStopWords.txt");
        FileWriter fOut = new FileWriter(outp);

        Scanner readerTxt = new Scanner(new FileInputStream(fileIn), "UTF-8");
        while(readerTxt.hasNextLine()) {
            String line = readerTxt.nextLine();
            System.out.println(line);
            Scanner lineReader = new Scanner(line);

            for (String curSW : stopWords) {
                while(lineReader.hasNext()) {
                    String token = lineReader.next();
                    if(token.equals(curSW)) {
                        System.out.println("---> Removing SW: " + curSW);
                        fOut.write("" + " ");
                    } else {
                        fOut.write(token + " ");
                    }
                }
            }
            fOut.write("\n");
        }       
        fOut.close();
}

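By token I mean a word, i.e. I take each word from the line and compare it with the current stop word.

After debugging for a while, I believe I have found a solution. This problem was rather tricky, since you have to work with several different scanners and file readers. Here is what I did: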

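I changed the way you add entries to the stopWords set, because it wasn't adding them correctly. I read each line with a BufferedReader, then read each word of that line with a Scanner, and add each word to the set.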
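Judging from the question's code, one likely reason the original loop did not fill the set correctly is that its `for` header calls `readLine()` a second time in the update clause, so every other line is read and thrown away; it also stores a whole line as a single entry even when that line holds several words. The sketch below reproduces both effects on an in-memory string and contrasts them with a loader that splits each line into words. The class name `StopWordLoader`, the method names, and the sample text are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

class StopWordLoader {

    // Mirrors the loop from the question: the update clause calls readLine()
    // again, so every second line is silently discarded.
    static Set<String> loadLikeQuestion(String text) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        Set<String> stopWords = new LinkedHashSet<>();
        for (String line; (line = reader.readLine()) != null; reader.readLine()) {
            stopWords.add(line.trim());
        }
        return stopWords;
    }

    // Reads every line and splits it on whitespace: one set entry per word.
    static Set<String> loadPerWord(String text) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        Set<String> stopWords = new LinkedHashSet<>();
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {   // a blank line yields one empty string
                    stopWords.add(word);
                }
            }
        }
        return stopWords;
    }

    public static void main(String[] args) throws IOException {
        String sample = "a an\nthe\nof\n";            // hypothetical stop-word file content
        System.out.println(loadLikeQuestion(sample)); // prints [a an, of]
        System.out.println(loadPerWord(sample));      // prints [a, an, the, of]
    }
}
```

With the first loader, `"the"` never reaches the set and `"a an"` can only match a token that is literally `a an`, which a whitespace-delimited scanner never produces.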

Then, for the comparison, I got rid of one of your loops, because you can simply use the .contains() method to check whether a word is a stop word.
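As a rough illustration of that single-loop comparison, the sketch below walks the tokens of one line exactly once and keeps only those that are not in the set. The class `LineFilter`, the helper `filterLine`, and the sample words are hypothetical. The match is exact and case-sensitive, so a token such as `The` or `the,` would not be treated as the stop word `the`.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.StringJoiner;

class LineFilter {

    // Returns the line with every token found in stopWords dropped;
    // the remaining tokens are joined with single spaces.
    static String filterLine(String line, Set<String> stopWords) {
        StringJoiner kept = new StringJoiner(" ");
        try (Scanner tokens = new Scanner(line)) {
            while (tokens.hasNext()) {
                String token = tokens.next();
                if (!stopWords.contains(token)) {   // one set lookup per token
                    kept.add(token);
                }
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new LinkedHashSet<>(Arrays.asList("the", "a", "of"));
        System.out.println(filterLine("the quick fox is a friend of mine", stopWords));
        // prints: quick fox is friend mine
    }
}
```

Because each token is visited once and written at most once, the output cannot grow beyond the input, whatever the size of the stop-word set.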

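I've left the part that writes the file with the stop words removed to you, since I'm sure you can figure it out now that everything else is working.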

- My sample stop-words txt file: 停止语言语

- My sample input file is exactly the same, so it should catch all three words

The code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Scanner;
import java.util.Set;

public class StopWordCheck {
    public static void main(String[] args) throws IOException {
        // Read the stop-word file line by line, then split each line into words,
        // so that several stop words on one line are stored as separate entries.
        Set<String> stopWords = new LinkedHashSet<String>();
        try (BufferedReader readerSW = new BufferedReader(new FileReader("stopWords.txt"))) {
            String stopWordsLine;
            while ((stopWordsLine = readerSW.readLine()) != null) {
                Scanner words = new Scanner(stopWordsLine);
                while (words.hasNext()) {          // hasNext() also copes with blank lines
                    stopWords.add(words.next());   // add each stop word to the set
                }
                words.close();
            }
        }

        // Scan the input file word by word and report every stop word found.
        try (BufferedReader input = new BufferedReader(new FileReader("Words.txt"))) {
            String line;
            while ((line = input.readLine()) != null) {
                Scanner lineReader = new Scanner(line);
                while (lineReader.hasNext()) {
                    String token = lineReader.next();
                    if (stopWords.contains(token)) {   // set lookup replaces the extra loop
                        System.out.println("removing " + token);
                    }
                }
                lineReader.close();
            }
        }
    }
}
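The answer leaves writing the output file to the reader. One possible way to finish that step is sketched below, assuming UTF-8 text, whitespace-separated tokens, and exact, case-sensitive matching. The class `StopWordWriter` and the method `writeWithoutStopWords` are hypothetical names, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

class StopWordWriter {

    // Copies `in` to `out` line by line, dropping every token found in stopWords.
    static void writeWithoutStopWords(Path in, Path out, Set<String> stopWords)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringBuilder kept = new StringBuilder();
                for (String token : line.trim().split("\\s+")) {
                    if (token.isEmpty() || stopWords.contains(token)) {
                        continue;                   // skip blanks and stop words
                    }
                    if (kept.length() > 0) {
                        kept.append(' ');           // single space between kept tokens
                    }
                    kept.append(token);
                }
                writer.write(kept.toString());
                writer.newLine();                   // keep the input's line structure
            }
        }
    }
}
```

It could be called as `StopWordWriter.writeWithoutStopWords(Paths.get("Words.txt"), Paths.get("Words_NoStopWords.txt"), stopWords)`, where the output file name is only an example. Each input line produces exactly one output line, so nothing is written twice.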

Let me know if I should elaborate on my code or on why I did what I did.

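What does your stop-word list look like? Please post an example of a "token". If it really is a whole line of words, equals() will never find a match.

@DiabolicWords I've updated the end of the question with what I mean by token and with part of the list. Thanks.

Thank you so much! This seems to work. I'm still unclear why none of the 237452354234 different things I tried worked. How were the stop words being added?

@IamWhoIam It was only reading the first line and then adding that first line to the set, array-style.

That's what I noticed too; whatever I tried had no effect. Thanks :)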