Keeping track of punctuation and spacing when editing a file in Java
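I'm writing a program that removes consecutive duplicate words from a text file and then overwrites the file with the de-duplicated text. I know my current code can't handle a duplicate word that sits at the end of one line and the start of the next, because I read each line into an ArrayList, find the duplicates and remove them. Now that I've written it, though, I'm not sure this is a "good" way to do it, because I don't know how to write the result back out. I'm not sure how to keep track of the punctuation at the start and end of sentences, the correct spacing, and where the original text file had line breaks.

Is there a way to handle these things (spaces, punctuation and so on) with what I have so far, or do I need to redesign it? Another thing I thought I could do is return an array containing the indices of the words I need to remove, but I'm not sure that's any better. Thanks in advance. Anyway, here's my code: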
import java.io.*;
import java.util.*;

/**
 * Removes consecutive duplicate words from text files.
 * It accepts exactly one argument, which is either a text file
 * or a directory. It finds all text files in the directory and
 * its subdirectories and removes duplicate words from those files
 * as well. It replaces the original file.
 */
public class RemoveDuplicates {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("Program accepts one command-line argument. Exiting!");
            System.exit(1);
        }
        File f = new File(args[0]);
        if (!f.exists()) {
            System.out.println("Does not exist!");
        }
        else if (f.isDirectory()) {
            System.out.println("is directory");
        }
        else if (f.isFile()) {
            System.out.println("is file");
            String fileName = f.toString();
            RemoveDuplicates dup = new RemoveDuplicates(f);
            dup.showTextFile();
            List<String> noDuplicates = dup.doDeleteDuplicates();
            showTextFile(noDuplicates);
            //writeOutputFile(fileName, noDuplicates);
        }
        else {
            System.out.println("Shouldn't happen");
        }
    }

    /** Reads each line of the given .txt file into the lineOfWords list. */
    public RemoveDuplicates(File fin) {
        lineOfWords = new ArrayList<String>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader(fin))) {
            String s;
            while ((s = in.readLine()) != null) {
                lineOfWords.add(s);
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    /** Prints the file contents as they were read, one line per row. */
    public void showTextFile() {
        for (String s : lineOfWords) {
            System.out.println(s);
        }
    }

    /** Prints the given strings back to back, with no separator. */
    public static void showTextFile(List<String> list) {
        for (String s : list) {
            System.out.print(s);
        }
    }

    public List<String> doDeleteDuplicates() {
        List<String> noDup = new ArrayList<String>(); // List to be returned without duplicates
        // Go through each line and split it into words on runs of whitespace
        // and punctuation. Note that the delimiters themselves are discarded here.
        for (String s : lineOfWords) {
            String[] endString = s.split("[\\s\\p{Punct}]+");
            // add each non-empty word to the list
            for (String word : endString) {
                if (!word.isEmpty()) {
                    noDup.add(word);
                }
            }
        }
        // Remove a word whenever it equals its predecessor, ignoring case
        for (int i = 0; i < noDup.size() - 1; i++) {
            if (noDup.get(i).equalsIgnoreCase(noDup.get(i + 1))) {
                System.out.println("Removing: " + noDup.get(i + 1));
                noDup.remove(i + 1);
                i--; // re-check the same position in case of a third repeat
            }
        }
        return noDup;
    }

    /** Writes the words to fileName separated by single spaces, overwriting the file. */
    public static void writeOutputFile(String fileName, List<String> newData) {
        try (PrintWriter outputFile = new PrintWriter(new BufferedWriter(new FileWriter(fileName)))) {
            for (String str : newData) {
                outputFile.print(str + " ");
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    private List<String> lineOfWords;
}
How about something like this? In this example I'm assuming the comparison is case-insensitive.
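// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// \\b keeps the match on whole words, so "this is" is not treated as "is is"
Pattern p = Pattern.compile("\\b(\\w+) \\1\\b", Pattern.CASE_INSENSITIVE);
String line = "Hello hello this is a test test in order\norder to see if it deletes duplicates Duplicates words.";
Matcher m = p.matcher(line);
StringBuilder sb = new StringBuilder(1000);
int idx = 0;
while (m.find()) {
    // copy everything up to the end of the first occurrence of the word
    sb.append(line.substring(idx, m.end(1)));
    // skip the separating space and the repeated word
    idx = m.end();
}
// copy whatever follows the last match
sb.append(line.substring(idx));
System.out.println(sb.toString());
Here is the output:
Hello this is a test in order
order to see if it deletes duplicates words.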
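As the output shows, "order" at the end of the first line and "order" at the start of the second both survive, because the pattern only allows a single space between the two words. A minimal sketch of one way to cover the cross-line case from the question is below. It assumes the whole text is held in one String, and the class name `CrossLineDuplicates` and method `removeDuplicates` are made up for illustration. It also assumes that dropping the whitespace in front of a removed word is acceptable, which means a line break directly before a removed duplicate disappears while all other punctuation, spacing and line breaks are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrossLineDuplicates {
    // A whole word followed by one or more repeats of itself, where each
    // repeat is preceded by any run of whitespace (spaces, tabs, line breaks).
    // CASE_INSENSITIVE also makes the back-reference \1 ignore case.
    private static final Pattern DUPLICATE = Pattern.compile(
            "\\b(\\w+)(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    /** Keeps the first word of every run of repeated words and drops the repeats. */
    public static String removeDuplicates(String text) {
        Matcher m = DUPLICATE.matcher(text);
        // "$1" is the first occurrence, so its original capitalisation is kept
        return m.replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "Hello hello this is a test test in order\norder to see.";
        // prints: Hello this is a test in order to see.
        System.out.println(removeDuplicates(text));
    }
}
```

To apply this to a file, the contents could be read into a single String first (for example with `Files.readString` on Java 11 or later) and written back afterwards. Whether joining the two lines is the right behaviour when a duplicate spans a line break is a design decision this sketch does not settle.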
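Could you explain your code a bit more, starting from the sb.append part? I'm not sure exactly how it works.
The 1 in m.end(1) refers to the group enclosed in parentheses in the regular expression. m.end(1) returns the index just past the end of that matched group, while m.end() returns the index just past the end of the whole string matched by the pattern. Basically, I skip everything between m.end(1) and m.end(), because it is a copy of the string between m.start(1) and m.end(1). I don't use m.start(1) here because I don't think it's needed. Hope this helps.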
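To make the roles of m.start(1), m.end(1) and m.end() concrete, here is a small sketch that prints them for a single match. The class name `GroupIndexDemo`, the method `describe` and the sample input are made up for illustration, and the pattern is a whole-word, case-insensitive variant of the one in the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexDemo {
    /** Describes the first duplicate found in the input, or returns "no match". */
    public static String describe(String input) {
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b", Pattern.CASE_INSENSITIVE)
                .matcher(input);
        if (!m.find()) {
            return "no match";
        }
        // end(1) is the offset just past group 1; end() is just past the whole match
        return "start(1)=" + m.start(1) + " end(1)=" + m.end(1) + " end()=" + m.end()
                + " kept=[" + input.substring(0, m.end(1)) + "]"
                + " skipped=[" + input.substring(m.end(1), m.end()) + "]";
    }

    public static void main(String[] args) {
        // prints: start(1)=2 end(1)=6 end()=11 kept=[a test] skipped=[ test]
        System.out.println(describe("a test test here"));
    }
}
```

The skipped part is the separating space plus the repeated word, which is exactly what the loop in the answer leaves out when it sets idx to m.end().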