Indexing every word in a text file's content using Java
Tags: java, arrays, arraylist

I am trying to index every word in a text file using Java. By indexing I mean building an index of the words. This is my sample file (the actual file I want to index is much larger). This is the code I have tried so far:
ArrayList<String> ar = new ArrayList<String>();
ArrayList<String> sen = new ArrayList<String>();
ArrayList<String> fin = new ArrayList<String>();
ArrayList<String> word = new ArrayList<String>();
String content = new String(Files.readAllBytes(Paths.get("D:\\folder\\poem.txt")), StandardCharsets.UTF_8);
String[] split = content.split("\\s"); // Split text file content
for (String b : split) {
    ar.add(b); // added into the ar arraylist //ar contains every line of poem
}
FileInputStream fstream = null;
String answer = "";
fstream = new FileInputStream("D:\\folder\\poemt.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int count = 1;
int songnum = 0;
while ((strLine = br.readLine()) != null) {
    String text = strLine.replaceAll("[0-9]", ""); // Replace numbers from txt
    String nums = strLine.split("(?=\\D)")[0]; // get digits from strLine
    if (nums.matches(".*[0-9].*")) {
        songnum = Integer.parseInt(nums); // Parse string to int
    }
    String regex = ".*\\d+.*";
    boolean result = strLine.matches(regex);
    if (result == true) { // check if strLine contain digit
        count = 1;
    }
    answer = songnum + "." + count + "(" + text + ")";
    count++;
    sen.add(answer); // added songnum + line number and text to sen
}
for (int i = 0; i < sen.size(); i++) { // loop to match and get word+poem number+line number
    for (int j = 0; j < ar.size(); j++) {
        if (sen.get(i).contains(ar.get(j))) {
            if (!ar.get(j).isEmpty()) {
                String x = ar.get(j) + " - " + sen.get(i);
                x = x.replaceAll("\\(.*\\)", ""); // replace single line sentence
                String[] sp = x.split("\\s+");
                word.add(sp[0]); // each word in the poem is added to the word arraylist
                fin.add(x); // word+poem number+line number
            }
        }
    }
}
Set<String> listWithoutDuplicates = new LinkedHashSet<String>(fin); // Remove duplicates
fin.clear();
fin.addAll(listWithoutDuplicates);
Locale lithuanian = new Locale("ta");
Collator lithuanianCollator = Collator.getInstance(lithuanian); // sort array
Collections.sort(fin, lithuanianCollator);
System.out.println(fin);
(change in blossom. - 0.2,1.2, & the - 0.1,1.2, & then - 0.1,1.2)
I will first copy-paste the example's expected output, and then go over the code to find out how to change it:
Poem.txt
Expected output
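As @Pal noted in the comments, some words (the, and) are not indexed. These words are probably being ignored for indexing purposes.
The current output of the code is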
[blossom. - 0.2, blossom. - 1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1, the - 1.2, then - 0.1, then - 1.2, to - 0.2]
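So, assuming you fix the stopwords, you are actually very close. Your fin list contains word + poem number + line number, but it should contain word + *list* of poem number + line number. There are several ways to fix this. First, we need to remove the stopwords: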
// build the stopword-removal set "toIgnore" (once, before the loops)
String[] stopWords = new String[]{ "a", "the", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s : stopWords) toIgnore.add(s);

// inside the inner loop, in place of the unconditional fin.add(x):
if (!toIgnore.contains(sp[0])) fin.add(x); // only process non-ignored words
// was: fin.add(x);
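The stopword check above compares raw tokens, so a capitalized "The" or a token with attached punctuation such as "came," would not match an entry in the set. Below is a minimal sketch of normalizing tokens before the lookup. The class name `StopwordFilter`, its method names, the stopword list, and the punctuation-stripping rule are all assumptions made for illustration; they are not part of the code in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordFilter {
    // Hypothetical stopword list; extend it as needed.
    private static final Set<String> TO_IGNORE =
            new HashSet<>(Arrays.asList("a", "and", "the", "of", "more"));

    // Lowercase the token and strip leading/trailing characters that are not
    // letters, combining marks or digits (assumed rule, e.g. "came," -> "came").
    // Marks (\p{M}) are kept so that scripts with vowel signs are not damaged.
    public static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                .replaceAll("^[^\\p{L}\\p{M}\\p{N}]+|[^\\p{L}\\p{M}\\p{N}]+$", "");
    }

    // Return the normalized words of a line that are not stopwords.
    public static List<String> keepIndexable(String line) {
        List<String> kept = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            String word = normalize(token);
            if (!word.isEmpty() && !TO_IGNORE.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [day, came]
        System.out.println(keepIndexable("The day came, and more."));
    }
}
```

Whether punctuation should be stripped at all depends on how the index is meant to look, so treat the rule in `normalize` as one possible choice.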
With the stopwords removed and the locations of repeated words merged (the merging loop is shown further down), this gives me [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, to - 0.2]. I have not tuned the stopword list to get a perfect match, but that should be easy to do.

It is hard to read your code. Please reformat it. The test case is really helpful, though.
This question probably needs more upvotes to get enough attention.
Please elaborate on the word index. The test case shows that the words "And" and "more" are not considered. Why?
Does this answer your question?
@Pal Laden actually there is no space between the 0 and the word after it. I split the words on whitespace. My original file has the spaces. It is my mistake, I forgot to leave a space there :(
// merge the locations of repeated words; fin is sorted at this point,
// so entries for the same word are adjacent
List<String> fixed = new ArrayList<>();
String prevWord = "";
String prevLocs = "";
for (String s : fin) {
    String[] parts = s.split(" - ");
    if (parts[0].equals(prevWord)) {
        prevLocs += "," + parts[1];
    } else {
        if (!prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
        prevWord = parts[0];
        prevLocs = parts[1];
    }
}
// last iteration
if (!prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
System.out.println(fixed);
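The merge above relies on the list being sorted, so that entries for the same word sit next to each other. As an alternative sketch that does not depend on ordering, the entries can be grouped in a map from each word to its list of locations. `LocationMerger` and `merge` are hypothetical names; only the "word - location" entry format is taken from the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocationMerger {
    // Group "word - location" entries by word, keeping first-seen order.
    public static List<String> merge(List<String> entries) {
        Map<String, List<String>> byWord = new LinkedHashMap<>();
        for (String entry : entries) {
            String[] parts = entry.split(" - ", 2);
            if (parts.length < 2) {
                continue; // skip entries that do not have the expected format
            }
            byWord.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byWord.entrySet()) {
            merged.add(e.getKey() + " - " + String.join(",", e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> fin = Arrays.asList("blossom. - 0.2", "day - 0.1", "blossom. - 1.2");
        // prints [blossom. - 0.2,1.2, day - 0.1]
        System.out.println(merge(fin));
    }
}
```

A `LinkedHashMap` keeps the order in which words were first seen; if the result should be sorted, the input can be sorted beforehand or a `TreeMap` with a comparator can be used instead.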
// imports needed by this snippet
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.Collator;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// the statements below go inside a method that declares "throws IOException"

// build stopwords
String[] stopWords = new String[]{ "and", "a", "the", "to", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s : stopWords) toIgnore.add(s);

// prepare always-sorted, quick-lookup set of terms
// ("ta" is the language code for Tamil, despite the variable name)
Collator lithuanianCollator = Collator.getInstance(new Locale("ta"));
Map<String, List<String>> terms = new TreeMap<>((o1, o2) -> lithuanianCollator.compare(o1, o2));

// read lines; if line starts with number, store separately
Pattern countPattern = Pattern.compile("([0-9]+)\\.(.*)");
String content = new String(Files.readAllBytes(Paths.get("/tmp/poem.txt")), StandardCharsets.UTF_8);
int poemCount = 0;
int lineCount = 1;
for (String line : content.split("[\n\r]+")) {
    line = line.toLowerCase().trim(); // remove spaces on both sides

    // update locations
    Matcher m = countPattern.matcher(line);
    if (m.matches()) {
        poemCount = Integer.parseInt(m.group(1));
        lineCount = 1;
        line = m.group(2).trim(); // ignore number for word-finding purposes
    } else {
        lineCount++;
    }

    // read words in line, with locations already taken care of
    for (String word : line.split("\\s+")) {
        // skip empty tokens (e.g. from a line that holds only a number) and stopwords
        if (!word.isEmpty() && !toIgnore.contains(word)) {
            if (!terms.containsKey(word)) {
                terms.put(word, new ArrayList<>());
            }
            terms.get(word).add(poemCount + "." + lineCount);
        }
    }
}

// output formatting to match that of your code
List<String> output = new ArrayList<>();
for (Map.Entry<String, List<String>> e : terms.entrySet()) {
    output.add(e.getKey() + " - " + String.join(",", e.getValue()));
}
System.out.println(output);
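One detail the loop above leaves open is that a word occurring twice on the same line records the same location twice. The following sketch is a variation that stores locations in a set per word and works on an in-memory string, so it can be tried without a file. `PoemIndex`, `build`, and the sample text in `main` are hypothetical; the sample is not the poem from the question.

```java
import java.text.Collator;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PoemIndex {
    // A line such as "3.some text" starts poem number 3.
    private static final Pattern HEADER = Pattern.compile("([0-9]+)\\.(.*)");

    // Map each non-ignored word to the distinct "poem.line" locations it occurs at.
    public static Map<String, Set<String>> build(String content, Set<String> toIgnore) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("ta"));
        Map<String, Set<String>> terms = new TreeMap<>(collator);
        int poem = 0;
        int lineNo = 1;
        for (String line : content.split("[\n\r]+")) {
            line = line.toLowerCase().trim();
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                poem = Integer.parseInt(m.group(1));
                lineNo = 1;
                line = m.group(2).trim();
            } else {
                lineNo++;
            }
            for (String word : line.split("\\s+")) {
                if (word.isEmpty() || toIgnore.contains(word)) {
                    continue;
                }
                // LinkedHashSet drops repeated locations but keeps insertion order
                terms.computeIfAbsent(word, k -> new LinkedHashSet<>())
                     .add(poem + "." + lineNo);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String sample = "0.red rose red\nwhite rose\n1.red sky"; // made-up input
        Map<String, Set<String>> index = build(sample, Collections.singleton("white"));
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            System.out.println(e.getKey() + " - " + String.join(",", e.getValue()));
        }
    }
}
```

In this made-up sample, "red" appears twice on the first line but is recorded at location 0.1 only once.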