Indexing every word in a text file's contents using Java

java arrays


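I am trying to index every word in a text file using Java.

By "index" I mean a word index: for each word, the places where it occurs.

This is my sample file (the actual file I want to index is much larger)

This is the code I have tried so far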

ArrayList<String> ar = new ArrayList<String>();
ArrayList<String> sen = new ArrayList<String>();
ArrayList<String> fin = new ArrayList<String>();
ArrayList<String> word = new ArrayList<String>();
String content = new String(Files.readAllBytes(Paths.get("D:\\folder\\poem.txt")), StandardCharsets.UTF_8);

String[] split = content.split("\\s"); // Split text file content
for(String b:split) {
    ar.add(b); // added into the ar arraylist //ar contains every line of poem
}
String answer = "";
FileInputStream fstream = new FileInputStream("D:\\folder\\poemt.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int count = 1;
int songnum = 0;

while((strLine=br.readLine())!=null) {
    String text = strLine.replaceAll("[0-9]", ""); // Replace numbers from txt
    String nums = strLine.split("(?=\\D)")[0]; // get digits from strLine
    if (nums.matches(".*[0-9].*")) {
        songnum = Integer.parseInt(nums); // Parse string to int
    }
    String regex = ".*\\d+.*";
    boolean result = strLine.matches(regex);
    if (result == true) { // check if strLine contain digit
        count = 1;
    }
    answer = songnum + "." + count + "(" + text + ")";
    count++;
    sen.add(answer); // added songnum + line number and text to sen
}

for(int i = 0;i<sen.size();i++) { // loop to match and get word+poem number+line number
    for (int j = 0; j < ar.size(); j++) {
        if (sen.get(i).contains(ar.get(j))) {
            if (!ar.get(j).isEmpty()) {
                String x = ar.get(j) + " - " + sen.get(i);
                x = x.replaceAll("\\(.*\\)", ""); // replace single line sentence
                String[] sp = x.split("\\s+");
                word.add(sp[0]); // each word in the poem is added to the word arraylist
                fin.add(x); // word+poem number+line number
            }
        }
    }
}
Set<String> listWithoutDuplicates = new LinkedHashSet<String>(fin); // Remove duplicates
fin.clear();
fin.addAll(listWithoutDuplicates);
Locale lithuanian = new Locale("ta");
Collator lithuanianCollator = Collator.getInstance(lithuanian); // sort array
Collections.sort(fin,lithuanianCollator);
System.out.println(fin);   


    (change in blossom. - 0.2,1.2, &  the - 0.1,1.2, & then - 0.1,1.2)

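I'll start by copy-pasting the expected output for the example, and then go through the code to see how to change it:
Poem.txt
Expected output
As @Pal Laden noted in the comments, some words (the, and) are not being indexed. They are probably meant to be ignored for indexing purposes.
The current output of the code is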
[blossom. - 0.2, blossom. - 1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1, the - 1.2, then - 0.1, then - 1.2, to - 0.2]

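So, assuming you fix the stopwords, you are actually pretty close. Your fin array contains word + poem number + line number, but it should contain word + *list* of poem number + line number. There are several ways to fix this. First, we need to remove the stopwords: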
// build stopword-removal set "toIgnore" (needs java.util.HashSet and java.util.Set)
String[] stopWords = new String[]{ "a", "the", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);

// inside the inner loop, replacing the unconditional fin.add(x):
if ( ! toIgnore.contains(sp[0])) fin.add(x); // only process non-ignored words

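Together with merging the entries for repeated words, this gives me [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, to - 0.2]. I have not yet tweaked the stopword list to get a perfect match, but that should be easy to do.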
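To make the "word + list of locations" idea concrete, here is a minimal sketch that groups entries of the form `word - poem.line` by word. The class name `LocationGrouping`, the method `group`, and the sample entries are illustrative assumptions, not part of the original code, and the sketch sorts by natural `String` order rather than with a `Collator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper class, named here only for illustration.
class LocationGrouping {

    // Groups "word - poem.line" entries into word -> list of locations,
    // skipping malformed entries and any word contained in stopWords.
    static Map<String, List<String>> group(List<String> entries, Set<String> stopWords) {
        // TreeMap keeps keys sorted (natural String order, an assumption of this sketch)
        Map<String, List<String>> index = new TreeMap<>();
        for (String entry : entries) {
            String[] parts = entry.split(" - ", 2);
            if (parts.length < 2 || stopWords.contains(parts[0])) {
                continue;
            }
            index.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return index;
    }

    public static void main(String[] args) {
        // Sample entries in the same shape as the question's fin list
        List<String> fin = Arrays.asList(
                "blossom. - 0.2", "blossom. - 1.2", "day - 0.1", "the - 0.1", "the - 1.2");
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "the", "of", "more"));
        // Prints "blossom. - 0.2,1.2" and then "day - 0.1"
        group(fin, stopWords).forEach(
                (word, locations) -> System.out.println(word + " - " + String.join(",", locations)));
    }
}
```

Because the map does the grouping, this variant does not depend on the entries being sorted beforehand.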
Comments:
Your code is hard to read. Please reformat it. The test case really helps, though.
This question may need more votes to get enough attention.
Please elaborate on what you mean by word indexing. The test case shows that the words "And" and "more" are not considered. Why?
Does this answer your question?
@Pal Laden Actually, there is no space between 0 and … . I use … to split the words. My original file does have the space. It was my mistake, I forgot to leave a space there :(
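// From the answer: merge entries for the same word into one "word - loc1,loc2" entry.
// This relies on fin being sorted, so that equal words are adjacent.
List<String> fixed = new ArrayList<>();
String prevWord = "";
String prevLocs = "";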
for (String s : fin) {
    String[] parts = s.split(" - ");
    if (parts[0].equals(prevWord)) {
        prevLocs += "," + parts[1];
    } else {
        if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
        prevWord = parts[0];
        prevLocs = parts[1];
    }
}
// last iteration
if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);

System.out.println(fixed);

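// From the answer: an alternative that builds the whole index in a single pass.
// needs: java.text.Collator, java.nio.file.*, java.nio.charset.StandardCharsets, java.util.*, java.util.regex.*
// build stopwords
String[] stopWords = new String[]{ "and", "a", "the", "to", "of", "more", /*others*/ };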
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);

// prepare always-sorted, quick-lookup set of terms
// ("ta" is the ISO 639 code for Tamil; the variable name is kept from the question's code)
Collator lithuanianCollator = Collator.getInstance(new Locale("ta"));
Map<String, List<String>> terms = new TreeMap<>((o1, o2) -> lithuanianCollator.compare(o1, o2));

// read lines; if line starts with number, store separately
Pattern countPattern = Pattern.compile("([0-9]+)\\.(.*)");
String content = new String(Files.readAllBytes(Paths.get("/tmp/poem.txt")), StandardCharsets.UTF_8);
int poemCount = 0;
int lineCount = 1;
for (String line: content.split("[\n\r]+")) {
    line = line.toLowerCase().trim(); // remove spaces on both sides

    // update locations
    Matcher m = countPattern.matcher(line);
    if (m.matches()) {
        poemCount = Integer.parseInt(m.group(1));
        lineCount = 1;
        line = m.group(2); // ignore number for word-finding purposes
    } else {
        lineCount ++;
    }

    // read words in line, with locations already taken care of
    for (String word: line.trim().split("\\s+")) { // tolerate repeated or leading spaces
        if ( ! word.isEmpty() && ! toIgnore.contains(word)) {
            if ( ! terms.containsKey(word)) {
                terms.put(word, new ArrayList<>());
            }
            terms.get(word).add(poemCount + "." + lineCount);
        }
    }
}

// output formatting to match that of your code
List<String> output = new ArrayList<>();
for (Map.Entry<String, List<String>> e: terms.entrySet()) {
    output.add(e.getKey() + " - " + String.join(",", e.getValue()));
}
System.out.println(output);
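
For trying the single-pass approach without a file on disk, the following is a self-contained sketch that wraps the same logic in a method taking the text as a `String`. The class name `PoemIndexSketch`, the method `buildIndex`, and the sample poem text in `main` are assumptions made for illustration. The real Poem.txt is not shown above, so the sample only imitates its "number, dot, text" layout.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, named here only for illustration.
class PoemIndexSketch {

    // A line that opens a new poem: digits, a dot, then the first line of text.
    private static final Pattern POEM_START = Pattern.compile("([0-9]+)\\.(.*)");

    // Maps each non-stopword to its "poem.line" locations, in order of appearance.
    static SortedMap<String, List<String>> buildIndex(String content, Set<String> stopWords) {
        // Same locale tag as the original code ("ta"); keys are ordered by this collator.
        Collator collator = Collator.getInstance(Locale.forLanguageTag("ta"));
        SortedMap<String, List<String>> terms = new TreeMap<>(collator);
        int poem = 0;
        int lineNo = 0;
        for (String rawLine : content.split("[\\n\\r]+")) {
            String line = rawLine.toLowerCase().trim();
            if (line.isEmpty()) {
                continue;
            }
            Matcher m = POEM_START.matcher(line);
            if (m.matches()) {
                poem = Integer.parseInt(m.group(1));
                lineNo = 1;               // numbering restarts with each poem
                line = m.group(2).trim(); // drop the poem number before splitting into words
            } else {
                lineNo++;
            }
            for (String word : line.split("\\s+")) {
                if (word.isEmpty() || stopWords.contains(word)) {
                    continue;
                }
                terms.computeIfAbsent(word, k -> new ArrayList<>()).add(poem + "." + lineNo);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Made-up sample text; not the contents of the asker's file.
        String sample = "0.and then the day came,\nto remain in blossom.\n"
                + "1.more painful\nthen the blossom.";
        Set<String> stopWords = new HashSet<>(Arrays.asList("and", "the", "to", "in", "more"));
        buildIndex(sample, stopWords).forEach(
                (word, locations) -> System.out.println(word + " - " + String.join(",", locations)));
    }
}
```

Punctuation stays attached to words here (for example `blossom.`), as in the outputs quoted above. Stripping it before the lookup would be a separate design decision.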