Indexing each word in the contents of a text file using Java

Tags: java, arrays, arraylist

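I am trying to index every word in a text file using Java.

By "index" I mean an index of the words in the file.

Here is my sample file (the actual file I want to index is much larger).

Here is the code I have tried so far: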

ArrayList<String> ar = new ArrayList<String>();
ArrayList<String> sen = new ArrayList<String>();
ArrayList<String> fin = new ArrayList<String>();
ArrayList<String> word = new ArrayList<String>();
String content = new String(Files.readAllBytes(Paths.get("D:\\folder\\poem.txt")), StandardCharsets.UTF_8);

String[] split = content.split("\\s"); // Split text file content
for(String b:split) {
    ar.add(b); // added into the ar arraylist //ar contains every line of poem
}
String answer = "";
FileInputStream fstream = new FileInputStream("D:\\folder\\poemt.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int count = 1;
int songnum = 0;

while((strLine=br.readLine())!=null) {
    String text = strLine.replaceAll("[0-9]", ""); // Replace numbers from txt
    String nums = strLine.split("(?=\\D)")[0]; // get digits from strLine
    if (nums.matches(".*[0-9].*")) {
        songnum = Integer.parseInt(nums); // Parse string to int
    }
    String regex = ".*\\d+.*";
    boolean result = strLine.matches(regex);
    if (result == true) { // check if strLine contain digit
        count = 1;
    }
    answer = songnum + "." + count + "(" + text + ")";
    count++;
    sen.add(answer); // added songnum + line number and text to sen
}

for(int i = 0;i<sen.size();i++) { // loop to match and get word+poem number+line number
    for (int j = 0; j < ar.size(); j++) {
        if (sen.get(i).contains(ar.get(j))) {
            if (!ar.get(j).isEmpty()) {
                String x = ar.get(j) + " - " + sen.get(i);
                x = x.replaceAll("\\(.*\\)", ""); // replace single line sentence
                String[] sp = x.split("\\s+");
                word.add(sp[0]); // each word in the poem is added to the word arraylist
                fin.add(x); // word+poem number+line number
            }
        }
    }
}
Set<String> listWithoutDuplicates = new LinkedHashSet<String>(fin); // Remove duplicates
fin.clear();
fin.addAll(listWithoutDuplicates);
Locale lithuanian = new Locale("ta");
Collator lithuanianCollator = Collator.getInstance(lithuanian); // sort array
Collections.sort(fin,lithuanianCollator);
System.out.println(fin);   


    (change in blossom. - 0.2,1.2, &  the - 0.1,1.2, & then - 0.1,1.2)

I'll start by copy-pasting the expected output of the example, and then go through the code to see how to change it:

Poem.txt

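Expected output

As @Pal Laden noted in the comments, some words (such as "the") are not indexed. These words are probably ignored for indexing purposes.

The current output of your code is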

[blossom. - 0.2, blossom. - 1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1, the - 1.2, then - 0.1, then - 1.2, to - 0.2]
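So, assuming you fix the stopwords, you are actually very close already. Your fin list contains word + poem number + line number, but it should contain word + a *list* of poem number + line number. There are several ways to fix this. First, we need to remove the stopwords: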

// build stopword-removal set "toIgnore"
String[] stopWords = new String[]{ "a", "the", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);

if ( ! toIgnore.contains(sp[0])) fin.add(x); // only process non-ignored words
// was: fin.add(x); 

This gives me
[blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, to - 0.2]
I have not tuned the stopword list to get a perfect match, but that should be easy to do.
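
To illustrate the "word + list of locations" shape, here is a minimal sketch that groups entries with a map. It is one possible way, not necessarily the one the answer intends. The class name LocationMerger and the method name merge are made up for illustration; the input is the current output quoted above, with no stopword filtering applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original code.
class LocationMerger {

    // Groups "word - location" entries by word, keeping the order in which
    // each word is first seen, and joins the locations with commas.
    static List<String> merge(List<String> entries) {
        Map<String, List<String>> byWord = new LinkedHashMap<>();
        for (String entry : entries) {
            // Split on the first " - " only: left is the word, right the location.
            String[] parts = entry.split(" - ", 2);
            if (parts.length < 2) {
                continue; // skip entries that do not have the expected shape
            }
            byWord.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byWord.entrySet()) {
            merged.add(e.getKey() + " - " + String.join(",", e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> fin = Arrays.asList(
                "blossom. - 0.2", "blossom. - 1.2", "came, - 0.1", "day - 0.1",
                "painful - 1.1", "remain - 0.2", "the - 0.1", "the - 1.2",
                "then - 0.1", "then - 1.2", "to - 0.2");
        // prints [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1,
        //         remain - 0.2, the - 0.1,1.2, then - 0.1,1.2, to - 0.2] on one line
        System.out.println(merge(fin));
    }
}
```

Because the map preserves insertion order, the sorted order of fin carries over to the merged list.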

Your code is hard to read. Please reformat it. The test case is really helpful, though.
This question may need more upvotes to get enough attention.
Please elaborate on the word index. The test case shows that the words "And" and "more" are not considered. Why?
Does this answer your question?
@Pal Laden Actually there is no space between 0 and [...]. I use [...] to split the words. My original file has the space. That was my mistake, I forgot to leave a space there :(

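// Merge consecutive entries for the same word into one entry with a list of locations.
// Relies on fin being sorted, so that entries for the same word are adjacent.
List<String> fixed = new ArrayList<>();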
String prevWord = "";
String prevLocs = "";
for (String s : fin) {
    String[] parts = s.split(" - ");
    if (parts[0].equals(prevWord)) {
        prevLocs += "," + parts[1];
    } else {
        if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
        prevWord = parts[0];
        prevLocs = parts[1];
    }
}
// last iteration
if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);

System.out.println(fixed);

// A single-pass version that builds the index directly while reading the file
// build stopwords
String[] stopWords = new String[]{ "and", "a", "the", "to", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);

// prepare always-sorted, quick-lookup set of terms
Collator lithuanianCollator = Collator.getInstance(new Locale("ta"));
Map<String, List<String>> terms = new TreeMap<>((o1, o2) -> lithuanianCollator.compare(o1, o2));

// read lines; if line starts with number, store separately
Pattern countPattern = Pattern.compile("([0-9]+)\\.(.*)");
String content = new String(Files.readAllBytes(Paths.get("/tmp/poem.txt")), StandardCharsets.UTF_8);
int poemCount = 0;
int lineCount = 1;
for (String line: content.split("[\n\r]+")) {
    line = line.toLowerCase().trim(); // remove spaces on both sides

    // update locations
    Matcher m = countPattern.matcher(line);
    if (m.matches()) {
        poemCount = Integer.parseInt(m.group(1));
        lineCount = 1;
        line = m.group(2); // ignore number for word-finding purposes
    } else {
        lineCount ++;
    }

    // read words in line, with locations already taken care of
    for (String word: line.split(" ")) {
        if ( ! toIgnore.contains(word)) {
            if ( ! terms.containsKey(word)) {
                terms.put(word, new ArrayList<>());
            }
            terms.get(word).add(poemCount + "." + lineCount);
        }
    }
}

// output formatting to match that of your code
List<String> output = new ArrayList<>();
for (Map.Entry<String, List<String>> e: terms.entrySet()) {
    output.add(e.getKey() + " - " + String.join(",", e.getValue()));
}
System.out.println(output);
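
The single-pass approach can also be wrapped in a method that takes the text as a string, which makes it easy to try on a small sample without a file. This is only a sketch under a few assumptions: the class name WordIndexer, the method names and the four-line sample text are made up for illustration (the asker's poem.txt is not reproduced here), and keys are sorted with natural String ordering instead of the Collator used above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the single-pass indexing idea.
class WordIndexer {

    // A line such as "0.some text" starts a new poem; group 1 is the poem number.
    private static final Pattern POEM_START = Pattern.compile("([0-9]+)\\.(.*)");

    // Maps each non-stopword to its "poem.line" locations, sorted by word.
    static Map<String, List<String>> buildIndex(String content, Set<String> stopWords) {
        Map<String, List<String>> terms = new TreeMap<>(); // natural ordering (assumption)
        int poem = 0;
        int lineNo = 1;
        for (String raw : content.split("[\n\r]+")) {
            String line = raw.toLowerCase().trim();
            if (line.isEmpty()) {
                continue; // blank lines are not counted
            }
            Matcher m = POEM_START.matcher(line);
            if (m.matches()) {
                poem = Integer.parseInt(m.group(1));
                lineNo = 1;                // first line of a new poem
                line = m.group(2).trim();  // drop the number before splitting into words
            } else {
                lineNo++;
            }
            for (String word : line.split("\\s+")) {
                if (word.isEmpty() || stopWords.contains(word)) {
                    continue;
                }
                terms.computeIfAbsent(word, k -> new ArrayList<>()).add(poem + "." + lineNo);
            }
        }
        return terms;
    }

    // Formats the index as "word - loc1,loc2" entries, matching the output style above.
    static List<String> format(Map<String, List<String>> terms) {
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : terms.entrySet()) {
            output.add(e.getKey() + " - " + String.join(",", e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        Set<String> stopWords =
                new HashSet<>(Arrays.asList("and", "a", "the", "to", "of", "more"));
        // Made-up sample text, not the asker's file.
        String sample = "0.the day came,\nto remain blossom.\n1.painful more\nthen the blossom.";
        // prints [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1,
        //         remain - 0.2, then - 1.2] on one line
        System.out.println(format(buildIndex(sample, stopWords)));
    }
}
```

Punctuation stays attached to the words ("blossom.", "came,"), as in the outputs shown earlier; stripping it would be a separate step.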