Filtering strings efficiently in Java


java string eclipse indexing


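I'm currently trying to build a mini search engine. My goal is to index a set of files in a hashmap, but first I need to perform a few operations: converting uppercase letters to lowercase, removing all unnecessary words, and removing all characters other than a-z/A-Z. My current implementation looks like this: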

String article = "";

for (File file : dir.listFiles()) { //for each file (001.txt, 002.txt...)
        Scanner s = null;
        try {
            s = new Scanner(file);
            while (s.hasNext())
                article += s.next().toLowerCase(Locale.ROOT) + " "; //converting all characters to lower case
            article = article.replaceAll(delimiters.get()," "); //removing punctuations (?, -, !, * etc...)

            String[] splittedWords = article.split(" ");  //splitting each word into a string array
            for(int i = 0; i < splittedWords.length; i++) {
                s = new Scanner(stopwords);
                boolean flag = true;
                while(s.hasNextLine())
                    if (splittedWords[i].equals(s.nextLine())) { //comparing each word with all the stop words (words like a, the, already, these etc...) taken from another big txt file and removing them, because we dont need to fill our map with unnecessary words, to provide faster search times later on
                        flag = false;
                        break;
                    }
                if(flag) map.put(splittedWords[i], file.getName()); //if current word in splittedWords array does not match any stop word, put it in the hashmap        


            }
            s.close();


        } catch (FileNotFoundException e) {

            e.printStackTrace();
        }
        s.close();
        System.out.println(file);
    }

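This is just one block of my code and it may have missing parts; I have roughly explained my algorithm in the comments. Using the .contains method to check whether stopWords contains any currentWord is faster, but it fails to map words like "death", because "death" contains "at" from the stop-word list.

I'm doing my best to make this more efficient, but I haven't made much progress. Each file of about 300 words takes about 3 seconds to index, which is not ideal given that I have tens of thousands of files. Any ideas on how to improve the algorithm so that it runs faster?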

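There are a few improvements:

First, do not use the new Scanner(File) constructor, because it uses unbuffered I/O. Small disk reads (especially on an HDD) are very inefficient. Instead, use for example a BufferedInputStream with a 64 KiB (65536-byte) buffer: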

try (Scanner s = new Scanner(new BufferedInputStream(new FileInputStream(f), 65536))) {
    // your code
}

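Second: your computer most likely has a multi-core CPU, so you can scan multiple files in parallel. To do that, you must make sure you use a thread-safe map. Change the definition of your map to: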

Map<String,String> map = new ConcurrentHashMap<>();

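Depending on the number of CPU cores in your system, several files are then processed at the same time. Especially if you process a large number of files, this will greatly reduce the running time of your program.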
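The code that actually distributes the files across cores is not shown above. A minimal sketch of one way to do it, assuming a parallel stream over a list of paths, could look like this. The class and method names (ParallelIndexer, indexFile, indexAll) and the "[^A-Za-z]+" delimiter are illustrative assumptions, not part of the original answer.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.concurrent.ConcurrentHashMap;

class ParallelIndexer {
    // Hypothetical helper: reads one file through a 64 KiB buffer and
    // stores every lowercased word in the shared map.
    static void indexFile(Path file, Map<String, String> map) {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), 65536);
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("[^A-Za-z]+")) {
            while (s.hasNext()) {
                String token = s.next();
                if (!token.isEmpty()) { // defensive: skip empty tokens
                    map.put(token.toLowerCase(Locale.ROOT), file.getFileName().toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Indexes all files; the parallel stream spreads the work over the
    // common fork-join pool, whose size depends on the available cores.
    static Map<String, String> indexAll(List<Path> files) {
        Map<String, String> map = new ConcurrentHashMap<>();
        files.parallelStream().forEach(f -> indexFile(f, map));
        return map;
    }
}
```

As in the question, a Map<String,String> keeps only one file name per word, and with parallel processing it is not deterministic which file wins when a word occurs in several files.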

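Finally, your implementation is rather complicated. You use the Scanner's output to build a new string and then split that string again. Instead, it is better to configure the Scanner to use the delimiters you want directly: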

try (Scanner s = new Scanner(....).useDelimiter("[ ,\\!\\-\\.\\?\\*]")) {

You can then use the tokens produced by the Scanner directly, without having to build the article string and split it again.
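As a rough illustration of consuming the tokens directly, here is a small sketch. It assumes a slightly different delimiter from the one above: whitespace is added to the character class and a + quantifier collapses runs of delimiters, so that text like "Hello, World" does not produce empty tokens. The class name DelimiterTokens and the method tokens are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class DelimiterTokens {
    // Returns the lowercased tokens of the source, letting the Scanner
    // do the splitting instead of building and re-splitting a String.
    static List<String> tokens(Readable source) {
        List<String> out = new ArrayList<>();
        // Assumed delimiter: one or more of whitespace , ! - . ? *
        try (Scanner s = new Scanner(source).useDelimiter("[\\s,!\\-.?*]+")) {
            while (s.hasNext()) {
                String t = s.next();
                if (!t.isEmpty()) { // defensive: skip empty tokens
                    out.add(t.toLowerCase(Locale.ROOT));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(new StringReader("Hello, World! Is it -- fast?")));
    }
}
```

Each token can be checked against the stop words and put into the map as soon as it is read, so no intermediate article string or array is needed.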

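What is your reason for implementing a search engine yourself?

For production I would recommend an existing solution, Apache Lucene, which fits your task exactly.

If you are just practicing, there are a few standard points where your code can be improved.

Avoid concatenating strings with += inside a loop, as you do with article. It is better to create a regular expression for words and pass it to the Scanner:

Pattern p = Pattern.compile("[A-Za-z]+");
try (Scanner s = new Scanner(file)) {
    while (s.hasNext(p)) {
        String word = s.next(p);
        word = word.toLowerCase(Locale.ROOT);
        ...
    }
}

Put all the stop words into a hashmap and check each new word with the containsKey method.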

You are reading the stop-word file again for every source file. You can read the stop-word file once and keep the stop words in memory in a Set.
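A minimal sketch of that idea, assuming the stop-word file contains one word per line as in the question. The class name StopWords and the methods load and keep are hypothetical. Because a HashSet lookup compares whole words, "death" is kept even though it contains the stop word "at".

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Locale;
import java.util.Scanner;
import java.util.Set;

class StopWords {
    // Reads the stop words once (one per line) into a HashSet.
    static Set<String> load(Readable source) {
        Set<String> stop = new HashSet<>();
        try (Scanner s = new Scanner(source)) {
            while (s.hasNextLine()) {
                String line = s.nextLine().trim().toLowerCase(Locale.ROOT);
                if (!line.isEmpty()) {
                    stop.add(line);
                }
            }
        }
        return stop;
    }

    // True if the word should be indexed, i.e. it is not a stop word.
    // Set.contains matches the whole word, not a substring.
    static boolean keep(String word, Set<String> stop) {
        return !stop.contains(word);
    }

    public static void main(String[] args) {
        Set<String> stop = load(new StringReader("a\nthe\nat\n"));
        System.out.println(keep("death", stop));
        System.out.println(keep("at", stop));
    }
}
```

In the indexing loop, the set would be loaded once before iterating over the files, and each word then costs a single hash lookup instead of a full scan of the stop-word file.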
