Java program suddenly slows down while indexing a corpus of k-grams

I have a problem that is bothering me. I am indexing a corpus of text files (17,000 files), and at the same time I am storing all k-grams (k-character-long slices of a word) of every word in a HashMap for later use:

public void insert( String token ) {
    // For example, "car" should result in "^c", "ca", "ar" and "r$" for a 2-gram index

    // Check if token has already been seen. If it has, all the
    // k-grams for it have already been added.
    if (term2id.get(token) != null) {
        return;
    }

    id2term.put(++lastTermID, token);
    term2id.put(token, lastTermID);

    // Is the word long enough? For example, "a" can be bigrammed and trigrammed
    // but not four-grammed. K must be <= token.length + 2; for "ab", K must be <= 4.
    List<KGramPostingsEntry> postings = null;
    if (K > token.length() + 2) {
        return;
    } else if (K == token.length() + 2) {
        // Insert the single k-gram "^<token>$" into the index
        String kgram = "^" + token + "$";
        postings = index.get(kgram);
        SortedSet<String> kgrams = new TreeSet<String>();
        kgrams.add(kgram);
        term2KGrams.put(token, kgrams);
        if (postings == null) {
            KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
            ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
            newList.add(newEntry);
            index.put(kgram, newList);
        }
        // No need to do anything if the posting already exists, so no else clause.
        // There is only one possible term in this case.
        // Return since we are done.
        return;
    } else {
        // We get here if there is more than one k-gram in our term.
        // Insert all k-grams of the token into the index.
        int start = 0;
        int end = start + K;
        // Add ^ and $ to the token.
        String wrappedToken = "^" + token + "$";
        int noOfKGrams = wrappedToken.length() - end + 1;
        // Get the k-grams
        String kGram;
        int startCurr, endCurr;
        SortedSet<String> kgrams = new TreeSet<String>();

        for (int i = 0; i < noOfKGrams; i++) {

            startCurr = start + i;
            endCurr = end + i;

            kGram = wrappedToken.substring(startCurr, endCurr);
            kgrams.add(kGram);

            postings = index.get(kGram);
            KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
            // If this k-gram has been seen before
            if (postings != null) {
                // Add this token to the existing postings list.
                // We can be sure the list doesn't already contain the token;
                // otherwise we would have terminated execution of this
                // function earlier.
                int lastTermInPostings = postings.get(postings.size() - 1).tokenID;
                if (lastTermID == lastTermInPostings) {
                    continue;
                }
                postings.add(newEntry);
                index.put(kGram, postings);
            }
            // If this k-gram has not been seen before
            else {
                ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
                newList.add(newEntry);
                index.put(kGram, newList);
            }
        }

        Clock c = Clock.systemDefaultZone();
        long timestart = c.millis();

        System.out.println(token);
        term2KGrams.put(token, kgrams);

        long timestop = c.millis();
        System.out.printf("time taken to put: %d\n", timestop - timestart);
        System.out.print("put ");
        System.out.println(kgrams);
        System.out.println();

    }

}
This looks fine to me; the puts don't seem to take long, and the k-grams (trigrams in this case) are correct.
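For reference, the sliding-window logic of the loop above can be reproduced in isolation. This is a minimal sketch; KGramDemo and kGrams are illustrative names, not part of the original index code:

```java
import java.util.ArrayList;
import java.util.List;

public class KGramDemo {
    // Extract all K-grams of a token wrapped in ^ and $,
    // mirroring the sliding window in insert() above.
    static List<String> kGrams(String token, int k) {
        String wrapped = "^" + token + "$";
        List<String> grams = new ArrayList<>();
        if (k > wrapped.length()) {
            return grams; // token too short for this k
        }
        for (int i = 0; i + k <= wrapped.length(); i++) {
            grams.add(wrapped.substring(i, i + k));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(kGrams("car", 2)); // prints [^c, ca, ar, r$]
    }
}
```

Note that a token of length k - 2 yields exactly one k-gram ("^token$"), matching the `K == token.length() + 2` branch in insert().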

But there is strange behavior in how fast my computer prints this information. In the beginning, everything is printed at super-high speed. But at around 15,000 files that speed stops, and instead my computer starts printing only a few lines at a time, which of course means that indexing the remaining 2,000 files of the corpus will take an eternity.

Another interesting thing I observed: after it had been printing erratically and slowly for a while as described above, I sent a keyboard interrupt (Ctrl+C), and it gave me this message:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.lang.StringLatin1.newString(StringLatin1.java:549)
sahandzarrinkoub@Sahands-MBP:~/Documents/Programming/Information Retrieval/lab3 2$ sh compile_all.sh
Note: ir/PersistentHashedIndex.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Does this mean I have run out of memory? Is that the problem? If so, that is surprising, because I have stored plenty of things in memory before, such as a HashMap containing the document IDs of every word in the corpus, a HashMap containing every place where every k-gram appears, and so on.

Please tell me what you think and what I can do to fix this.

To understand this, you must first know that Java does not allocate memory dynamically (or at least not without limit). The JVM is configured by default to start with a minimum heap size and a maximum heap size. When the maximum heap size is exceeded by some allocation, you get an OutOfMemoryError.

You can change the minimum and maximum heap sizes of an execution with the VM arguments -Xms and -Xmx, respectively. For example, for an execution with at least 2 GB but at most 4 GB of heap:

java -Xms2g -Xmx4g ...
You can find more options in the official JVM documentation.
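To verify which limits the JVM is actually running with, the Runtime API can be queried from inside the program; this is a small sketch using only standard-library calls:

```java
public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // maxMemory() reflects the -Xmx limit the JVM was started with;
        // totalMemory() is the heap currently reserved, freeMemory() the
        // unused part of that reservation.
        System.out.println("max heap:   " + rt.maxMemory() / mb + " MB");
        System.out.println("total heap: " + rt.totalMemory() / mb + " MB");
        System.out.println("free heap:  " + rt.freeMemory() / mb + " MB");
    }
}
```

Running this with and without -Xmx makes it easy to confirm the flag is being picked up.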


However, before changing the heap memory, take a close look at your system resources, especially whether the system starts swapping. If your system swaps, a larger heap size may let the program run longer, but performance will be just as bad. In that case the only viable options are to optimize the program to use less memory or to upgrade the machine's RAM.
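One hypothetical way to reduce the memory footprint of an index like the one in the question is to keep only a single String instance per distinct k-gram: in modern JVMs every substring() call allocates a fresh copy of the characters, so an index over 17,000 files can hold many duplicate copies of the same few k-gram strings. A minimal sketch of such a canonicalizing map (KGramDedup and canon are illustrative names, not part of the question's code):

```java
import java.util.HashMap;
import java.util.Map;

public class KGramDedup {
    // Hypothetical canonicalizer: maps each distinct k-gram string to one
    // shared instance, so the index stores a reference instead of a fresh
    // substring copy for every occurrence.
    private static final Map<String, String> canonical = new HashMap<>();

    static String canon(String s) {
        // computeIfAbsent stores the first instance seen and returns it
        // for every later, equal key.
        return canonical.computeIfAbsent(s, k -> k);
    }

    public static void main(String[] args) {
        String a = canon(new String("ab"));
        String b = canon(new String("ab"));
        // Both references now point to the same instance.
        System.out.println(a == b); // prints true
    }
}
```

Whether this helps in practice depends on how many duplicate k-gram strings are actually retained; profiling the heap first (e.g. with a heap dump) is the safer starting point.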

An OutOfMemoryError means you have run out of memory, yes. You can control the heap memory with java -Xms2048m -Xmx4096m ... (this sets the heap size to a minimum of 2048 MB and a maximum of 4096 MB). You can find more information in the documentation. But keep an eye on the actual physical memory: if the system starts swapping, a larger heap will not help.