Sudden slowdown and java.lang.OutOfMemoryError during a Java string search

java string search memory

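I am writing a program that finds patterns in RNA sequences, and it mostly works. To find "patterns" in the sequences, I generate a set of candidate patterns and scan the input file of all sequences for each of them (there is more to the algorithm, but this is the part that breaks). The candidate patterns have a length specified by the user.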

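This works for all pattern lengths up to 8 characters. At length 9, the program runs for a long time and then throws java.lang.OutOfMemoryError. After some debugging, I found that the weak point is the pattern generation method: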

/* Get elementary pattern (ep) substrings, to later combine into full patterns */
public static void init_ep_subs(int length) {

ep_subs = new ArrayList<Substring>(); // clear static ep_subs data field

/* ep subs are of the form C1...C2...C3 where C1, C2, C3 are characters in the
   alphabet and the whole length of the string is equal to the input parameter
   'length'. The number of dots varies for different lengths.
The middle character C2 can occur instead of any dot, or not at all.*/

for (int i = 1; i < length-1; i++) { // for each potential position of C2

    // for each alphabet character to be C1
    for (int first = 0; first < alphabet.length; first++) { 

    // for each alphabet character to be C3
    for (int last = 0; last < alphabet.length; last++) {

        // make blank pattern, i.e. no C2
        Substring s_blank = new Substring(-1, alphabet[first],
                          '0', alphabet[last]);

        // get its frequency in the input string
        s_blank.occurrences = search_sequences(s_blank.toString());

        // if blank ep is found frequently enough in the input string, store it
        if (s_blank.frequency()>=nP) ep_subs.add(s_blank);

        // when C2 is present, for each character it could be
        for (int mid = 0; mid < alphabet.length; mid++) {

        // make pattern C1,C2,C3
        Substring s = new Substring(i, alphabet[first],
                        alphabet[mid],
                        alphabet[last]);

        // search input string for pattern s
        s.occurrences = search_sequences(s.toString());

        // if s is frequent enough, store it
        if (s.frequency()>=nP) ep_subs.add(s);
        }
    }
    }
}
}
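
To get a feel for how much work the nested loops do, the following sketch counts the search_sequences calls they make for a given pattern length. It is only a reading of the loop structure above, not part of the original program: the class name EpCount, its method names, and the four-letter alphabet size are assumptions made for illustration.

```java
class EpCount {
    /** Number of search_sequences calls made by the nested loops in init_ep_subs. */
    static long searchCalls(int length, int alphabetSize) {
        long positions = Math.max(0, length - 2);            // i = 1 .. length-2
        long endPairs = (long) alphabetSize * alphabetSize;  // choices of C1 and C3
        long perPair = 1 + alphabetSize;                     // one blank pattern + one per C2
        return positions * endPairs * perPair;
    }

    /** Number of distinct pattern strings among those calls. */
    static long distinctPatterns(int length, int alphabetSize) {
        long positions = Math.max(0, length - 2);
        if (positions == 0) {
            return 0; // the outer loop never runs for length < 3
        }
        long endPairs = (long) alphabetSize * alphabetSize;
        // The blank pattern C1...C3 does not depend on i, so it is distinct
        // only once per (C1, C3) pair even though it is searched for every i.
        return endPairs * (1 + positions * alphabetSize);
    }

    public static void main(String[] args) {
        int alphabetSize = 4; // assumption: a four-letter nucleotide alphabet
        for (int length = 3; length <= 9; length++) {
            System.out.println(length + ": "
                    + searchCalls(length, alphabetSize) + " calls, "
                    + distinctPatterns(length, alphabetSize) + " distinct patterns");
        }
    }
}
```

Under these assumptions, length 9 gives 560 calls but only 464 distinct pattern strings. The blank pattern is rebuilt and searched again for every position i, and a frequent blank pattern is added to ep_subs once per i, each time with its own occurrences list, so moving the blank-pattern search out of the i loop is one candidate simplification.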

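Here is what happens: when I time the calls to search_sequences, they start out at around 40-100 ms each and stay that way for the first patterns searched. Then, after a few hundred patterns (around 'C.....G.C'), the calls suddenly start taking about ten times as long, 1000-2000 ms. After that, the times increase steadily until they reach about 12000 ms (around 'C......TA'), at which point the program gives the following error: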

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at java.util.regex.Matcher.toMatchResult(Matcher.java:232)
    at java.util.Scanner.match(Scanner.java:1270)
    at java.util.Scanner.hasNextLine(Scanner.java:1478)
    at PatternFinder4.search_sequences(PatternFinder4.java:217)
    at PatternFinder4.init_ep_subs(PatternFinder4.java:256)
    at PatternFinder4.main(PatternFinder4.java:62)

Here is the search_sequences method:

/* Searches the input string 'sequences' for occurrences of the parameter string 'sub' */
public static ArrayList<int[]> search_sequences(String sub) {

/* arraylist returned holding int arrays with coordinates of the places where 'sub'
 was found, i.e. {l,i} l = line number, i = index within line */
ArrayList<int[]> occurrences = new ArrayList<int[]>();
s = new Scanner(sequences);
int line_index = 0;

String line = "";
while (s.hasNextLine()) {
    line = s.nextLine();
    pattern = Pattern.compile(sub);
    matcher = pattern.matcher(line);
    pattern = null; // all the =nulls were intended to help memory management, had no effect

    int index = 0;

    // for each occurrence of 'sub' in the line being scanned
    while (matcher.find(index)) {
    int start = matcher.start(); // get the index of the next occurrence
    int[] occurrence = {line_index, start}; // make up the coordinate array
    occurrences.add(occurrence); // store that occurrence
    index = start+1; // start looking from after the last occurrence found
    }
    matcher=null;
    line=null;
    line_index++;

}
s=null;

return occurrences;
}
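
For comparison, here is a minimal sketch of the same search that compiles the Pattern once per call instead of once per line, and iterates over lines already held in memory instead of creating a new Scanner each time. The class LineSearch and the `lines` parameter are hypothetical. It assumes the input is small enough to keep as a List<String>, and whether this changes anything about the OutOfMemoryError is not established here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineSearch {
    /** Returns {lineIndex, startIndex} for every (possibly overlapping) match of 'sub'. */
    static List<int[]> searchSequences(List<String> lines, String sub) {
        Pattern pattern = Pattern.compile(sub); // compiled once per call, not once per line
        List<int[]> occurrences = new ArrayList<>();
        for (int lineIndex = 0; lineIndex < lines.size(); lineIndex++) {
            String line = lines.get(lineIndex);
            Matcher matcher = pattern.matcher(line);
            int from = 0;
            // find(int) resets the matcher and searches from 'from',
            // so overlapping matches are reported as in the original loop.
            while (from <= line.length() && matcher.find(from)) {
                occurrences.add(new int[] {lineIndex, matcher.start()});
                from = matcher.start() + 1;
            }
        }
        return occurrences;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> lines = Arrays.asList("ACGCACGT", "TTTT", "CAAG");
        for (int[] hit : searchSequences(lines, "C..G")) {
            System.out.println("line " + hit[0] + ", index " + hit[1]);
        }
    }
}
```

The returned list has the same {line, index} layout as the original, so each stored pattern still keeps one small array per occurrence alive.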

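I have tried the program on two computers of different speeds. The actual time each search_sequences call takes is shorter on the faster machine, but the relative times are the same: at about the same number of iterations, search_sequences starts taking ten times as long to complete.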

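I have tried googling the memory efficiency and speed of different input streams (BufferedReader and so on), but the general consensus seems to be that they are roughly equivalent to Scanner. Does anyone have suggestions about what this bug is, or how I could try to track it down myself?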

If anyone wants to see more of the code, just ask.
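
One way to investigate a slowdown like this is to log heap usage next to the timing of each call. The sketch below does that with the standard Runtime methods. HeapProbe and its methods are hypothetical helpers, the figures are approximate, and the loop in main only simulates a growing list of retained results. If used heap climbs toward the maximum as more patterns are stored, the garbage collector may be running more and more often, which would be consistent with calls slowing down before the error, though that is an assumption to verify rather than a diagnosis.

```java
import java.util.ArrayList;
import java.util.List;

class HeapProbe {
    /** Bytes of heap currently in use, as reported by the JVM (approximate). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs one task and returns a report line with elapsed time and heap figures. */
    static String measure(String label, Runnable task) {
        long start = System.nanoTime();
        task.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long usedMb = usedHeapBytes() / (1024 * 1024);
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return label + ": " + elapsedMs + " ms, heap " + usedMb + " / " + maxMb + " MB";
    }

    public static void main(String[] args) {
        // Simulated workload: each round retains more small arrays,
        // similar in shape to occurrence lists that are never released.
        List<int[]> retained = new ArrayList<>();
        for (int round = 0; round < 5; round++) {
            System.out.println(measure("round " + round, () -> {
                for (int k = 0; k < 200_000; k++) {
                    retained.add(new int[] {k, k});
                }
            }));
        }
    }
}
```

In the real program, the task passed to measure would be a single search_sequences call, with the pattern string as the label.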

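EDIT:

1 - The input file 'sequences' contains 1000 protein sequences (one per line) of varying lengths, around a few hundred characters each. I should also mention that this program will /only ever need to work/ for patterns up to length 9.

2 - Here is the Substring class used in the code above: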

static class Substring {
int residue; // position of the middle character C2
char front, mid, end; // alphabet characters for C1, C2 and C3
ArrayList<int[]> occurrences; // list of positions the substring occurs in 'sequences'
String string; // string representation of the substring

public Substring(int inresidue, char infront, char inmid, char inend) {
    occurrences = new ArrayList<int[]>();
    residue = inresidue;
    front = infront;
    mid = inmid;
    end = inend;
    setString(); // makes the string representation using characters and their positions
}

/* gets the frequency of the substring given the places it occurs in 'sequences'. 
   This only counts the substring /once per line it occurs in/. */
public int frequency() {
    return PatternFinder.frequency(occurrences);
}

public String toString() {
    return string;
}

/* makes the string representation using the substring's characters and their positions */
private void setString() {
    if (residue>-1) {
    String left_mid = "";
    for (int j = 0; j < residue-1; j++) left_mid += ".";
    String right_mid = "";
    for (int j = residue+1; j < length-1; j++) right_mid += ".";
    string = front + left_mid + mid + right_mid + end;
    } else {
    String mid = "";
    for (int i = 0; i < length-2; i++) mid += ".";
    string = front + mid + end;
    }
}
 }
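
As a cross-check of setString, the sketch below builds the same string with a StringBuilder and takes the pattern length as an explicit parameter instead of reading an outer `length` field. EpString and patternString are hypothetical names, and the dot counts are taken directly from the loops above.

```java
class EpString {
    /** Builds C1, dots, optional C2, dots, C3 with a total length of 'length'. */
    static String patternString(int length, int residue, char front, char mid, char end) {
        StringBuilder sb = new StringBuilder(length);
        sb.append(front);
        if (residue > -1) {
            appendDots(sb, residue - 1);          // dots between C1 and C2
            sb.append(mid);                       // C2 lands at index 'residue'
            appendDots(sb, length - 2 - residue); // dots between C2 and C3
        } else {
            appendDots(sb, length - 2);           // blank pattern: no C2
        }
        return sb.append(end).toString();
    }

    private static void appendDots(StringBuilder sb, int count) {
        for (int k = 0; k < count; k++) {
            sb.append('.');
        }
    }

    public static void main(String[] args) {
        System.out.println(patternString(9, 6, 'C', 'G', 'C'));
        System.out.println(patternString(9, -1, 'C', '0', 'A'));
    }
}
```

With length 9 and residue 6 this reproduces 'C.....G.C', one of the patterns quoted in the timing description.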

And the static frequency method called by Substring.frequency() above:

public static int frequency(ArrayList<int[]> occurrences) {
    HashSet<String> lines_present = new HashSet<String>();
    for (int[] occurrence : occurrences) {
        lines_present.add(new String(occurrence[0]+""));
    }
    return lines_present.size();
}
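
The same count can be taken without building a String for every occurrence. This is a minimal sketch with a hypothetical class name, LineFrequency. It collects the line indices as boxed integers and should return the same value as the method above for the same input.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LineFrequency {
    /** Counts how many distinct line indices appear among {line, index} occurrences. */
    static int frequency(List<int[]> occurrences) {
        Set<Integer> linesPresent = new HashSet<>();
        for (int[] occurrence : occurrences) {
            linesPresent.add(occurrence[0]); // autoboxed line index, no String is built
        }
        return linesPresent.size();
    }

    public static void main(String[] args) {
        // Made-up occurrences: two hits on line 0 and one on line 2.
        List<int[]> occurrences = new ArrayList<>();
        occurrences.add(new int[] {0, 3});
        occurrences.add(new int[] {0, 7});
        occurrences.add(new int[] {2, 0});
        System.out.println(frequency(occurrences));
    }
}
```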