Java: GC overhead limit exceeded when training OpenNLP NameFinderME

Tags: java, stream, garbage-collection, nlp, opennlp

I want to use NameFinderME to get probability scores for extracted names, but using the probs function with the provided model gives very poor probabilities. For example, "Scott F. Fitzgerald" scores around 0.5 (average log probability, exponentiated), while "North Japan" and "Executive Vice President, Corporate Relations and Chief Philanthropy Officer" both score above 0.9.
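The "average log probability, exponentiated" score quoted above is simply the geometric mean of the per-token probabilities that NameFinderME.probs() returns. A minimal sketch of that computation (the class name, method name, and sample probabilities here are ours, not OpenNLP's):

```java
public class Probabilities {

    // Geometric mean of per-token probabilities: exp of the average log prob.
    static double geometricMean(double[] probs) {
        double logSum = 0.0;
        for (double p : probs) {
            logSum += Math.log(p);
        }
        return Math.exp(logSum / probs.length);
    }

    public static void main(String[] args) {
        // Hypothetical per-token probabilities for a three-token name
        double[] tokenProbs = {0.9, 0.3, 0.8};
        System.out.println(geometricMean(tokenProbs)); // ~0.6
    }
}
```

Note that one weak token (0.3 above) drags the whole span down toward the ~0.5 scores described in the question.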

I have more than 2 million first names, and another 2 million last names (plus their frequency counts), and I want to synthetically create a huge dataset from the outer product of first name X middle name (drawn from the first-name pool) X last name.

The problem is that I get a GC overhead limit exceeded exception before I even get through all the last names once (even when discarding the frequency counts and using each last name only once).
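A back-of-the-envelope count shows how large this outer product is even before allocation patterns matter (the three middle-name variants per first/last pair — full middle name, initial, or none — are an assumption based on the stream logic below):

```java
public class DatasetSize {

    // Total synthetic samples in the outer product of the name pools.
    static long sampleCount(long firstNames, long lastNames, long middleVariants) {
        return firstNames * lastNames * middleVariants;
    }

    public static void main(String[] args) {
        long total = sampleCount(2_000_000L, 2_000_000L, 3L);
        System.out.println(total); // 12,000,000,000,000 samples
    }
}
```

Twelve trillion samples: any trainer that tries to index the feature counts of this stream in memory will exhaust the heap long before finishing one pass.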

I'm implementing an ObjectStream and feeding it to the train function:

public class OpenNLPNameStream implements ObjectStream<NameSample> {

    private List<Map<String, Object>> firstNames = null;
    private List<Map<String, Object>> lastNames = null;
    private int firstNameIdx = 0;
    private int firstNameCountIdx = 0;
    private int middleNameIdx = 0;
    private int middleNameCountIdx = 0;
    private int lastNameIdx = 0;
    private int lastNameCountIdx = 0;

    private int firstNameMaxCount = 0;
    private int middleNameMaxCount = 0;
    private int lastNameMaxCount = 0;

    private int firstNameKBSize = 0;
    private int lastNameKBSize = 0;
    Span span[] = new Span[1];
    String fullName[] = new String[3];
    String partialName[] = new String[2];

    private void increaseFirstNameCountIdx()
    {
        firstNameCountIdx++;
        if (firstNameCountIdx == firstNameMaxCount) {
            firstNameIdx++;
            if (firstNameIdx == firstNameKBSize)
                return; //no need to update anything - this is the end of the run...
            firstNameMaxCount = getFirstNameMaxCount(firstNameIdx);
            firstNameCountIdx = 0;
        }
    }

    private void increaseMiddleNameCountIdx()
    {
        middleNameCountIdx++; // was lastNameCountIdx++, a copy-paste bug
        if (middleNameCountIdx == middleNameMaxCount) {
            middleNameIdx++; // advance to the next middle name (drawn from the first-name pool)
            if (middleNameIdx == firstNameKBSize) {
                resetMiddleNameIdx();
                increaseFirstNameCountIdx();
            } else {
                middleNameMaxCount = getMiddleNameMaxCount(middleNameIdx);
                middleNameCountIdx = 0;
            }
        }
    }

    private void increaseLastNameCountIdx()
    {
        lastNameCountIdx++;
        if (lastNameCountIdx == lastNameMaxCount) {
            lastNameIdx++;
            if (lastNameIdx == lastNameKBSize) {
                resetLastNameIdx();
                increaseMiddleNameCountIdx();
            }
            else {
                lastNameMaxCount = getLastNameMaxCount(lastNameIdx);
                lastNameCountIdx = 0;
            }
        }
    }

    private void resetLastNameIdx()
    {
        lastNameIdx = 0;
        lastNameMaxCount = getLastNameMaxCount(0);
        lastNameCountIdx = 0;
    }


    private void resetMiddleNameIdx()
    {
        middleNameIdx = 0;
        middleNameMaxCount = getMiddleNameMaxCount(0);
        middleNameCountIdx = 0;
    }

    private int getFirstNameMaxCount(int i)
    {
        return 1; // compromised on using just one occurrence per first name
        //String occurences = (String) firstNames.get(i).get("occurences");
        //return Integer.parseInt(occurences);
    }         

    private int getMiddleNameMaxCount(int i)
    {
        return 3; // compromised on using just three occurrences per middle name
        //String occurences = (String) firstNames.get(i).get("occurences");
        //return Integer.parseInt(occurences);
    }

    private int getLastNameMaxCount(int i)
    {
        return 1;
        //String occurences = (String) lastNames.get(i).get("occurences");
        //return Integer.parseInt(occurences);
    }

    @Override
    public NameSample read() throws IOException {
        if (firstNames == null) {
            firstNames = CSVFileTools.readFileFromInputStream("namep_first_name_idf.csv", new ClassPathResource("namep_first_name_idf.csv").getInputStream());
            firstNameKBSize = firstNames.size();
            firstNameMaxCount = getFirstNameMaxCount(0);
            middleNameMaxCount = getMiddleNameMaxCount(0); // was getFirstNameMaxCount(0), a copy-paste bug
        }
        if (lastNames == null) {
            lastNames = CSVFileTools.readFileFromInputStream("namep_last_name_idf.csv",new ClassPathResource("namep_last_name_idf.csv").getInputStream());
            lastNameKBSize = lastNames.size();
            lastNameMaxCount = getLastNameMaxCount(0);
        }
        increaseLastNameCountIdx();
        if (firstNameIdx == firstNameKBSize)
            return null; //we've finished iterating over all permutations!

        String [] sentence;
        if (firstNameCountIdx < firstNameMaxCount / 3)
        {
            span[0] = new Span(0,2,"Name");
            sentence = partialName;
            sentence[0] = (String)firstNames.get(firstNameIdx).get("first_name");
            sentence[1] = (String)lastNames.get(lastNameIdx).get("last_name");
        }
        else
        {
            span[0] = new Span(0,3,"Name"); // keep the entity label consistent ("Name", not "name")
            sentence = fullName;
            sentence[0] = (String)firstNames.get(firstNameIdx).get("first_name");
            sentence[2] = (String)lastNames.get(lastNameIdx).get("last_name");
            if (firstNameCountIdx < 2 * firstNameMaxCount / 3) { // was compared against itself
                sentence[1] = (String)firstNames.get(middleNameIdx).get("first_name");
            }
            else {
                sentence[1] = ((String)firstNames.get(middleNameIdx).get("first_name")).substring(0,1) + ".";
            }
        }

        return new NameSample(sentence,span,true);
    }

    @Override
    public void reset() throws IOException, UnsupportedOperationException {
        firstNameIdx = 0;
        firstNameCountIdx = 0;
        middleNameIdx = 0;
        middleNameCountIdx = 0;
        lastNameIdx = 0;
        lastNameCountIdx = 0;

        firstNameMaxCount = 0;
        middleNameMaxCount = 0;
        lastNameMaxCount = 0;
    }

    @Override
    public void close() throws IOException {
        reset();
        firstNames = null;
        lastNames = null;
    }
}
After a few minutes of running, I get the following error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

    at opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowFeatureGenerator.java:112)
    at opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(AggregatedFeatureGenerator.java:79)
    at opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedFeatureGenerator.java:69)
    at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:118)
    at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:37)
    at opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEventStream.java:113)
    at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:137)
    at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:36)
    at opennlp.tools.util.AbstractEventStream.read(AbstractEventStream.java:62)
    at opennlp.tools.util.AbstractEventStream.read(AbstractEventStream.java:27)
    at opennlp.tools.util.AbstractObjectStream.read(AbstractObjectStream.java:32)
    at opennlp.tools.ml.model.HashSumEventStream.read(HashSumEventStream.java:46)
    at opennlp.tools.ml.model.HashSumEventStream.read(HashSumEventStream.java:29)
    at opennlp.tools.ml.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:130)
    at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:83)
    at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
    at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
    at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:337)
Edit: After increasing the JVM's memory to 8GB, I still can't get through the first 2 million last names, but now the exception is:

java.lang.OutOfMemoryError: Java heap space

    at java.util.HashMap.resize(HashMap.java:703)
    at java.util.HashMap.putVal(HashMap.java:662)
    at java.util.HashMap.put(HashMap.java:611)
    at opennlp.tools.ml.model.AbstractDataIndexer.update(AbstractDataIndexer.java:141)
    at opennlp.tools.ml.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:134)
    at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:83)
    at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
    at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
    at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:337)
The problem seems to stem from the fact that I create a new NameSample, along with new Spans and Strings, on every read call... but I can't reuse the Spans or NameSamples, since they're immutable.

Should I write my own language model? Is there a better Java library for this kind of thing (all I want is the probability that an extracted text is actually a name)? Are there parameters I should tune for the model I'm training?
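On the parameter question: the stack traces show both errors coming from TwoPassDataIndexer building a HashMap of feature counts, so one lever is the feature cutoff, which discards rare features during indexing. A configuration sketch, assuming OpenNLP 1.6.x on the classpath and the OpenNLPNameStream class above (the cutoff and iteration values are illustrative guesses, not tuned recommendations):

```java
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.TrainingParameters;

public class TrainWithCutoff {
    public static void main(String[] args) throws Exception {
        TrainingParameters params = TrainingParameters.defaultParams();
        // Raise the cutoff so rare features are dropped, shrinking the
        // HashMap that TwoPassDataIndexer keeps in memory.
        params.put(TrainingParameters.CUTOFF_PARAM, "20");
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");

        TokenNameFinderModel model = NameFinderME.train(
                "en", "Name", new OpenNLPNameStream(),
                params, new TokenNameFinderFactory());
    }
}
```

Even with a high cutoff, a stream of this size may still need to be truncated or sampled down before the indexer can finish.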


Any advice would be appreciated.

Comments: What memory settings are you using? How about giving the JVM more memory with the `-Xmx` flag? — @haraldK I increased it to 8GB, and now the error is slightly different: a Java heap space error while trying to resize a HashMap during AbstractDataIndexer.update... — @Mr.shalme Writing your own language model would be better. — @ Thanks, I wrote one today :)