Java: "GC overhead limit exceeded" when training OpenNLP's NameFinderME
I want to use NameFinderME to get probability scores for extracted names, but calling the probs function with the provided model gives very poor probabilities. For example, "Scott F. Fitzgerald" scores around 0.5 (average log probability, exponentiated), while "North Japan" and "Executive Vice President, Corporate Relations and Chief Philanthropy Officer" both score above 0.9.

I have over 2 million first names and another 2 million last names (plus their frequency counts), and I want to synthetically create a huge training set from the outer product of first name X middle name (drawn from the first-name pool) X last name.

The problem is that I hit a "GC overhead limit exceeded" exception before I can even iterate over all the last names once (even when discarding the frequency counts and using each last name only once).

I am implementing an ObjectStream and feeding it to the train function:
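For reference, the span score described above (average log probability, exponentiated) is just the geometric mean of the per-token probabilities. A minimal, self-contained sketch of that computation, assuming the per-token values are plain probabilities as returned by NameFinderME.probs:

```java
public class NameScore {
    // Combine per-token probabilities into a single span score by taking
    // the exponential of the average log-probability (geometric mean).
    public static double spanScore(double[] tokenProbs) {
        double logSum = 0.0;
        for (double p : tokenProbs) {
            logSum += Math.log(p);
        }
        return Math.exp(logSum / tokenProbs.length);
    }

    public static void main(String[] args) {
        // Hypothetical per-token probabilities for a three-token name.
        double[] probs = {0.9, 0.8, 0.7};
        System.out.println(spanScore(probs));
    }
}
```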
public class OpenNLPNameStream implements ObjectStream<NameSample> {
private List<Map<String, Object>> firstNames = null;
private List<Map<String, Object>> lastNames = null;
private int firstNameIdx = 0;
private int firstNameCountIdx = 0;
private int middleNameIdx = 0;
private int middleNameCountIdx = 0;
private int lastNameIdx = 0;
private int lastNameCountIdx = 0;
private int firstNameMaxCount = 0;
private int middleNameMaxCount = 0;
private int lastNameMaxCount = 0;
private int firstNameKBSize = 0;
private int lastNameKBSize = 0;
Span span[] = new Span[1];
String fullName[] = new String[3];
String partialName[] = new String[2];
private void increaseFirstNameCountIdx()
{
firstNameCountIdx++;
if (firstNameCountIdx == firstNameMaxCount) {
firstNameIdx++;
if (firstNameIdx == firstNameKBSize)
return; //no need to update anything - this is the end of the run...
firstNameMaxCount = getFirstNameMaxCount(firstNameIdx);
firstNameCountIdx = 0;
}
}
private void increaseMiddleNameCountIdx()
{
middleNameCountIdx++;
if (middleNameCountIdx == middleNameMaxCount) {
middleNameIdx++;
if (middleNameIdx == firstNameKBSize) {
resetMiddleNameIdx();
increaseFirstNameCountIdx();
} else {
middleNameMaxCount = getMiddleNameMaxCount(middleNameIdx);
middleNameCountIdx = 0;
}
}
}
private void increaseLastNameCountIdx()
{
lastNameCountIdx++;
if (lastNameCountIdx == lastNameMaxCount) {
lastNameIdx++;
if (lastNameIdx == lastNameKBSize) {
resetLastNameIdx();
increaseMiddleNameCountIdx();
}
else {
lastNameMaxCount = getLastNameMaxCount(lastNameIdx);
lastNameCountIdx = 0;
}
}
}
private void resetLastNameIdx()
{
lastNameIdx = 0;
lastNameMaxCount = getLastNameMaxCount(0);
lastNameCountIdx = 0;
}
private void resetMiddleNameIdx()
{
middleNameIdx = 0;
middleNameMaxCount = getMiddleNameMaxCount(0);
middleNameCountIdx = 0;
}
private int getFirstNameMaxCount(int i)
{
return 1; //compromised on a fixed count instead of the CSV frequency
//String occurences = (String) firstNames.get(i).get("occurences");
//return Integer.parseInt(occurences);
}
private int getMiddleNameMaxCount(int i)
{
return 3; //compromised on a fixed count instead of the CSV frequency
//String occurences = (String) firstNames.get(i).get("occurences");
//return Integer.parseInt(occurences);
}
private int getLastNameMaxCount(int i)
{
return 1;
//String occurences = (String) lastNames.get(i).get("occurences");
//return Integer.parseInt(occurences);
}
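If the commented-out frequency counts are ever brought back, capping them keeps frequent names from multiplying the size of the generated cross product. A small sketch (the cap value and the fallback on malformed input are illustrative choices, not part of the original code):

```java
public class Occurrences {
    // Parse a CSV occurrence field and cap it so that very frequent names
    // don't blow up the number of generated permutations.
    public static int cappedCount(String occurrences, int cap) {
        int n;
        try {
            n = Integer.parseInt(occurrences.trim());
        } catch (NumberFormatException e) {
            n = 1; // fall back to a single use on malformed input
        }
        return Math.max(1, Math.min(n, cap));
    }

    public static void main(String[] args) {
        System.out.println(cappedCount("42", 10));
    }
}
```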
@Override
public NameSample read() throws IOException {
if (firstNames == null) {
firstNames = CSVFileTools.readFileFromInputStream("namep_first_name_idf.csv", new ClassPathResource("namep_first_name_idf.csv").getInputStream());
firstNameKBSize = firstNames.size();
firstNameMaxCount = getFirstNameMaxCount(0);
middleNameMaxCount = getMiddleNameMaxCount(0);
}
if (lastNames == null) {
lastNames = CSVFileTools.readFileFromInputStream("namep_last_name_idf.csv",new ClassPathResource("namep_last_name_idf.csv").getInputStream());
lastNameKBSize = lastNames.size();
lastNameMaxCount = getLastNameMaxCount(0);
}
increaseLastNameCountIdx();
if (firstNameIdx == firstNameKBSize)
return null; //we've finished iterating over all permutations!
String [] sentence;
if (firstNameCountIdx < firstNameMaxCount / 3)
{
span[0] = new Span(0,2,"name");
sentence = partialName;
sentence[0] = (String)firstNames.get(firstNameIdx).get("first_name");
sentence[1] = (String)lastNames.get(lastNameIdx).get("last_name");
}
else
{
span[0] = new Span(0,3,"name");
sentence = fullName;
sentence[0] = (String)firstNames.get(firstNameIdx).get("first_name");
sentence[2] = (String)lastNames.get(lastNameIdx).get("last_name");
if (firstNameCountIdx < 2*firstNameMaxCount/3) {
sentence[1] = (String)firstNames.get(middleNameIdx).get("first_name");
}
else {
sentence[1] = ((String)firstNames.get(middleNameIdx).get("first_name")).substring(0,1) + ".";
}
}
return new NameSample(sentence,span,true);
}
@Override
public void reset() throws IOException, UnsupportedOperationException {
firstNameIdx = 0;
firstNameCountIdx = 0;
middleNameIdx = 0;
middleNameCountIdx = 0;
lastNameIdx = 0;
lastNameCountIdx = 0;
firstNameMaxCount = 0;
middleNameMaxCount = 0;
lastNameMaxCount = 0;
}
@Override
public void close() throws IOException {
reset();
firstNames = null;
lastNames = null;
}
}
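One way to sidestep the memory problem entirely is to not enumerate the full first x middle x last cross product (which is on the order of 2M x 2M x 2M samples) and instead draw a bounded number of random combinations. A self-contained sketch of such a sampler, with the one-third middle-initial rule borrowed from the stream above (list contents and seed are illustrative):

```java
import java.util.List;
import java.util.Random;

public class NameSampler {
    private final List<String> firstNames;
    private final List<String> lastNames;
    private final Random rng;

    public NameSampler(List<String> firstNames, List<String> lastNames, long seed) {
        this.firstNames = firstNames;
        this.lastNames = lastNames;
        this.rng = new Random(seed);
    }

    // Draw one synthetic "First Middle Last" sentence at random instead of
    // enumerating every permutation of the name pools.
    public String[] next() {
        String first = firstNames.get(rng.nextInt(firstNames.size()));
        String middle = firstNames.get(rng.nextInt(firstNames.size()));
        String last = lastNames.get(rng.nextInt(lastNames.size()));
        // Abbreviate the middle name to an initial about a third of the time.
        if (rng.nextInt(3) == 0) {
            middle = middle.substring(0, 1) + ".";
        }
        return new String[] {first, middle, last};
    }

    public static void main(String[] args) {
        NameSampler sampler = new NameSampler(
                List.of("Scott", "Ann"), List.of("Fitzgerald", "Smith"), 42L);
        System.out.println(String.join(" ", sampler.next()));
    }
}
```

A read() built on this can stop after a fixed number of samples, which bounds the number of events the trainer's DataIndexer has to hold.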
After running for a few minutes, I get the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at opennlp.tools.util.featuregen.WindowFeatureGenerator.createFeatures(WindowFeatureGenerator.java:112)
at opennlp.tools.util.featuregen.AggregatedFeatureGenerator.createFeatures(AggregatedFeatureGenerator.java:79)
at opennlp.tools.util.featuregen.CachedFeatureGenerator.createFeatures(CachedFeatureGenerator.java:69)
at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:118)
at opennlp.tools.namefind.DefaultNameContextGenerator.getContext(DefaultNameContextGenerator.java:37)
at opennlp.tools.namefind.NameFinderEventStream.generateEvents(NameFinderEventStream.java:113)
at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:137)
at opennlp.tools.namefind.NameFinderEventStream.createEvents(NameFinderEventStream.java:36)
at opennlp.tools.util.AbstractEventStream.read(AbstractEventStream.java:62)
at opennlp.tools.util.AbstractEventStream.read(AbstractEventStream.java:27)
at opennlp.tools.util.AbstractObjectStream.read(AbstractObjectStream.java:32)
at opennlp.tools.ml.model.HashSumEventStream.read(HashSumEventStream.java:46)
at opennlp.tools.ml.model.HashSumEventStream.read(HashSumEventStream.java:29)
at opennlp.tools.ml.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:130)
at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:83)
at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:337)
Edit: After increasing the JVM's memory to 8GB, I still can't get through the first 2 million last names, but now the exception is:
java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:703)
at java.util.HashMap.putVal(HashMap.java:662)
at java.util.HashMap.put(HashMap.java:611)
at opennlp.tools.ml.model.AbstractDataIndexer.update(AbstractDataIndexer.java:141)
at opennlp.tools.ml.model.TwoPassDataIndexer.computeEventCounts(TwoPassDataIndexer.java:134)
at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:83)
at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:337)
The problem seems to stem from the fact that I'm creating a new NameSample, plus new Spans and Strings, on every read call... but I can't reuse the Spans or NameSamples, because they are immutable.
Should I write my own language model? Is there a better Java library for this kind of task (I only need the probability that an extracted piece of text is actually a name)? Are there parameters I should tune for the model I'm training?
Any advice would be appreciated.

Comments:
What memory settings are you using? How about giving the JVM more memory with the -Xmx flag?
@haraldK I increased it to 8GB, and now the error is a bit different: a Java heap space error while trying to resize a HashMap during AbstractDataIndexer.update...
@Mr.shalme Writing your own language model would be better.
Thanks, I wrote one today :)
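For reference, the heap limit mentioned in the comments is set with the standard -Xmx JVM flag; a sketch of the invocation, where the jar path and main class name are placeholders:

```shell
# Give the JVM an 8 GB maximum heap (and a 2 GB initial heap) for training
java -Xms2g -Xmx8g -cp opennlp-tools.jar:. TrainNameFinder
```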