NLP: Algorithm to extract simple sentences from complex (mixed) sentences?
Is there an algorithm that can be used to extract simple sentences from a paragraph?

My ultimate goal is to later run another algorithm on the resulting simple sentences to determine the author's sentiment.

I have researched this in sources such as Chae Deug Park, but none of them discuss preparing simple sentences as training data.

Thanks in advance.

Take a look at the linked library; it has a sentence detector module. The documentation has examples of how to use it from both the command line and the API.

I just implemented the same functionality using OpenNLP:
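    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;
    import java.util.List;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws IOException {
        // Load the pre-trained English sentence model; try-with-resources closes the stream even on failure.
        try (InputStream is = new FileInputStream("resources/models/en-sent.bin")) {
            SentenceModel model = new SentenceModel(is);
            SentenceDetectorME sdetector = new SentenceDetectorME(model);
            return Arrays.asList(sdetector.sentDetect(paragraph));
        }
    }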
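It fails only when the text itself contains human errors. For example, the abbreviation "Dr." must be written with a capital D, and there must be at least one space between two sentences.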
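One possible way to soften the missing-space problem is to normalize the text before it reaches the detector. The sketch below is an assumption, not part of OpenNLP or of the answer above: the class name `SentenceSpacingNormalizer` is hypothetical, and the rule (lowercase letter, then `.`, `!` or `?`, then an uppercase letter) is only a heuristic that can misfire on tokens such as mixed-case file names.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-processing helper: repairs a missing space between two sentences. */
final class SentenceSpacingNormalizer {
    // Zero-width position between "<lowercase letter><. ! or ?>" and an uppercase letter,
    // for example between "Door." and "Noone". Inputs like "U.S." or "Jan.13" do not match.
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?<=[a-z][.!?])(?=[A-Z])");

    /** Returns the text with a single space inserted at every matching position. */
    static String normalize(String text) {
        return MISSING_SPACE.matcher(text).replaceAll(" ");
    }
}
```

The normalized string can then be passed to `breakIntoSentencesOpenNlp` in place of the raw paragraph. It does not address lowercase abbreviations such as "dr.".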
You can also implement it with a custom regular expression (RE), as follows:
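    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static List<String> breakIntoSentencesCustomRESplitter(String paragraph) {
        List<String> sentences = new ArrayList<>();
        // A sentence starts at a non-space, non-terminator character and runs to a terminator followed by whitespace or end of input.
        Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
        Matcher reMatcher = re.matcher(paragraph);
        while (reMatcher.find()) {
            sentences.add(reMatcher.group());
        }
        return sentences;
    }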
However, its miss rate is quite high. Another approach is to use BreakIterator:
    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public static List<String> breakIntoSentencesBreakIterator(String paragraph) {
        List<String> sentences = new ArrayList<>();
        BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentenceIterator.setText(paragraph);
        // Walk the boundaries front to back so the sentences keep their original order.
        int start = sentenceIterator.first();
        for (int end = sentenceIterator.next();
                end != BreakIterator.DONE;
                start = end, end = sentenceIterator.next()) {
            sentences.add(paragraph.substring(start, end));
        }
        return sentences;
    }
Benchmarks:
- Custom RE: 7 ms
- BreakIterator: 143 ms
- OpenNLP: 255 ms
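The answer does not say how these timings were taken. On the JVM, a single run is usually dominated by class loading and JIT warm-up, and the OpenNLP method above also reloads its model file on every call. A rough sketch of a steadier comparison, assuming a hypothetical helper named `SplitterTimer` that is not part of the original answer, could look like this:

```java
/** Hypothetical timing helper: averages many runs after a warm-up phase. */
final class SplitterTimer {
    /**
     * Runs the task warmupRuns times without timing it, then measuredRuns times
     * with timing, and returns the average wall-clock milliseconds per measured run.
     */
    static double averageMillis(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / measuredRuns;
    }
}
```

For example, `SplitterTimer.averageMillis(() -> breakIntoSentencesBreakIterator(paragraph), 1000, 10000)` would time the BreakIterator variant. For publishable numbers, a dedicated harness such as JMH is the safer choice.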
... NLP question. For readers who are not involved in NLP, I think @JohnRambo could provide a link to a definition (for example ...).
    // Test cases for the custom RE splitter
    String paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);
    // Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);
    // Failed at Mr., mrs.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);
    // Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);
    // Failed at U.S.
    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);
    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);
    // Test cases for the BreakIterator splitter
    String paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);
    // Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);
    // Failed at Mr.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);
    // Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);
    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);
    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);
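Several of the failures noted in the test cases come from abbreviations such as "Mr.", "mrs." and "dr." being treated as sentence ends. One way to reduce them, sketched here as an assumption rather than as part of the original answer, is a post-processing step that glues a fragment back onto the next one whenever it ends in a known abbreviation. The class name `AbbreviationMerger` and the abbreviation list are hypothetical and purely illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical post-processing step for the output of any of the splitters. */
final class AbbreviationMerger {
    // Illustrative list only (an assumption); extend it for your own corpus.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("mr.", "mrs.", "ms.", "dr.", "prof.", "u.s."));

    /** Joins each fragment that ends in a known abbreviation with the fragment that follows it. */
    static List<String> merge(List<String> fragments) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String fragment : fragments) {
            String trimmed = fragment.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(trimmed);
            // Emit the accumulated text only when this fragment looks like a real sentence end.
            if (!endsWithAbbreviation(trimmed)) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            merged.add(pending.toString());
        }
        return merged;
    }

    // Compares the last whitespace-separated token, case-insensitively, with the list.
    private static boolean endsWithAbbreviation(String fragment) {
        String lastToken = fragment.substring(fragment.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastToken.toLowerCase(Locale.ROOT));
    }
}
```

The trade-off is that a sentence which genuinely ends in a listed abbreviation (for example "... in the U.S.") is merged with the sentence after it, so the list should stay small and domain-specific.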