JavaScript: split Markdown first by headings, then by sentences

I want to split a Markdown text into parts as shown below, first by headings and then by sentences:

# Heading
some text including multiple sentences...
## another heading
some text including multiple sentences....
## ...
into:

# Heading
sent1 
-----
sent2
-----
....
----
## another heading
sent1
----
sent2
----
....
----
## ...
This is what I have tried:

var HReg = new RegExp(/^(#{1,6}\s)(.*)/, 'gm');
var SentReg = new RegExp(/\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/, 'g');


var res1 = text.replace(HReg, function (m, g1, g2) {
    return g1 + g2 + "\r";
});

result = res1.replace(SentReg, function (m, g1, g2) {
    return g1 ? g1 : g2 + "\r"; // it's for ignoring abbreviations.
});

arr = result.split('\r');

But it separates some headings from their first sentence, or attaches another heading to the end of the previous sentence.

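This is by no means a better option than the suggestion to use a proper parser, but here is a regex that works well enough as a proof of concept: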

var s = `# Heading
some text, including multiple sentences. some text including multiple sentences! some text including multiple sentences?
## another heading
some text including multiple sentences. some text including multiple sentences! some text including multiple sentences?
## ABC
some text including multiple sentences. some text including multiple sentences! some text including multiple sentences?
`;

var result = s.match(/(#+.*)|([^!?;.\n]+.)/g).map(v=>v.trim())

0: "# Heading"
1: "some text, including multiple sentences."
2: "some text including multiple sentences!"
3: "some text including multiple sentences?"
4: "## another heading"
5: "some text including multiple sentences."
6: "some text including multiple sentences!"
7: "some text including multiple sentences?"
8: "## ABC"
9: "some text including multiple sentences."
10: "some text including multiple sentences!"
11: "some text including multiple sentences?"

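You can remove the ; from the [] character class if you want it to be treated as part of the sentence block. Of course, this does not protect you from anyone who decides not to use punctuation.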
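To make the effect of the semicolon concrete, here is a small sketch that runs the pattern above with and without ; in the character class. The sample string and the variable names are made up for illustration; only the first pattern comes from the answer.

```javascript
// Hypothetical sample text; the semicolon is the interesting part.
const sample = "# Title\nfirst part; second part. Next one!";

// Pattern from the answer: ';' ends a chunk just like '.', '!' and '?'.
const withSemicolon = sample
  .match(/(#+.*)|([^!?;.\n]+.)/g)
  .map(v => v.trim());

// Variant with ';' removed from the character class: the semicolon
// now stays inside the sentence it belongs to.
const withoutSemicolon = sample
  .match(/(#+.*)|([^!?.\n]+.)/g)
  .map(v => v.trim());

console.log(withSemicolon);
// [ '# Title', 'first part;', 'second part.', 'Next one!' ]
console.log(withoutSemicolon);
// [ '# Title', 'first part; second part.', 'Next one!' ]
```

Neither variant knows about abbreviations such as "e.g.", so text containing them will still be cut in the middle.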

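When regex is your hammer, everything looks like a thumb. In my opinion it would be best to use a real Markdown parser for all of this. That aside: it is best to go line by line, keep track of the current heading depth, and process each section until you reach a heading one level up or down, which makes the whole thing recursive and fairly simple.

When you use the constructor new RegExp, the first argument is normally a string, not a regex literal, so write var HReg = new RegExp('^(#{1,6}\\s)(.*)', 'gm'); or use the literal form var HReg = /^(#{1,6}\s)(.*)/gm;

My own sentence-splitting solution takes abbreviations into account! Also, I need each heading to have its first sentence. Anyway, thanks; I may want to separate them, and your solution gives me an idea.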
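The line-by-line idea from the comments, combined with a splitter that skips dotted abbreviations, could look roughly like the sketch below. It is only an illustration: splitSentences and splitSections are hypothetical names, the loop is a flat iterative pass rather than the recursive version the commenter describes, and the regex lookbehind assumes an ES2018-capable engine (for example a recent Node.js).

```javascript
// Split a block of text into sentences. A split happens after '.', '?' or '!'
// followed by whitespace, except right after a two-letter dotted
// abbreviation such as "e.g." or "i.e.".
function splitSentences(text) {
  return text
    .split(/(?<!\b\w\.\w\.)(?<=[.?!])\s+/)
    .map(s => s.trim())
    .filter(Boolean);
}

// Walk the Markdown line by line and start a new section at every heading.
function splitSections(markdown) {
  const sections = [];
  let current = { heading: null, depth: 0, body: [] };

  // Store the section collected so far, if it contains anything.
  const flush = () => {
    if (current.heading !== null || current.body.length > 0) {
      sections.push({
        heading: current.heading,
        depth: current.depth,
        sentences: splitSentences(current.body.join(" ")),
      });
    }
  };

  for (const line of markdown.split(/\r?\n/)) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (m) {
      flush();
      current = { heading: line.trim(), depth: m[1].length, body: [] };
    } else if (line.trim() !== "") {
      current.body.push(line.trim());
    }
  }
  flush();
  return sections;
}

// Hypothetical input, rendered in the layout asked for in the question.
const demo = "# Heading\nFirst one. Use e.g. this! Done?\n## Sub\nOnly sentence.";
const sections = splitSections(demo);
const rendered = sections
  .map(s => [s.heading, ...s.sentences.flatMap(x => [x, "-----"])]
    .filter(x => x !== null)
    .join("\n"))
  .join("\n");
console.log(rendered);
```

Because every section keeps its own heading and depth, each heading stays attached to its sentences, and the depth value is available if nested handling is needed later.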