JavaScript string tokenizer regex with placeholders
I have a tokenizer function that takes a string, a regex pattern to split on, and an arbitrary list of regex patterns that must be protected from tokenization. To achieve this, I use the placeholder ____SSS____:
function tokenize(str, default_pattern, protected_patterns) {
  // Combine all protected patterns into a single alternation.
  const screen = new RegExp('(?:' + protected_patterns.map(s => '(?:' + s + ')').join('|') + ')', 'gi');
  const screened = [];
  // Replace every protected match with an indexed placeholder.
  str = str.replace(screen, s => {
    const i = screened.push(s) - 1;
    return '____SSS____' + i + '____SSS____'; // chose a non-separator as screener, so that these placeholders don't get split.
  });
  // Split, then restore every placeholder inside each token (note the g flag).
  const res = str.split(default_pattern).map(s => s.replace(/____SSS____(\d+)____SSS____/g, (_, i) => screened[i]));
  return res;
}
For example, if I want to prevent the pattern yo-ho from being split, I do the following:
tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/i, ["\\byo-ho\\b"])
(8) ["Podia", "ser", "yo-ho", "mi", "amor", "ahora", "ya", "acabó"]
Of course, I have to add the placeholder format ____SSS____(\d+)____SSS____ to the regex. Without it, the placeholder itself gets split:
tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/i, ["\\byo-ho\\b"])
(9) ["Podia", "ser", "SSS", "SSS", "mi", "amor", "ahora", "ya", "acabó"]
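To see where the stray "SSS" tokens come from, it may help to look at the masked string just before it is split. The snippet below is only an illustrative sketch; `masked`, `plainRule` and `patchedRule` are names made up for the example. It also shows that, inside a character class, the placeholder format is read as a set of individual characters and not as a sequence, so for this rule adding `_` and `\d` appears to be enough.

```javascript
// The string as tokenize() would see it after masking \byo-ho\b.
const masked = 'Podia ser yo-ho, mi amor'.replace(/\byo-ho\b/gi, '____SSS____0____SSS____');
console.log(masked);

// Without placeholder characters in the class, "_" and digits are separators,
// so only the two "SSS" runs of the placeholder survive as tokens.
const plainRule = /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/i;
console.log(masked.split(plainRule));

// Inside [...] the text ____SSS____(\d+)____SSS____ only adds the characters
// "_", "S", "(", ")", "+" and the digits; here "_" and \d suffice.
const patchedRule = /[^a-zA-Zá-úÁ-ÚñÑüÜ_\d]+/i;
console.log(masked.split(patchedRule));
```

A side effect worth keeping in mind: once `_` and digits stop being separators, ordinary input such as "abc_123" is no longer split at those characters either.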
Now, for different languages, I may have different split rules, such as
{
"es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
"fr" : /[^a-z0-9äâàéèëêïîöôùüûœç]+/i
}
I would like to dynamically add the placeholder format ____SSS____(\d+)____SSS____ to each of these rules, so that they become
{
"es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/,
"fr" : /[^a-z0-9äâàéèëêïîöôùüûœç____SSS____(\d+)____SSS____]+/i
}
This would make the tokenizer work correctly with protected patterns.

You can simply capture the existing split rule like this:
(.+)(\].*)
and insert the placeholder between the first and second capture groups.
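The capture-and-insert step could be sketched roughly as follows. This is a minimal sketch, assuming every split rule is a single character class like the ones in the question; `addPlaceholderChars`, `PLACEHOLDER_CHARS`, `rules` and `patched` are hypothetical names introduced for illustration. It applies `(.+)(\].*)` to the rule's `source` and rebuilds the regex with the original `flags`.

```javascript
// Hypothetical helper: inserts the placeholder characters into a split rule
// that consists of a single character class, e.g. /[^a-z]+/i.
const PLACEHOLDER_CHARS = '____SSS____(\\d+)____SSS____';

function addPlaceholderChars(rule, chars = PLACEHOLDER_CHARS) {
  // Group 1: everything before the last "]"; group 2: that "]" and the rest.
  const m = rule.source.match(/(.+)(\].*)/);
  if (!m) {
    throw new Error('split rule does not contain a character class: ' + rule);
  }
  // Rebuild the regex, keeping the original flags (such as "i").
  return new RegExp(m[1] + chars + m[2], rule.flags);
}

const rules = {
  es: /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
  fr: /[^a-z0-9äâàéèëêïîöôùüûœç]+/i,
};

// Patch every language rule without editing the rules by hand.
const patched = Object.fromEntries(
  Object.entries(rules).map(([lang, rule]) => [lang, addPlaceholderChars(rule)])
);

console.log(patched.es.source);
console.log(patched.fr.flags);
```

Because `.+` is greedy, the insertion happens before the last `]` in the pattern. A rule with several character classes, or one that ends in an escaped `\]`, would need a more careful rewrite than this sketch provides.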