JavaScript string tokenizer regex with placeholders (javascript, regex, tokenize)


I have a tokenizer function that takes a string, a regex pattern to use for split, and an arbitrary list of regex patterns that must be protected from tokenization. To achieve this, I use the placeholder ____SSS____:

function tokenize(str, default_pattern, protected_patterns) {
    // Build one alternation out of all protected patterns.
    const screen = new RegExp('(?:' + protected_patterns.map(s => '(?:' + s + ')').join('|') + ')', "gi");
    const screened = [];
    // Swap every protected match for an indexed placeholder.
    str = str.replace(screen, s => {
        const i = screened.push(s) - 1;
        return '____SSS____' + i + '____SSS____'; // chose a non-separator as screener, so that these placeholders don't get split.
    });
    // Split, then restore every placeholder inside each token (note the g flag).
    const res = str.split(default_pattern).map(s => s.replace(/____SSS____(\d+)____SSS____/g, (_, i) => screened[i]));
    return res;
}
For example, if I want to prevent the pattern yo-ho from being split, I do the following:

tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/i, ["\\byo-ho\\b"])
(8) ["Podia", "ser", "yo-ho", "mi", "amor", "ahora", "ya", "acabó"]
Of course, I have to add the placeholder format ____SSS____(\d+)____SSS____ to the regex. Without it, the placeholder itself gets split:

tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/i, ["\\byo-ho\\b"])
(9) ["Podia", "ser", "SSS", "SSS", "mi", "amor", "ahora", "ya", "acabó"]
Now, I may have different split rules for different languages, such as:

{
    "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
    "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç]+/i
}
I would like to add the placeholder ____SSS____(\d+)____SSS____ to those rules dynamically, so that I end up with:

 {
      "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/,
      "fr" :  /[^a-z0-9äâàéèëêïîöôùüûœç____SSS____(\d+)____SSS____]+/i
 }

That would make the tokenizer work correctly with protected patterns.

You can simply capture the existing split rule like this:
(.+)(\].*)

and insert the placeholder between the first and second capture groups.
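
A minimal sketch of that idea, assuming each split rule ends in a single character class as in the question. The names withPlaceholder, PLACEHOLDER_CHARS and protectedRules are hypothetical and do not appear in the original post. Note that inside a character class the placeholder text is not matched as a sequence; it only adds the characters _, S, (, ), + and the digits to the set, which is enough to keep the placeholder from being split.

```javascript
// Hypothetical helper (not from the original post): returns a new RegExp whose
// last character class also contains the placeholder characters.
const PLACEHOLDER_CHARS = '____SSS____(\\d+)____SSS____';

function withPlaceholder(rule) {
  // Group 1: everything up to the final "]"; group 2: that "]" and whatever follows.
  const parts = rule.source.match(/(.+)(\].*)/);
  if (!parts) {
    throw new Error('No character class found in ' + rule);
  }
  // Rebuild the pattern and keep the original flags (for example "i").
  return new RegExp(parts[1] + PLACEHOLDER_CHARS + parts[2], rule.flags);
}

// The per-language rules from the question.
const rules = {
  es: /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
  fr: /[^a-z0-9äâàéèëêïîöôùüûœç]+/i
};

// Derive the protected variants without editing each rule by hand.
const protectedRules = Object.fromEntries(
  Object.entries(rules).map(([lang, rule]) => [lang, withPlaceholder(rule)])
);
```

A rule from protectedRules can then be passed to tokenize as its default_pattern argument. Because (.+) is greedy, the placeholder lands in the last character class of the rule; a rule with several classes, or with none, would need different handling.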