JavaScript: extracting words from a string with Unicode characters

javascript regex node.js unicode

In JavaScript (Node.js), I need to index a text string that contains Unicode characters, that is, extract the words from a string such as:

"Bonjour à tous le monde, 
je voulais être le premier à vous dire:
  -'comment ça va'
  -<est-ce qu'il fait beau?>"

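How can I do this with a regular expression or any other method?

PS: I installed and tried the xregexp module, which adds Unicode support to JavaScript, but since I am generally hopeless with regular expressions I could not get very far.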

One idea is to split the string on the various characters that are not part of a word, then filter out the empty strings:

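var str = "Bonjour à tous le monde, je voulais être le premier à vous dire:  -'comment ça va'  -<est-ce qu'il fait beau?>";

// Split on runs of separator characters (the comma is included so that
// "monde," does not keep its punctuation), then drop the empty strings
// left over when the text starts or ends with a separator.
var result = str.split(/[-:,'"?\s><]+/).filter(function(item) { return item !== ''; });
/*
["Bonjour", "à", "tous", "le", "monde", "je", "voulais", "être", "le", "premier", "à", "vous", "dire", "comment", "ça", "va", "est", "ce", "qu", "il", "fait", "beau"]
*/

// Alternatively, match the runs of non-separator characters directly:
var result = str.match(/[^-:,'"?\s><]+/g);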
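A short illustration of why this answer lists the separators by hand instead of relying on \w: in JavaScript, \w only matches ASCII letters, digits and the underscore, so accented letters are treated as separators. The variable names below are made up for the example.

```javascript
// \w is ASCII-only ([A-Za-z0-9_]), so accented letters break words apart.
var sample = "être ça va";

var asciiWords = sample.match(/\w+/g);
// "ê" and "ç" are not matched by \w, so the words come out truncated.
console.log(asciiWords); // [ 'tre', 'a', 'va' ]

// Listing the separators explicitly keeps the accented letters intact.
var delimiterSplit = sample.split(/[-:,'"?\s><]+/).filter(function (item) {
  return item !== '';
});
console.log(delimiterSplit); // [ 'être', 'ça', 'va' ]
```

The drawback of an explicit separator list is that any punctuation mark not listed ends up glued to a word.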

You can use the version of XRegExp that adds, among other things, support for Unicode categories in regular expressions. The category we are interested in is "not a Unicode letter", i.e. \P{L}. You can then split the string with the regex XRegExp("\\P{L}+").
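For comparison, here is a minimal sketch of the same split without a library. It assumes a runtime that supports Unicode property escapes (ES2018, for example Node.js 10 or later). It is not XRegExp's API: it uses the built-in RegExp with the u flag, and the helper name splitOnNonLetters is hypothetical.

```javascript
// Assumption: the engine supports \p{...} escapes, which require the "u" flag.
function splitOnNonLetters(text) {
  // \P{L} (capital P) matches any character that is NOT a Unicode letter.
  return text.split(/\P{L}+/u).filter(function (word) {
    // split() yields '' when the text starts or ends with a non-letter.
    return word !== '';
  });
}

var s = "Bonjour à tous le monde,\nje voulais être le premier à vous dire:\n  -'comment ça va'\n  -<est-ce qu'il fait beau?>";
console.log(splitOnNonLetters(s));
```

Because the sample text ends with "?>", the raw split ends with an empty string, hence the filter.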
You could probably use the "uwords" library. It extracts words from text by grouping together characters from the Unicode L* categories.
It works like XRegExp("\\p{L}+"), but is very fast.
For example:
var uwords = require('uwords');
var words = uwords('Bonjour à tous le monde,\n' +
    'je voulais être le premier à vous dire:\n' +
    '-\'comment ça va\'\n' +
    '-<est-ce qu\'il fait beau?>');
console.log(words);

[ 'Bonjour',
  'à',
  'tous',
  'le',
  'monde',
  'je',
  'voulais',
  'être',
  'le',
  'premier',
  'à',
  'vous',
  'dire',
  'comment',
  'ça',
  'va',
  'est',
  'ce',
  'qu',
  'il',
  'fait',
  'beau' ]
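The idea of grouping consecutive letter characters can be sketched by hand. This only illustrates the technique and is not the actual uwords implementation, which is not shown here. groupLetters is a hypothetical name, and the sketch assumes Unicode property escapes are available in the runtime.

```javascript
function groupLetters(text) {
  var isLetter = /^\p{L}$/u; // exactly one code point from any L* category
  var words = [];
  var current = '';
  // for...of iterates by code point, so letters outside the BMP stay whole.
  for (var ch of text) {
    if (isLetter.test(ch)) {
      current += ch;
    } else if (current !== '') {
      words.push(current); // a non-letter closes the current word
      current = '';
    }
  }
  if (current !== '') {
    words.push(current); // flush a word that ends the text
  }
  return words;
}

console.log(groupLetters("-<est-ce qu'il fait beau?>"));
// [ 'est', 'ce', 'qu', 'il', 'fait', 'beau' ]
```

Combining marks are not in the L* categories, so text with decomposed accents would need text.normalize('NFC') first to avoid splitting words at the accent.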

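I think a regular expression like [^\s]* would be enough.

Define "word" and specify which languages you need to handle. Word-boundary rules depend heavily on the language, and the notion of a word is fuzzy. Is "est-ce" two words or one? If "qu'il" is two words (which logically it is), what is the first word?

I must add that using XRegExp solved the problem, but from what I have seen the performance is terrible, to the point of making it useless: XRegExp.split(s, notALetter) can take several hundred milliseconds even on strings of fewer than 100 characters. Anyone with the same problem should be aware of this. The documentation says XRegExp compiles to native regular expressions, so there should be no performance penalty, but that is not what I observed. Possibly because of the Unicode plugin?

The latest version of XRegExp is very fast, at least 40 times faster than uwords. Matching words with XRegExp.match(content, XRegExp('\\p{L}+', 'g')) is faster than splitting on non-words.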
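The question of what counts as a word can also be delegated to the runtime's locale-aware segmentation. The following is a sketch that assumes Intl.Segmenter is available (recent Node.js versions and modern browsers); segmentWords is a hypothetical helper. How cases such as "est-ce" or "qu'il" are segmented depends on the runtime's ICU data, so no output is shown.

```javascript
// Assumption: Intl.Segmenter exists in this runtime.
function segmentWords(text, locale) {
  var segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  var words = [];
  for (var part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) {
      words.push(part.segment);
    }
  }
  return words;
}

console.log(segmentWords("Bonjour à tous le monde, comment ça va?", 'fr'));
```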
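The XRegExp split described above, as code:

var XRegExp = require('xregexp'); // npm module; recent versions export the XRegExp function directly

var s = "Bonjour à tous le monde,\nje voulais être le premier à vous dire:\n  -'comment ça va'\n  -<est-ce qu'il fait beau?>";
var notALetter = XRegExp("\\P{L}+"); // one or more characters that are not Unicode letters
var words = XRegExp.split(s, notALetter);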
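To check the performance claims from the comments on your own machine, a rough timing sketch follows. It uses the native RegExp with the u flag rather than XRegExp (an assumption made so the snippet runs without dependencies), and timeIt is a hypothetical helper. Results vary by engine and version, so no timings are shown.

```javascript
// Build a larger input; the trailing space keeps the repeated copies apart.
var text = "Bonjour à tous le monde, je voulais être le premier à vous dire ".repeat(1000);

function timeIt(label, fn) {
  var start = process.hrtime.bigint(); // nanosecond timer (Node.js only)
  var result = fn();
  var elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(label + ': ' + result.length + ' words in ' + elapsedMs.toFixed(2) + ' ms');
  return result;
}

// Strategy 1: split on runs of non-letters and drop the empty strings.
var bySplit = timeIt('split on non-letters', function () {
  return text.split(/\P{L}+/u).filter(Boolean);
});

// Strategy 2: match runs of letters directly.
var byMatch = timeIt('match letters', function () {
  return text.match(/\p{L}+/gu) || [];
});
```

A single run like this is only indicative; repeating the measurement several times gives a more reliable comparison.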
