JavaScript: getting the length of a string containing Unicode characters above 0xFFFF


javascript unicode


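I'm using this character, double sharp (U+1D12A).

To sum up my comments:
That's just the length of that string.
Some characters involve other characters as well, even if the result looks like a single character: "̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉d̉̉".length == 24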
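As a rough illustration of why `length` can exceed the number of visible characters, the sketch below writes the code points as escape sequences so they are explicit. The sample strings are my own choices, not taken from the original post.

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
const decomposed = "e\u0301";
// U+1D12A MUSICAL SYMBOL DOUBLE SHARP lies above 0xFFFF.
const doubleSharp = "\u{1D12A}";

// length counts UTF-16 code units.
console.log(decomposed.length);       // 2: base letter plus combining mark
console.log(doubleSharp.length);      // 2: one surrogate pair

// Spreading a string iterates by code point instead.
console.log([...doubleSharp].length); // 1
console.log([...decomposed].length);  // 2: the combining mark is its own code point

// NFC normalization composes "e" + U+0301 into U+00E9, but many mark
// sequences have no precomposed form, so this is not a general fix.
console.log(decomposed.normalize("NFC").length); // 1
```

So there are at least three different "lengths" to choose from: code units, code points, and what a reader perceives as characters.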

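From that blog post, they have a function that returns the correct length:

function fancyCount(str){
    const joiner = "\u{200D}";
    const split = str.split(joiner);
    let count = 0;
    for (const s of split) {
        // remove the variation selectors
        const num = Array.from(s.split(/[\ufe00-\ufe0f]/).join("")).length;
        count += num;
    }
    // assuming the joiners are used appropriately
    return count / split.length;
}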
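A short usage sketch for the function above, with sample inputs of my own written as escapes so the code points are visible. The function is repeated only so that the snippet runs standalone.

```javascript
function fancyCount(str) {
    const joiner = "\u{200D}"; // ZERO WIDTH JOINER
    const split = str.split(joiner);
    let count = 0;
    for (const s of split) {
        // Strip variation selectors, then count code points.
        count += Array.from(s.split(/[\ufe00-\ufe0f]/).join("")).length;
    }
    return count / split.length;
}

const heart = "\u2764\uFE0F"; // heart + VARIATION SELECTOR-16
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man ZWJ woman ZWJ girl

console.log(fancyCount("F\u{1D12A}")); // 2
console.log(fancyCount(heart));        // 1
console.log(fancyCount(family));       // 1

// The averaging step breaks down when joined and unjoined characters
// are mixed: this prints 5 / 3 rather than the 3 a reader would expect.
console.log(fancyCount("ab" + family));
```

The last call shows what "assuming the joiners are used appropriately" means in practice: the division is only meaningful when the whole string is a single joined sequence or contains no joiners at all.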
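"̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉d̉̉".length == 24 - some characters are longer than expected.
It depends on what you are looking for. In JavaScript, strings are made of a sequence of 16-bit "old" Unicode characters. So Unicode code points above 0xFFFF are encoded as UCS-2, with "surrogates", that is, with two old Unicode characters. New Unicode supports code points up to 10FFFF, so we have UTF-16, and we should be able to get characters as code points. [Not considering the general counting of combining characters and symbols.]
@GiacomoCatenazzi: "Unicode code points above 0xFFFF are encoded as UCS-2". No, they are encoded as UTF-16. UCS-2 predates UTF-16 and does not support code points > U+FFFF, which is why UTF-16 was created.
@RemyLebeau: It depends on the point of view. UTF-16 should be considered in terms of conformance: just one character per code point. Many languages predate UTF-16, so they encode with UCS-2. In UCS-2 you have "surrogates" (but no official support for code points outside the BMP; this is a compatibility trick in the design of UTF-16: the bytes are compatible with UCS-2). In UTF-16, surrogates as such do not exist.
Too much code. console.log([..."F
for (let i = 0; i < 0x110000; i++) { let c = String.fromCodePoint(i); console.log([...c].length, c); }
Top notch! As you can see, I quoted other people's findings. Feel free to post an answer.
This is a fairly elegant solution, but it does not work in a few cases: console.log([..."❤️"].length); == 2 - console.log([..."
Albizia, I do not appreciate you changing the task in the comments. The clarification was made by Remy Lebeau. Remy Lebeau, counting in code units has been wrong since 1993; I do not know what you are trying to achieve here. Please familiarize yourself with the Unicode standard.
daxim, in my original question I did not have to deal with that specific problem. It is just another answer from that blog post.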
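// Length in code points: the string iterator yields one element per code point.
String.prototype.codes = function() { return [...this].length };
// Length in grapheme clusters, using the third-party grapheme-splitter package.
String.prototype.chars = function() {
    let GraphemeSplitter = require('grapheme-splitter');
    return (new GraphemeSplitter()).countGraphemes(this);
};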
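As a hedged alternative that avoids both the npm dependency and patching `String.prototype`, the two counts can be sketched as plain functions. This swaps in the built-in `Intl.Segmenter` (available in recent Node.js and browser versions) for grapheme-splitter; it is not what the answer above uses, and `countCodePoints` / `countGraphemes` are names I made up for illustration.

```javascript
// Count code points: the string iterator walks code points, not code units.
function countCodePoints(str) {
    return [...str].length;
}

// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
function countGraphemes(str) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    let count = 0;
    for (const _segment of segmenter.segment(str)) {
        count += 1;
    }
    return count;
}

const samples = {
    doubleSharp: "F\u{1D12A}", // F + MUSICAL SYMBOL DOUBLE SHARP
    accented: "e\u0301",       // e + COMBINING ACUTE ACCENT
    heart: "\u2764\uFE0F",     // heart + VARIATION SELECTOR-16
};

for (const [name, s] of Object.entries(samples)) {
    console.log(name, s.length, countCodePoints(s), countGraphemes(s));
}
// doubleSharp 3 2 2
// accented 2 2 1
// heart 2 2 1
```

Keeping these as free functions sidesteps the usual concerns about extending built-in prototypes, at the cost of a slightly less fluent call site.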

console.log("F

JavaScript (and Java) strings use UTF-16 encoding.

Unicode code point U+0046 (F) is encoded in UTF-16 using 1 code unit: 0x0046
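The mapping from a code point to its UTF-16 code units can be sketched as follows. `toUtf16CodeUnits` is a hypothetical helper written for illustration; it follows the standard surrogate-pair arithmetic and assumes a valid, non-surrogate code point as input.

```javascript
// Hypothetical helper: return the UTF-16 code units for one code point.
function toUtf16CodeUnits(codePoint) {
    if (codePoint <= 0xFFFF) {
        // Basic Multilingual Plane: the code unit is the code point itself.
        return [codePoint];
    }
    // Above 0xFFFF: subtract 0x10000 and split the remaining 20 bits.
    const offset = codePoint - 0x10000;
    const high = 0xD800 + (offset >> 10);  // top 10 bits -> high surrogate
    const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> low surrogate
    return [high, low];
}

const hex = (units) => units.map((u) => "0x" + u.toString(16).toUpperCase());

console.log(hex(toUtf16CodeUnits(0x0046)));  // [ '0x46' ]
console.log(hex(toUtf16CodeUnits(0x1D12A))); // [ '0xD834', '0xDD2A' ]

// Cross-check against what the engine actually stores.
const s = String.fromCodePoint(0x1D12A);
console.log(hex([s.charCodeAt(0), s.charCodeAt(1)])); // [ '0xD834', '0xDD2A' ]
```

This is why `length` reports 2 for a single character above 0xFFFF: it counts the two surrogate code units, not the code point.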


Unicode code point U+1D12A (double sharp) is encoded in UTF-16 using 2 code units: 0xD834 0xDD2A

That's the function I wrote to get the string length in code points:
function nbUnicodeLength(string){
    var stringIndex = 0;   // position in UTF-16 code units
    var unicodeIndex = 0;  // number of code points counted so far
    var length = string.length;
    var second;
    var first;
    while (stringIndex < length) {

        // charCodeAt returns an integer between 0 and 65535: the UTF-16 code unit at the given index.
        first = string.charCodeAt(stringIndex);
        // A high surrogate followed by a low surrogate forms one code point.
        if (first >= 0xD800 && first <= 0xDBFF && length > stringIndex + 1) {
            second = string.charCodeAt(stringIndex + 1);
            if (second >= 0xDC00 && second <= 0xDFFF) {
                stringIndex += 2;
            } else {
                // Unpaired high surrogate: count it on its own.
                stringIndex += 1;
            }
        } else {
            stringIndex += 1;
        }

        unicodeIndex += 1;
    }
    return unicodeIndex;
}
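For comparison, the same count can be sketched more compactly. This is a swapped-in technique rather than the answer's method: it collapses every valid surrogate pair with a regular expression and counts what is left. `codePointLength` is a hypothetical name, and the sample strings (including the unpaired surrogates) are my own.

```javascript
// Hypothetical helper: replace each high+low surrogate pair with a single
// placeholder, then count the remaining UTF-16 code units.
function codePointLength(str) {
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, "_").length;
}

const inputs = [
    "",                   // empty string
    "F",                  // one BMP character
    "F\u{1D12A}",         // BMP character + surrogate pair
    "\uD834",             // lone high surrogate
    "\uD834\uD834\uDD2A", // lone high surrogate + valid pair
];

for (const s of inputs) {
    // The string iterator also walks code points, so both columns should agree.
    console.log(codePointLength(s), [...s].length);
}
// 0 0
// 1 1
// 2 2
// 1 1
// 2 2
```

Like the loop above, this counts an unpaired surrogate as one unit, so it agrees with `[...str].length` even on malformed input.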