Detecting and retrieving code points and surrogates from a Delphi string


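I am trying to better understand surrogate pairs and the Unicode implementation in Delphi.

If I call Length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I get back 8.

This is because the lengths of the individual characters [Ĥ], [à̲], [V̂], and [e] are 2, 3, 2, and 1 respectively. This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate, and e has no surrogates.

If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that? I know I'm going to need to do some sort of testing of the individual bytes. I ran some tests using this routine, referenced in another answer:

function GetFirstCodepointSize(const S: UTF8String): Integer;

but got some unusual results. For example, here are the lengths and sizes of some different codepoints.

First set: this makes sense to me. Each codepoint has a size of 2, but each one is a single character and Delphi gives me a length of just 1. Perfect.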

INPUT:      ď       GetFirstCodePointSize = 2       Length =1
INPUT:      ơ       GetFirstCodePointSize = 2       Length =1
INPUT:      ǥ       GetFirstCodePointSize = 2       Length =1

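Second set: here the length and the codepoint size initially look reversed to me. I'm guessing this is because the character and its surrogates are treated separately, so the first codepoint size is that of the 'H', which is 1, while the length returned covers the 'H' plus the '^'.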

INPUT:      Ĥ      GetFirstCodePointSize = 1       Length =2
INPUT:      à̲     GetFirstCodePointSize = 1       Length =3
INPUT:      V̂      GetFirstCodePointSize = 1       Length =2
INPUT:      e       GetFirstCodePointSize = 1       Length =1

Some additional tests:

INPUT:      ¼       GetFirstCodePointSize = 2       Length =1
INPUT:      ₧       GetFirstCodePointSize = 3       Length =1
INPUT:      
  I am trying to better understand surrogate pairs and Unicode implementation in Delphi. 


Let's get some terminology out of the way.

Each abstract character defined by Unicode is assigned a unique codepoint. What a user perceives as a single "character" (known as a grapheme) may consist of one or more codepoints.

In a Unicode Transformation Format (UTF) encoding - UTF-7, UTF-8, UTF-16, and UTF-32 - each codepoint is encoded as a sequence of codeunits.  The size of each codeunit is determined by the encoding - 7 bits for UTF-7, 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32 (hence their names).

In Delphi 2009 and later, String is an alias for UnicodeString, and Char is an alias for WideChar. WideChar is 16 bits. A UnicodeString holds a UTF-16 encoded string (in earlier versions of Delphi, the equivalent string type was WideString), and each WideChar is a UTF-16 codeunit.

In UTF-16, a codepoint can be encoded using either 1 or 2 codeunits.  1 codeunit can encode codepoint values in the Basic Multilingual Plane (BMP) range - $0000 to $FFFF, inclusive.  Higher codepoints require 2 codeunits, which is also known as a surrogate pair.
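
The 1-or-2 codeunit rule can be made concrete with a little arithmetic. The following is a minimal sketch, not RTL code: EncodeUtf16 is a hypothetical helper written for this illustration, and it assumes the input is a valid codepoint (it does not reject values above $10FFFF or inside the surrogate range).

```pascal
program SurrogatePairDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not part of the RTL): encodes one codepoint as
// UTF-16 and returns the number of codeunits needed (1 or 2).
function EncodeUtf16(CodePoint: Cardinal; out HighUnit, LowUnit: Word): Integer;
begin
  if CodePoint < $10000 then
  begin
    // BMP codepoint: a single codeunit holds the value directly
    HighUnit := Word(CodePoint);
    LowUnit := 0;
    Result := 1;
  end
  else
  begin
    // Supplementary codepoint: split the 20-bit offset into two halves
    Dec(CodePoint, $10000);
    HighUnit := Word($D800 + (CodePoint shr 10));   // high (lead) surrogate
    LowUnit := Word($DC00 + (CodePoint and $3FF));  // low (trail) surrogate
    Result := 2;
  end;
end;

var
  HighUnit, LowUnit: Word;
  Count: Integer;
begin
  Count := EncodeUtf16($0124, HighUnit, LowUnit);
  Writeln(Count, ' ', IntToHex(HighUnit, 4));                            // 1 0124
  Count := EncodeUtf16($1F600, HighUnit, LowUnit);
  Writeln(Count, ' ', IntToHex(HighUnit, 4), ' ', IntToHex(LowUnit, 4)); // 2 D83D DE00
end.
```

Only the second call produces a surrogate pair; a codepoint in the BMP never does, no matter how many marks follow it in the string.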


  If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back, 8. 
  
  This is because the lengths of the individual characters [Ĥ],[à̲],[V̂], and [e] are 2, 3, 2, and 1 respectively.
  
  This is because Ĥ has a surrogate,  à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates. 


Yes, there are 8 WideChar elements (codeunits) in your UTF-16 UnicodeString. What you are calling "surrogates" are actually known as "combining marks". Each combining mark is its own unique codepoint, and thus its own codeunit sequence.
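
To see where the 8 comes from, it can help to dump the individual codeunits. This is a small sketch assuming Delphi 2009 or later, and assuming the string is stored in decomposed form with the marks U+0302 (circumflex), U+0300 (grave) and U+0332 (low line); the exact marks and their order in the original string are a guess, so they are spelled out with escapes here.

```pascal
program DumpCodeUnits;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  I: Integer;
begin
  // Assumed decomposed spelling of the question's string:
  // H + circumflex, a + grave + low line, V + circumflex, e
  S := 'H'#$0302 + 'a'#$0300#$0332 + 'V'#$0302 + 'e';

  Writeln('Length = ', Length(S));  // Length = 8

  // Every element is a BMP codepoint, so each one occupies exactly
  // one codeunit and no surrogates are involved.
  for I := 1 to Length(S) do
    Writeln(I, ': U+', IntToHex(Ord(S[I]), 4));
end.
```

None of the printed values falls in the surrogate range $D800 to $DFFF, which is why the extra length comes from combining marks rather than from surrogates.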


  If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that?


You have to start at the beginning of the UnicodeString and analyze each WideChar until you find one that is not a combining mark attached to a previous WideChar. On Windows, the easiest way to do that is to use the CharNextW() function, eg:

var
  S: String;
  P: PChar;
begin
  S := 'Ĥà̲V̂e';
  P := CharNext(PChar(S)); // returns a pointer to  à̲
end;

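The Delphi RTL does not have an equivalent function. You would have to write one manually, or use a third-party library. The RTL does have a StrNextChar() function, but it only handles UTF-16 surrogates, not combining marks (CharNext() handles both). So you could use StrNextChar() to scan through each codepoint in the UnicodeString, but you have to look at each codepoint to know whether it is a combining mark, eg:

uses
  SysUtils, Character;

function MyCharNext(P: PChar): PChar;
begin
  if (P <> nil) and (P^ <> #0) then
  begin
    // StrNextChar() steps over one codepoint (1 or 2 codeunits)
    Result := StrNextChar(P);
    // then skip any marks attached to it; all three mark categories
    // are checked, because marks such as U+0302 are classified as
    // ucNonSpacingMark rather than ucCombiningMark
    while GetUnicodeCategory(Result^) in
      [ucCombiningMark, ucEnclosingMark, ucNonSpacingMark] do
      Result := StrNextChar(Result);
  end else begin
    Result := nil;
  end;
end;

var
  S: String;
  P: PChar;
begin
  S := 'Ĥà̲V̂e';
  P := MyCharNext(PChar(S)); // should return a pointer to  à̲
end;

A variant of the same approach that also extracts the element at a given index:
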
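uses
  SysUtils, Character;

// Returns a pointer to the start of the next element: steps over one
// codepoint, then over any marks attached to it.
function MyCharNext(P: PChar): PChar;
begin
  Result := P;
  if (Result <> nil) and (Result^ <> #0) then
  begin
    Result := StrNextChar(Result);
    // marks such as U+0302 are ucNonSpacingMark, so check all three
    while GetUnicodeCategory(Result^) in
      [ucCombiningMark, ucEnclosingMark, ucNonSpacingMark] do
      Result := StrNextChar(Result);
  end;
end;

// Returns the element at the 1-based index StrIdx, or '' if out of range.
function GetElementAtIndex(const S: String; StrIdx: Integer): String;
var
  pStart, pEnd: PChar;
begin
  Result := '';
  if (S = '') or (StrIdx < 1) then Exit;
  pStart := PChar(S);
  while StrIdx > 1 do
  begin
    pStart := MyCharNext(pStart);
    if pStart^ = #0 then Exit; // ran off the end of the string
    Dec(StrIdx);
  end;
  pEnd := MyCharNext(pStart);
  {$POINTERMATH ON}
  SetString(Result, pStart, pEnd - pStart);
end;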


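  I know I'm going to need to do some sort of testing of the individual bytes.

Not the bytes, but the codepoints that they represent when decoded.

  I ran some tests, using this routine

  function GetFirstCodepointSize(const S: UTF8String): Integer;

Look closely at that function signature. See the parameter type? It is a UTF-8 string, not a UTF-16 string. This was even stated in the answer that you got that function from:

  Here's an example how to parse UTF8 string

UTF-8 and UTF-16 are very different encodings, and thus have different semantics. You cannot use UTF-8 semantics to process a UTF-16 string, and vice versa.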
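
The difference shows up directly in the numbers from the question's tables. The sketch below assumes Delphi 2009 or later; Utf8ByteCount is a hypothetical helper introduced only for this comparison, built on the RTL's UTF8Encode().

```pascal
program Utf8VsUtf16;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: number of bytes S occupies once converted to UTF-8.
function Utf8ByteCount(const S: string): Integer;
var
  U: UTF8String;
begin
  U := UTF8Encode(S);   // UTF-16 -> UTF-8 conversion
  Result := Length(U);  // Length() of a UTF8String counts bytes
end;

procedure Report(const Name, S: string);
begin
  Writeln(Name, ': UTF-16 codeunits = ', Length(S),
    ', UTF-8 bytes = ', Utf8ByteCount(S));
end;

begin
  Report('U+00BC', WideChar($00BC));   // 1 codeunit, 2 bytes
  Report('U+20A7', WideChar($20A7));   // 1 codeunit, 3 bytes
  Report('H + U+0302', 'H'#$0302);     // 2 codeunits, 3 bytes (1 for H, 2 for the mark)
end.
```

A routine that reports the size of the first UTF-8 codepoint therefore returns 2 or 3 for ¼ and ₧, and 1 for a decomposed Ĥ (the plain 'H'), while Length() on the UTF-16 string counts codeunits. The two results measure different things and are not expected to match.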
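  Is there a reliable way in Delphi to determine where an element in a Unicode String begins and ends?

Not directly. You have to parse the string from the beginning, skipping elements as needed until you reach the desired element. Keep in mind that each codepoint may be encoded as either 1 or 2 codeunit elements, and each logical glyph may be encoded using multiple codepoints (and thus multiple codeunit sequences).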
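
As an illustration of that parsing loop, here is a rough sketch that counts the elements in a string by stepping over one codepoint (1 or 2 codeunits) and then over any marks that follow it. IsMark and CountElements are hypothetical names, and the sketch assumes the global GetUnicodeCategory() function from the Character unit (newer Delphi versions expose the same check through the Char helper). It only looks at BMP marks and does not implement full Unicode grapheme cluster rules, so sequences such as emoji joined with zero-width joiners are counted as separate elements.

```pascal
program CountElementsDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

// Hypothetical helper: True if C belongs to one of the three Unicode
// mark categories (Mn, Mc, Me).
function IsMark(C: Char): Boolean;
begin
  Result := GetUnicodeCategory(C) in
    [ucNonSpacingMark, ucCombiningMark, ucEnclosingMark];
end;

// Hypothetical helper: counts base codepoints together with their marks.
function CountElements(const S: string): Integer;
var
  I, Len: Integer;
begin
  Result := 0;
  Len := Length(S);
  I := 1;
  while I <= Len do
  begin
    Inc(Result);
    // Step over the base codepoint: 2 codeunits for a surrogate pair,
    // otherwise 1.
    if (S[I] >= #$D800) and (S[I] <= #$DBFF) and (I < Len) and
       (S[I + 1] >= #$DC00) and (S[I + 1] <= #$DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    // Step over any marks attached to that codepoint.
    while (I <= Len) and IsMark(S[I]) do
      Inc(I);
  end;
end;

begin
  // Assumed decomposed spelling of the question's string (8 codeunits)
  Writeln(CountElements('H'#$0302 + 'a'#$0300#$0332 + 'V'#$0302 + 'e'));  // 4
end.
```

The same loop, stopped early, gives the start and end offsets of any particular element.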
I know that my terminology using the word e