How do I correctly compute the length of a String in Java?

java string unicode character-encoding

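I know there is String#length, and that the various methods in Character operate more or less on code units/code points.

What is the suggested way in Java to actually return the result specified by the Unicode standards, taking into account things like language/locale, normalization and grapheme clusters?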

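The normal model of Java string length

String.length() is specified as returning the number of char values ("code units") in the String. That is the most generally useful definition of the length of a Java String; see below.

Your description (see note 1) of the semantics of length, based on the size of the backing array or array slice, is incorrect. The fact that the value returned by length() is also the size of the backing array or array slice is merely an implementation detail of typical Java class libraries. String does not need to be implemented that way. Indeed, I think I have seen Java String implementations that were not implemented that way.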

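Alternative models of string length

To get the number of Unicode code points in a String, use str.codePointCount(0, str.length()) (see the String Javadoc).
To get the size (in bytes) of a String in a specific encoding (i.e. charset), use str.getBytes(charset).length (see note 2).
To deal with locale-specific issues, you can use Normalizer to normalize the String to whatever form is most appropriate to your use case, and then use codePointCount as above. But in some cases even this won't work; e.g. the Hungarian letter-counting rules, which the Unicode standard apparently does not cater for.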
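
As a rough illustration of how these models can disagree on the same String, here is a minimal sketch. The class name StringLengthModels, its helper methods and the sample string are made up for this example; only String.length(), String.codePointCount, String.getBytes(Charset) and java.text.Normalizer come from the standard library.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class StringLengthModels {

    // Number of UTF-16 code units (char values).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Size in bytes when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Code points after normalizing to NFC (composed form).
    static int nfcCodePoints(String s) {
        String normalized = Normalizer.normalize(s, Normalizer.Form.NFC);
        return normalized.codePointCount(0, normalized.length());
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT, followed by U+1F600 as a surrogate pair.
        String s = "e\u0301\uD83D\uDE00";
        System.out.println("code units:      " + codeUnits(s));
        System.out.println("code points:     " + codePoints(s));
        System.out.println("UTF-8 bytes:     " + utf8Bytes(s));
        System.out.println("NFC code points: " + nfcCodePoints(s));
    }
}
```

Which of these numbers is "the length" depends entirely on what the caller intends to do with it.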

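Using String.length() is generally OK

The reason most applications use String.length() is that most applications are not concerned with counting the characters in words, texts and so on in a human-centric way. For instance, if I do this: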
String s = "hi mum how are you";
int pos = s.indexOf("mum");
String textAfterMum = s.substring(pos + "mum".length());

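it really doesn't matter that "mum".length() is not returning code points, or that it is not a linguistically correct character count. It measures the length of the string using the model that is appropriate to the task at hand. And it works.

Obviously, things get a bit more complicated when you do multilingual text analysis; e.g. searching for words. But even then, if you normalize your text and parameters before you start, you can safely code in terms of "code units" rather than "code points" most of the time; i.e. length() still works.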
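
A minimal sketch of that "normalize first, then work in code units" approach follows. The class NormalizedSearch and the method textAfter are hypothetical names, and NFC is just one reasonable choice of normalization form, not something the answer prescribes.

```java
import java.text.Normalizer;
import java.util.Optional;

// Hypothetical helper class, for illustration only.
class NormalizedSearch {

    // Returns the text following the first occurrence of term, comparing
    // both strings in NFC so composed and decomposed spellings match.
    static Optional<String> textAfter(String text, String term) {
        String t = Normalizer.normalize(text, Normalizer.Form.NFC);
        String q = Normalizer.normalize(term, Normalizer.Form.NFC);
        int pos = t.indexOf(q);
        if (pos < 0) {
            return Optional.empty();
        }
        // Plain code-unit arithmetic is fine here: pos and q.length()
        // are both measured in the same normalized string model.
        return Optional.of(t.substring(pos + q.length()));
    }

    public static void main(String[] args) {
        // The text spells the accented letter as 'e' + combining accent,
        // while the search term uses the precomposed letter.
        String text = "hi me\u0301re how are you";
        System.out.println(textAfter(text, "m\u00E9re").orElse("<not found>"));
    }
}
```

Note that the returned substring is taken from the normalized copy, so offsets should not be applied back to the unnormalized input.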

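1 - This description appeared in some versions of the question. See the edit history ... if you have sufficient rep points.

2 - Using str.getBytes(charset).length entails doing the encoding and then throwing the result away. There is possibly a general way to do this without that copy. It would entail wrapping the String as a CharBuffer, creating a custom ByteBuffer with no backing store to act as a byte counter, and then using Encoder.encode(...) to count the bytes. Note: I have not tried this, and I would not recommend trying unless you have clear evidence that getBytes(charset) is a significant performance bottleneck.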
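
For the curious, here is a sketch of a variation on the idea in note 2. It is not the custom ByteBuffer described there (ByteBuffer cannot be subclassed from application code); instead it swaps in a small reusable scratch buffer that is cleared after every encoding step, so only the byte count is kept. The class name EncodedLength, the method name and the 1024-byte buffer size are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodedLength {

    // Counts the bytes produced by encoding text, without materializing
    // the whole encoded array.
    static long encodedLength(CharSequence text, Charset charset) {
        // REPLACE mirrors what String.getBytes(Charset) does with
        // malformed or unmappable input.
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer scratch = ByteBuffer.allocate(1024);
        long total = 0;
        CoderResult result;
        // Encode chunk by chunk; OVERFLOW means the scratch buffer filled up.
        do {
            result = encoder.encode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        // Some encoders emit trailing bytes on flush.
        do {
            result = encoder.flush(scratch);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        return total;
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo \uD83D\uDE00";
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

As the note says, this is only worth considering once profiling shows that getBytes(charset) is actually a bottleneck.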
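If you mean counting the length of a String according to the grammatical rules of a language, then the answer is no: there is no such algorithm in Java, nor anywhere else. Not unless the algorithm also does a full semantic analysis of the text.
In Hungarian, for example, sz and zs can count as one letter or as two, depending on the composition of the word they appear in. (For example, ország is 5 letters, whereas torzság is 7.)
Update: if all you want is the Unicode-standard character count (which, as I pointed out, is not accurate), then transforming your string to the NFKC form could be a solution.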
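
A small sketch of the NFKC suggestion, which also shows why the result is not a linguistic letter count. The class NfkcCount and its method are hypothetical names, and java.text.Normalizer is assumed here as the way to perform the transformation.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class NfkcCount {

    // Number of code points after compatibility composition (NFKC).
    static int nfkcCodePointCount(String s) {
        String normalized = Normalizer.normalize(s, Normalizer.Form.NFKC);
        return normalized.codePointCount(0, normalized.length());
    }

    public static void main(String[] args) {
        // Decomposed spelling of "ország": 'a' followed by a combining acute.
        String orszag = "orsza\u0301g";
        System.out.println(orszag.length());            // code units before normalizing
        System.out.println(nfkcCodePointCount(orszag)); // code points after NFKC

        // NFKC also expands compatibility characters such as the "fi" ligature.
        System.out.println(nfkcCodePointCount("\uFB01"));
    }
}
```

For ország the NFKC count treats s and z as separate characters, so it does not match the Hungarian count of 5 letters given above.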
java.text.BreakIterator is able to iterate over text and can report "character", word, sentence and line boundaries.
Consider this code:
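def length(text: String, locale: java.util.Locale = java.util.Locale.ENGLISH): Int = {
  // A character-instance BreakIterator reports boundaries between user-perceived characters
  val charIterator = java.text.BreakIterator.getCharacterInstance(locale)
  charIterator.setText(text)

  var result = 0
  // Each successful next() steps over exactly one character boundary
  while (charIterator.next() != java.text.BreakIterator.DONE) result += 1
  result
}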

Running it:
scala> val text = "Thîs lóo̰ks we̐ird!"
text: java.lang.String = Thîs lóo̰ks we̐ird!

scala> val len = length(text)
len: Int = 17

scala> val codepoints = text.codePointCount(0, text.length)
codepoints: Int = 21 

With surrogate pairs:
scala> val parens = "\uDBFF\uDFFCsurpi\u0301se!\uDBFF\uDFFD"

It depends on exactly what you mean by "length of [the] String":


String.length() returns the number of chars (UTF-16 code units) in the String. This is normally only useful for programming-related tasks such as allocating buffers, because characters outside the Basic Multilingual Plane are stored as surrogate pairs, so one char does not always correspond to one Unicode code point.
String.codePointCount(int, int) and Character.codePointCount(CharSequence, int, int) both return the number of Unicode code points in the String. This is normally only useful for programming-related tasks that need to treat a String as a series of Unicode code points without worrying about surrogate pairs getting in the way.
BreakIterator.getCharacterInstance(Locale) can be used to get the next grapheme in a String for the given Locale. Calling it repeatedly lets you count the number of graphemes in a String. Since graphemes are basically letters (in most circumstances), this method is useful for getting the number of writable characters the String contains. Essentially it returns approximately the same number you would get if you manually counted the letters in the String, which makes it useful for things like sizing user interfaces and splitting Strings without corrupting the data.
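
The three counts described above can be put side by side with a short sketch like the following. This is not the class the answer refers to; GraphemeCount, its method name and the sample string are invented for illustration, and Locale.ROOT is an arbitrary choice.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class GraphemeCount {

    // Counts user-perceived characters by walking character boundaries.
    static int graphemes(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' + COMBINING ACUTE ACCENT, followed by U+1F600 as a surrogate pair.
        String s = "e\u0301\uD83D\uDE00";
        System.out.println("String.length():        " + s.length());
        System.out.println("String.codePointCount(): " + s.codePointCount(0, s.length()));
        System.out.println("BreakIterator count:     " + graphemes(s, Locale.ROOT));
    }
}
```

How BreakIterator groups more complex sequences (for example emoji joined with zero-width joiners) can vary between JDK versions, so the grapheme count is best treated as an approximation.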


To give you an idea of how each of the different methods can return different lengths for the exact same data, I created this class to quickly generate the lengths of the Unicode text contained within this page, which is designed to offer a comprehensive test of many different languages with non-English characters. Here are the results of executing that code after normalizing the input file in three different ways (no normalization, NFC, NFD):

Input UTF-8 String
>>  String.length() = 3431
>>  String.codePointCount(int,int) = 3431
>>  BreakIterator.getCharacterInstance(Locale) = 3386
NFC Normalized UTF-8 String
>>  String.length() = 3431
>>  String.codePointCount(int,int) = 3431
>>  BreakIterator.getCharacterInstance(Locale) = 3386
NFD Normalized UTF-8 String
>>  String.length() = 3554
>>  String.codePointCount(int,int) = 3554
>>  BreakIterator.getCharacterInstance(Locale) = 3386

As you can see, even the "same-looking" String