Java: Why doesn't String's hashCode() cache 0?


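I noticed in the Java 6 source code for String that hashCode only caches values other than 0. The following snippet shows the difference in performance: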

public class Main{
   static void test(String s) {
      long start = System.currentTimeMillis();
      for (int i = 0; i < 10000000; i++) {
         s.hashCode();
      }
      System.out.format("Took %d ms.%n", System.currentTimeMillis() - start);
   }
   public static void main(String[] args) {
      String z = "Allocator redistricts; strict allocator redistricts strictly.";
      test(z);
      test(z.toUpperCase());
   }
}

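So my questions are:

Why doesn't String's hashCode() cache 0?
What is the probability that a Java string hashes to 0?
What is the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?
Is this the best-practice way of caching values? (i.e. cache everything except one value?)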

For your amusement, each line here is a string that hashes to 0:

pollinating sandboxes
amusement & hemophilias
schoolworks = perversive
electrolysissweeteners.net
constitutionalunstableness.net
grinnerslaphappier.org
BLEACHINGFEMININELY.NET
WWW.BUMRACEGOERS.ORG
WWW.RACCOONPRUDENTIALS.NET
Microcomputers: the unredeemed lollipop...
Incentively, my dear, I don't tessellate a derangement.
A person who never yodelled an apology, never preened vocalizing transsexuals.
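A claim like the one above is easy to check. The following is only a small sketch: the class name `ZeroHashCheck` and the helper `hashesToZero` are made up for illustration, and the sample list mixes two entries from the list above with two control values.

```java
final class ZeroHashCheck {
    // True when the string's hash code is 0, the one value that the
    // Java 6 implementation cannot tell apart from "not computed yet".
    static boolean hashesToZero(String s) {
        return s.hashCode() == 0;
    }

    public static void main(String[] args) {
        String[] samples = {
            "pollinating sandboxes",   // taken from the list above
            "WWW.BUMRACEGOERS.ORG",    // taken from the list above
            "",                        // empty string: hash defined as 0
            "an ordinary string"       // control value
        };
        for (String s : samples) {
            System.out.printf("\"%s\" -> hashCode=%d, zero=%b%n",
                    s, s.hashCode(), hashesToZero(s));
        }
    }
}
```

Running it prints the hash code of each sample, so every entry of the list can be verified the same way.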

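It uses 0 to indicate "I haven't worked out the hashcode yet". The alternative would be to use a separate Boolean flag, which would take more memory. (Or not to cache the hashcode at all, of course.)

I don't expect many strings to hash to 0; arguably it would make sense for the hashing routine to deliberately avoid 0 (e.g. translate a hash of 0 to 1, and cache that). That would increase collisions but avoid rehashing. It is too late to do that now, though, because the String hashCode algorithm is explicitly documented.

As for whether this is a good idea in general: it is certainly an efficient caching mechanism, and it might (see edit) be even better with a change that avoids rehashing values that end up with a hash of 0. Personally, I would be interested to see the data that led Sun to believe this was worth doing in the first place: it takes up an extra 4 bytes for every string ever created, however often or rarely it is hashed, and the only benefit is for strings that are hashed more than once.

EDIT: As KevinB points out in a comment elsewhere, the "avoid 0" suggestion above may well have a net cost, because it helps a very rare case but requires an extra comparison for every hash calculation.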
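To make the "avoid 0" idea concrete, here is a minimal sketch of a wrapper that remaps a hash of 0 to 1 before caching it. The class `NonZeroHashText` is hypothetical and is not how java.lang.String works; it only illustrates the trade-off discussed above, and it ignores thread-safety.

```java
final class NonZeroHashText {
    private final String value;
    private int hash; // 0 means "not computed yet"

    NonZeroHashText(String value) {
        this.value = value;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            h = value.hashCode();
            if (h == 0) {
                h = 1; // the extra comparison: remap 0 so it can be cached
            }
            hash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof NonZeroHashText
                && value.equals(((NonZeroHashText) o).value);
    }
}
```

With this scheme a zero-hash string collides with every string whose real hash is 1, and each first-time computation pays for the additional `h == 0` check, which is the net cost the edit refers to.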

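0 isn't cached because the implementation interprets a cached value of 0 as "cached value not yet initialised". The alternative would have been to use a java.lang.Integer, where null means that the value has not been cached yet. However, this would have meant extra storage overhead.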
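The `Integer` alternative can be sketched as follows. `BoxedHashText` and `isCached` are hypothetical names used only for illustration; the point is that `null` becomes the "not cached" marker, so 0 is an ordinary cacheable value, at the price of an extra object reference (and possibly a boxed object) per instance.

```java
final class BoxedHashText {
    private final String value;
    private Integer cachedHash; // null means "not computed yet"; 0 is a valid cached value

    BoxedHashText(String value) {
        this.value = value;
    }

    // Exposed only so the caching behaviour can be observed.
    boolean isCached() {
        return cachedHash != null;
    }

    @Override
    public int hashCode() {
        Integer h = cachedHash;
        if (h == null) {
            h = value.hashCode(); // autoboxed into an Integer
            cachedHash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BoxedHashText
                && value.equals(((BoxedHashText) o).value);
    }
}
```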

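Regarding the probability of a String's hash code being computed as 0, I would say it is quite low. It can happen in the following cases:

The String is empty (although recomputing this hash code each time is effectively O(1)).

An overflow occurs such that the final computed hash code is 0 (e.g. Integer.MAX_VALUE + h(c1) + h(c2) + ... h(cn) == 0).

The String contains only the Unicode character 0. This is very unlikely, because it is a control character with no meaning apart from in the "paper tape world" (!):

Quoting the cited source:

Code 0 (ASCII code name NUL) is a special case. In paper tape, it is the case when there are no holes. It is convenient to treat this as a fill character with no meaning otherwise.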

Why doesn't String's hashCode() cache 0?

The value zero is reserved to mean "the hash code is not cached".

What is the probability that a Java string hashes to 0?

According to the Javadoc, the formula for a String's hash code is:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

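using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation. (As a special case, the hash of the empty string is defined to be zero.)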
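The formula can be evaluated with Horner's rule, which is a minimal way to see how the wrapping int arithmetic behaves. The class `PolyHash` below is a hypothetical helper written for illustration, not the JDK source; it should agree with `String.hashCode()` for any input.

```java
final class PolyHash {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] using
    // Horner's rule. int overflow wraps silently, which is exactly
    // the "int arithmetic" the Javadoc formula refers to.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a", "ab", "hello"}) {
            System.out.println("\"" + s + "\": formula=" + hash(s)
                    + ", String.hashCode()=" + s.hashCode());
        }
    }
}
```

For "ab" this gives 97 * 31 + 98 = 3105, and the empty string falls out as 0 without needing a special case in the loop.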

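My intuition is that the hashcode function above gives a uniform spread of String hash values across the range of int values. A uniform spread would mean that the probability of a randomly generated String hashing to zero is 1 in 2^32.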

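What is the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?

The best strategy is to ignore the issue. If you are repeatedly hashing the same String value, there is something rather strange about your algorithm.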

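Is this the best-practice way of caching values? (i.e. cache everything except one value?)

This is a space versus time trade-off. As far as I know, the alternatives are:

Add a cached flag to each String object, making every Java String take an extra word.
Use the top bit of the hash member as the cached flag. That way you can cache all hash values, but you only have half as many possible String hash values.
Don't cache hash codes on Strings at all.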
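The second alternative, reusing the top bit of the hash field as the flag, could look roughly like this. `TopBitHashText`, `CACHED_FLAG` and `isCached` are hypothetical names, and the sketch assumes the visible hash is simply the ordinary hash with its top bit cleared, which is one of several ways to interpret the suggestion.

```java
final class TopBitHashText {
    private static final int CACHED_FLAG = 0x80000000; // top bit marks "computed"

    private final String value;
    private int hash; // top bit clear means "not computed yet"

    TopBitHashText(String value) {
        this.value = value;
    }

    // Exposed only so the caching behaviour can be observed.
    boolean isCached() {
        return (hash & CACHED_FLAG) != 0;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if ((h & CACHED_FLAG) == 0) {
            // Keep the low 31 bits of the real hash and set the flag bit.
            h = (value.hashCode() & Integer.MAX_VALUE) | CACHED_FLAG;
            hash = h;
        }
        return h & Integer.MAX_VALUE; // strip the flag: only 2^31 distinct results
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof TopBitHashText
                && value.equals(((TopBitHashText) o).value);
    }
}
```

Every value, including 0, is now cached after the first call, but the result range shrinks to non-negative ints, which is the "half as many possible hash values" cost mentioned above.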

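I think the Java designers made the right call for Strings, and I'm sure they did extensive profiling that confirms the soundness of their decision. However, it does not follow that this is always the best way to deal with caching.

(Note that there are two "common" String values that hash to zero: the empty String, and the String consisting of just a NUL character. However, the cost of calculating the hash codes for these values is small compared with the cost of calculating the hash code for a typical String value.)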

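You're worrying about nothing. Here is a way to think about this issue.

Suppose you have an application that does nothing but sit around hashing Strings all year long. Say it takes a thousand strings, all in memory, calls hashCode() on them repeatedly in round-robin fashion, a million times through, then gets another thousand new strings and does it again.

And suppose that the likelihood of a string's hash code being zero were, in fact, much greater than 1/2^32. I'm sure it is somewhat greater than 1/2^32, but let's say it is a lot worse than that, like 1/2^16 (the square root! Now that's a lot worse!).

In this situation, you stand to benefit more than anyone else alive from Oracle's engineers improving how these strings' hash codes are cached. So what about you?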

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

 public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

  if (h == 0 && value.length > 0) ...

java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                           VALUE
  0    12          (object header)                           N/A
 12     4   char[] String.value                              N/A
 16     4      int String.hash                               N/A
 20     4          (loss due to the next object alignment)
 Instance size: 24 bytes
 Space losses: 0 bytes internal + 4 bytes external = 4 bytes total

Space losses : ... 4 bytes total.

java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                           VALUE
  0    12          (object header)                           N/A
 12     4   byte[] String.value                              N/A
 16     4      int String.hash                               N/A
 20     1     byte String.coder                              N/A
 21     3          (loss due to the next object alignment)
 Instance size: 24 bytes
 Space losses: 0 bytes internal + 3 bytes external = 3 bytes total

java.lang.String object internals:
OFFSET  SIZE      TYPE DESCRIPTION                            VALUE
  0    12           (object header)                           N/A
 12     4    byte[] String.value                              N/A
 16     4       int String.hash                               N/A
 20     1      byte String.coder                              N/A
 21     1   boolean String.hashIsZero                         N/A
 22     2           (loss due to the next object alignment)
 Instance size: 24 bytes
 Space losses: 0 bytes internal + 2 bytes external = 2 bytes total

public int hashCode() {
    int h = hash;
    if (h == 0 && !hashIsZero) {
        h = isLatin1() ? StringLatin1.hashCode(value)
                       : StringUTF16.hashCode(value);
        if (h == 0) {
            hashIsZero = true;
        } else {
            hash = h;
        }
    }
    return h;
}

    @Override
    public int hashCode(){
        if(!hashCodeComputed){
            // or any other sane computation
            hash = 42;
            hashCodeComputed = true;
        }

        return hash;
    }

    // The hash or hashIsZero fields are subject to a benign data race,
    // making it crucial to ensure that any observable result of the
    // calculation in this method stays correct under any possible read of
    // these fields. Necessary restrictions to allow this to be correct
    // without explicit memory fences or similar concurrency primitives is
    // that we can ever only write to one of these two fields for a given
    // String instance, and that the computation is idempotent and derived
    // from immutable state