Java: get the size in bytes of an encoded String without converting it to byte[]

java string


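I need to know the size, in bytes, of a String/encoding pair, but I can't use the getBytes() method, because 1) the String is very large, and duplicating it in a byte[] array would use a lot of memory, but more to the point 2) getBytes() allocates a byte[] array based on the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X elements (X is Java-version specific).

So: is there some way to calculate the byte size of a String/encoding pair directly from the String object?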

UPDATE:

Here is a working implementation of jtahlborn's answer:

private class CountingOutputStream extends OutputStream {
    int total;

    @Override
    public void write(int i) {
        throw new RuntimeException("don't use");
    }
    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override public void write(byte[] b, int offset, int len) {
        total += len;
    }
}

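Ok, this is extremely gross. I admit that, but this stuff is hidden by the JVM, so we have to dig a little. And sweat a little.

First, we want the actual char[] that backs a String, without making a copy. To do this we have to use reflection to get at the "value" field: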

char[] chars = null;
for (Field field : String.class.getDeclaredFields()) {
    if ("value".equals(field.getName())) {
        field.setAccessible(true);
        chars = (char[]) field.get(string); // <--- got it!
        break;
    }
}

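Next, you need a ByteBuffer subclass that only counts (called MyByteBuffer below, with a length field). Ignore all of the getters and implement all of the put methods, such as put(byte) and putChar(char). Inside put(byte), increment length by 1; inside put(byte[]), increment length by the array length. Get it? Whatever is put, you add its size to length. But you are not storing anything in the ByteBuffer, you are just counting and throwing away, so no space is taken. If you set breakpoints in the put methods, you can probably figure out which ones you actually need to implement. putFloat(float) is probably not used, for example.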

Now for the grand finale, putting it all together:
MyByteBuffer bbuf = new MyByteBuffer();         // your "counting" buffer
CharBuffer cbuf = CharBuffer.wrap(chars);       // wrap your char array
Charset charset = Charset.forName("UTF-8");     // your charset goes here
CharsetEncoder encoder = charset.newEncoder();  // make a new encoder
encoder.encode(cbuf, bbuf, true);               // do it!
System.out.printf("Length: %d\n", bbuf.length); // pay me US$1,000,000
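
Since application code cannot subclass ByteBuffer (its constructors are package-private), a workable variant of the same idea is to let the encoder fill one small, reusable buffer and count the bytes that pass through it. What follows is only a minimal sketch under that assumption: the class name EncodedLength, the method encodedLength, and the 8192-byte scratch buffer are illustrative choices, not part of the answer above. It configures the encoder to replace malformed input, which should mirror what getBytes() does.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: counts encoded bytes using a fixed-size scratch buffer.
final class EncodedLength {

    static long encodedLength(CharSequence s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);          // wraps the sequence, no copy of the text
        ByteBuffer out = ByteBuffer.allocate(8192);  // scratch space, reused for every chunk
        long total = 0;

        CoderResult result;
        do {
            // Encode as much as fits; OVERFLOW means the scratch buffer is full.
            result = encoder.encode(in, out, true);
            total += out.position();
            out.clear();                             // discard the bytes, keep the count
        } while (result.isOverflow());

        do {
            // Some encoders emit trailing bytes on flush.
            result = encoder.flush(out);
            total += out.position();
            out.clear();
        } while (result.isOverflow());

        return total;
    }
}
```

Calling EncodedLength.encodedLength(myString, StandardCharsets.UTF_8) needs only the scratch buffer in addition to the String itself, and the long total avoids overflow for results above 2 GB.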

Here's an apparently working implementation:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TestUnicode {

    private final static int ENCODE_CHUNK = 100;

    public static long bytesRequiredToEncode(final String s,
            final Charset encoding) {
        long count = 0;
        for (int i = 0; i < s.length(); ) {
            int end = i + ENCODE_CHUNK;
            if (end >= s.length()) {
                end = s.length();
            } else if (Character.isHighSurrogate(s.charAt(end - 1))) {
                // Do not split a surrogate pair across two chunks.
                end++;
            }
            // Only one chunk at a time is encoded into a temporary buffer.
            count += encoding.encode(s.substring(i, end)).remaining();
            i = end;
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.appendCodePoint(11614);
            sb.appendCodePoint(1061122);
            sb.appendCodePoint(2065);
            sb.appendCodePoint(1064124);
        }
        Charset cs = StandardCharsets.UTF_8;

        System.out.println(bytesRequiredToEncode(new String(sb), cs));
        System.out.println(new String(sb).getBytes(cs).length);
    }
}

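In practice I'd increase ENCODE_CHUNK to 10 MChars or so.

Probably slightly less efficient than brettw's answer, but simpler to implement.

Simple, just write it to a dummy output stream: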
class CountingOutputStream extends OutputStream {
  private int _total;

  @Override public void write(int b) {
    ++_total;
  }

  @Override public void write(byte[] b) {
    _total += b.length;
  }

  @Override public void write(byte[] b, int offset, int len) {
    _total += len;
  }

  public int getTotalSize() {
    return _total;
  }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for(int i = 0; i < myString.length(); i+=8096) {
  int end = Math.min(myString.length(), i+8096);
  writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());


Not only is it simple, it is probably just as fast as the other "complex" answers.

The same using the Apache Commons libraries:
public static long stringLength(String string, Charset charset) {

    try (NullOutputStream nul = new NullOutputStream();
         CountingOutputStream count = new CountingOutputStream(nul)) {

        IOUtils.write(string, count, charset.name());
        count.flush();
        return count.getCount();
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected I/O.", e);
    }
}

Guava has an implementation of this.
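
As far as I know, the Guava method in question is Utf8.encodedLength(CharSequence) in com.google.common.base, which covers the UTF-8 case only. For readers without Guava, the same kind of calculation can be sketched by walking the UTF-16 code units directly. This is a hand-written illustration, not Guava's code: the names Utf8Length and utf8Length are made up here, and the sketch counts an unpaired surrogate as one byte, on the assumption that the goal is to match getBytes(), which replaces it with '?'.

```java
// Hypothetical helper: UTF-8 length computed from the UTF-16 code units,
// with no intermediate buffer at all.
final class Utf8Length {

    static long utf8Length(CharSequence s) {
        long bytes = 0;
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < n
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // surrogate pair = one supplementary code point
                i++;                              // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                       // unpaired surrogate, counted as a '?' replacement
            } else {
                bytes += 3;                       // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }
}
```

This only works because the byte count per code point is fixed by the UTF-8 format; for an arbitrary Charset one of the encoder-based approaches is still needed.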
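The byte length depends on the target encoding. For example, "test".getBytes("UTF-8") is 4 bytes, but "test".getBytes("UTF-16") is 10 bytes (yes, 10, try it). So you need to clarify your question a bit.

I would like to add that it also depends on the code points ("characters") you encode. For example, in UTF-16 some code points use 1 code unit and others use 2 (one code unit is 16 bits long). UTF-8 can take from 1 to 4 bytes per character.

@brettw Sorry if I'm being dense, but yes, your comment is the point of the question: given a String and an encoding, how many bytes does it take to encode the String? Rereading the question, it seems pretty clear to me. Do you have any suggestions for rewording?

@Francis The comment above applies to your comment too, as far as I can tell.

getBytes does not create an array that is bigger than it needs to be. It creates an array of the right size for the given String. It does not create an array of length "string length * maximum possible bytes per character". And String.length() does not return the number of characters in the String, but the number of code units. For UTF-16, a code unit is 16 bits and the number of code units per character is 1 or 2, depending on the character. So either I don't understand your second point, or your assumption is incorrect.

You can avoid the ugly reflection by calling CharBuffer.wrap with the String itself. It will use the char[] from the String without copying it (at least in Oracle JDK 7 Update 21).

Oh, nice! I didn't know that.

As @JoachimSauer said a long time ago, there is no need for this reflection hack, so why does this answer still start with it? Starting with Java 9, this will fail, as the internal array is not a char[] (leaving aside the alternative JRE implementations where it failed before). Besides that, looping over getDeclaredFields() instead of just calling getDeclaredField("value") is weird, but whatever. The main idea of the answer is to create a subclass of ByteBuffer within the application, which is impossible.

@elhefe - your version may compile, but it is incorrect. You don't want to use the offset in the calculation.

Whoops, fixed. Apparently my tests only used the write(byte[]) method.

@AminSuzani - changing _total to long is sufficient.

I'm not sure what is being saved here. On trying it, OutputStreamWriter (via the StreamEncoder.write(String str, int off, int len) method) still copies the String into a character array.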
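
The point made in the first comments, that the byte count depends on both the encoding and the code points involved, can be checked with a few lines. This is just an illustrative sketch; the class name EncodingSizes and the helper encodedSize are made up for the example, and it deliberately uses getBytes() because the test strings are tiny.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: encoded size of small strings under different charsets.
final class EncodingSizes {

    static int encodedSize(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println(encodedSize("test", StandardCharsets.UTF_8));    // 4
        System.out.println(encodedSize("test", StandardCharsets.UTF_16));   // 10: 2-byte byte-order mark + 4 * 2
        System.out.println(encodedSize("test", StandardCharsets.UTF_16BE)); // 8: no byte-order mark

        String emoji = "\uD83D\uDE00";  // one supplementary code point, stored as a surrogate pair
        System.out.println(emoji.length());                                 // 2 code units
        System.out.println(emoji.codePointCount(0, emoji.length()));        // 1 code point
        System.out.println(encodedSize(emoji, StandardCharsets.UTF_8));     // 4
    }
}
```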