Java: converting "PHP unicode" escape sequences to characters

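How can I convert so-called "PHP unicode" into normal characters using Java? Example: \xEF\xBC\xA1 -> Ａ. Is there any built-in method in the JDK, or should I use a regex for this conversion?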

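First you need to get the bytes out of the string and into a byte array without changing them, and then decode the byte array as a UTF-8 string.

The simplest way to get the string into a byte array is to encode it using ISO-8859-1, which maps every character with a Unicode value less than 256 to a byte with the same value (or the equivalent negative value, since Java bytes are signed).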
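The two steps described above can be sketched as follows. This is a minimal illustration, assuming the string holds one char per UTF-8 byte; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class Latin1RoundTrip {

    // Turns each char in the range 0..255 into the byte with the same value,
    // then decodes the resulting byte sequence as UTF-8.
    static String reinterpretAsUtf8(String oneCharPerByte) {
        byte[] raw = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String phpUnicode = "\u00EF\u00BC\u00A1"; // three chars, one per UTF-8 byte
        String javaString = reinterpretAsUtf8(phpUnicode);
        System.out.println(javaString.length());                   // 1
        System.out.printf("U+%04X%n", (int) javaString.charAt(0)); // U+FF21
    }
}
```

Chars above 255 cannot be represented in ISO-8859-1 and would be replaced during encoding, so this only works when every char really stands for a single byte.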

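Edit
The above code converts UTF-8 to Unicode characters. If you want to convert the result to a reasonable ASCII equivalent, there is no standard way to do that.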
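For the specific case of fullwidth forms such as U+FF21, one option is Unicode compatibility normalization (NFKC) via java.text.Normalizer. This is not part of the original answer and is not a general transliteration to ASCII; it is only a sketch of one approach, and the class and method names are invented for the example.

```java
import java.text.Normalizer;

class FullwidthToAscii {

    // NFKC replaces compatibility characters (for example fullwidth letters)
    // with their ordinary counterparts. Characters without such a mapping
    // are left unchanged.
    static String foldCompatibility(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Fullwidth A, B, C
        System.out.println(foldCompatibility("\uFF21\uFF22\uFF23")); // prints ABC
    }
}
```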

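Edit
I had assumed that you had a string containing characters with the same ordinal values as the UTF-8 sequence, but you have pointed out that your string actually contains the escape sequences, as in: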

String phpUnicode = "\\xEF\\xBC\\xA1";

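The JDK has no built-in method to convert a string like this, so you will need to write your own using a regular expression. Since we ultimately want to convert a sequence of UTF-8 bytes into a String, we need to build up a byte array, here with a ByteArrayOutputStream: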

Pattern oneChar = Pattern.compile("\\\\x([0-9A-F]{2})|(.)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher matcher = oneChar.matcher(phpUnicode);
ByteArrayOutputStream bytes = new ByteArrayOutputStream();

while (matcher.find()) {
    int ch;
    if (matcher.group(1) == null) {
        ch = matcher.group(2).charAt(0);
    }
    else {
        ch = Integer.parseInt(matcher.group(1), 16);
    }
    bytes.write((int) ch);
}
String javaString = new String(bytes.toByteArray(), "UTF-8");
System.out.println(javaString);

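This produces a UTF-8 byte stream by converting the \xAB sequences, and then converts that stream to a Java String. Note that any character that is not part of an escape sequence is converted to a byte equal to the low-order 8 bits of its Unicode value. That is fine for ASCII, but it may cause transcoding problems for non-ASCII characters.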
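One way to avoid that transcoding problem is to write literal characters as their own UTF-8 bytes instead of truncating them to 8 bits. The following is a hedged variant of the loop above, not the original author's code; the class name PhpEscapeDecoder and the method decode are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhpEscapeDecoder {

    // Group 1: the two hex digits of a \xNN escape. Group 2: any other character.
    private static final Pattern TOKEN =
            Pattern.compile("\\\\x([0-9A-Fa-f]{2})|(.)", Pattern.DOTALL);

    static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                // Escape sequence: emit the byte it names.
                bytes.write(Integer.parseInt(m.group(1), 16));
            } else {
                // Literal character: emit its UTF-8 encoding, not just its low 8 bits.
                byte[] literal = m.group(2).getBytes(StandardCharsets.UTF_8);
                bytes.write(literal, 0, literal.length);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decode("caf\u00E9 \\xEF\\xBC\\xA1");
        for (char ch : decoded.toCharArray()) {
            System.out.printf("%04x%n", (int) ch);
        }
    }
}
```

With this variant a literal non-ASCII character in the input survives the round trip, while the \xNN escapes are still decoded as UTF-8 bytes.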

@McDowell:
The sequence:

String phpUnicode = "\u00EF\u00BC\u00A1";
byte[] bytes = phpUnicode.getBytes("ISO-8859-1");

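creates a byte array containing as many bytes as the original string has characters, and for each character with a Unicode value below 256, the same numeric value is stored in the byte array.

The character FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) is not present in the original string, so the fact that it is not in ISO-8859-1 is irrelevant.

I know that transcoding errors can occur when converting characters to bytes, which is why I said that ISO-8859-1 will only "map every character with a Unicode value less than 256 to a byte with the same value".

The character in question is U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A). The PHP form (\xEF\xBC\xA1) is a UTF-8 encoded octet sequence.

To decode this sequence into a Java String (which is always UTF-16), you would use the following code: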

// \xEF\xBC\xA1
byte[] utf8 = { (byte) 0xEF, (byte) 0xBC, (byte) 0xA1 };
String utf16 = new String(utf8, Charset.forName("UTF-8"));

// print the char as hex   
for(char ch : utf16.toCharArray()) {
    System.out.format("%02x%n", (int) ch);
}
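Note that new String(bytes, charset) silently replaces malformed input with U+FFFD. If the escaped data may be truncated or invalid, a CharsetDecoder configured to report errors can be used instead. This is a sketch under that assumption, with a hypothetical helper named decodeStrict.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {

    // Decodes UTF-8 and throws instead of substituting U+FFFD for bad input.
    static String decodeStrict(byte[] utf8) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(utf8))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] complete = { (byte) 0xEF, (byte) 0xBC, (byte) 0xA1 };
        System.out.printf("%04x%n", (int) decodeStrict(complete).charAt(0)); // ff21

        byte[] truncated = { (byte) 0xEF, (byte) 0xBC }; // last byte missing
        try {
            decodeStrict(truncated);
        } catch (CharacterCodingException e) {
            System.out.println("malformed input rejected");
        }
    }
}
```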

If you want to decode the data from a string literal, you could use code of the following form:

public static void main(String[] args) {
  String utf16 = transformString("This is \\xEF\\xBC\\xA1 string");
  for (char ch : utf16.toCharArray()) {
    System.out.format("%s %02x%n", ch, (int) ch);
  }
}

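// Matches a run of \xNN escapes; XDigit restricts NN to hexadecimal digits,
// so Integer.parseInt(hex, 16) in toByteArray cannot fail on input like \xZZ
private static final Pattern SEQ 
                           = Pattern.compile("(\\\\x\\p{XDigit}{2})+");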

private static String transformString(String encoded) {
  StringBuilder decoded = new StringBuilder();
  Matcher matcher = SEQ.matcher(encoded);
  int last = 0;
  while (matcher.find()) {
    decoded.append(encoded.substring(last, matcher.start()));
    byte[] utf8 = toByteArray(encoded.substring(matcher.start(), matcher.end()));
    decoded.append(new String(utf8, Charset.forName("UTF-8")));
    last = matcher.end();
  }
  return decoded.append(encoded.substring(last, encoded.length())).toString();
}

private static byte[] toByteArray(String hexSequence) {
  byte[] utf8 = new byte[hexSequence.length() / 4];
  for (int i = 0; i < utf8.length; i++) {
    int offset = i * 4;
    String hex = hexSequence.substring(offset + 2, offset + 4);
    utf8[i] = (byte) Integer.parseInt(hex, 16);
  }
  return utf8;
}

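Is your input in string format (\xNN) or in binary format?

Very good, but I need to convert a \xNN\xNN string to a Unicode string. I have written a regexp that captures the NN characters, but how do I create a Unicode string from NN? For example, I have NN and I need "\u0NN" (string concatenation does not work here).

Java strings are UTF-16; trying to represent UTF-8 in them ("\u00EF\u00BC\u00A1") will only lead to transcoding errors. In any case, FULLWIDTH LATIN CAPITAL LETTER A does not exist in ISO-8859-1.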
