Clipboard content is messed up when copied from Firefox and read using Java in Ubuntu


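Background

I am trying to get clipboard data in the HTML data flavor using Java. To do so, I copy it from browsers into the clipboard and then read it using java.awt.datatransfer.Clipboard.

This works properly on Windows. But on Ubuntu there are some strange issues. The worst case is copying data from the Firefox browser into the clipboard.

Example to reproduce the behavior

Java code: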

import java.io.*;

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

public class WorkingWithClipboadData {

 static void doSomethingWithBytesFromClipboard(byte[] dataBytes, String paramCharset, int number) throws Exception {

  String fileName = "Result " + number + " " + paramCharset + ".txt";

  OutputStream fileOut = new FileOutputStream(fileName);
  fileOut.write(dataBytes, 0, dataBytes.length);
  fileOut.close();

 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  int count = 0;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {

System.out.println(dataFlavor);

   String mimeType = dataFlavor.getHumanPresentableName();
   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("java.io.InputStream".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && paramCharset.startsWith("UTF")) {

System.out.println("============================================");
System.out.println(paramCharset);
System.out.println("============================================");

      InputStream inputStream = (InputStream)clipboard.getData(dataFlavor);

      ByteArrayOutputStream data = new ByteArrayOutputStream();

      byte[] buffer = new byte[1024];
      int length = -1;
      while ((length = inputStream.read(buffer)) != -1) {
       data.write(buffer, 0, length);
      }
      data.flush();
      inputStream.close();

      byte[] dataBytes = data.toByteArray();
      data.close();

      doSomethingWithBytesFromClipboard(dataBytes, paramCharset, ++count);

     }
    }
   }
  }
 }

}
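
Problem description

What I am doing is opening the URL in Firefox. Then I select "letters: ä" and copy it into the clipboard. Then I run my Java program. After that, the resulting files (only some of them, as examples) look like this: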

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt" 
00000000: feff fffd fffd 006c 0000 0065 0000 0074  .......l...e...t
00000010: 0000 0074 0000 0065 0000 0072 0000 0073  ...t...e...r...s
00000020: 0000 003a 0000 0020 0000 003c 0000 0069  ...:... ...<...i
00000030: 0000 003e 0000 fffd 0000 003c 0000 002f  ...>.......<.../
00000040: 0000 0069 0000 003e 0000                 ...i...>..
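
In the UTF-8 result (Result 4 UTF-8.txt, dumped below), the leading EFBF BDEF BFBD does not look like any known byte order mark. All letters seem to be encoded in 16 bits, which is double the number of bits needed in UTF-8. So the number of bits used always seems to be double what is needed; see the UTF-16 example above. All non-ASCII letters are encoded as EFBFBD and so are lost as well.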

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt" 
00000000: efbf bdef bfbd 6c00 6500 7400 7400 6500  ......l.e.t.t.e.
00000010: 7200 7300 3a00 2000 3c00 6900 3e00 efbf  r.s.:. .<.i.>...
00000020: bd00 3c00 2f00 6900 3e00                 ..<./.i.>.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 7 UTF-16BE.txt" 
00000000: fffd fffd 006c 0000 0065 0000 0074 0000  .....l...e...t..
00000010: 0074 0000 0065 0000 0072 0000 0073 0000  .t...e...r...s..
00000020: 003a 0000 0020 0000 003c 0000 0069 0000  .:... ...<...i..
00000030: 003e 0000 fffd 0000 003c 0000 002f 0000  .>.......<.../..
00000040: 0069 0000 003e 0000                      .i...>..
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 10 UTF-16LE.txt" 
00000000: fdff fdff 6c00 0000 6500 0000 7400 0000  ....l...e...t...
00000010: 7400 0000 6500 0000 7200 0000 7300 0000  t...e...r...s...
00000020: 3a00 0000 2000 0000 3c00 0000 6900 0000  :... ...<...i...
00000030: 3e00 0000 fdff 0000 3c00 0000 2f00 0000  >.......<.../...
00000040: 6900 0000 3e00 0000                      i...>...
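
Only for completeness: the same picture as above.

So the conclusion is that after something is copied from Firefox into the Ubuntu clipboard, it is totally messed up. At least this is so for the HTML data flavors and when the clipboard is read using Java.

Other browser used

When I do the same using the Chromium browser as the data source, the problems become smaller.

So I open the URL in Chromium. Then I select "letters: ä" and copy it into the clipboard. Then I run my Java program.

The result looks like this: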

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt" 
00000000: feff 003c 006d 0065 0074 0061 0020 0068  ...<.m.e.t.a. .h
...
00000800: 0061 006c 003b 0022 003e 00e4 003c 002f  .a.l.;.".>...<./
00000810: 0069 003e 0000                           .i.>..
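
The same holds for the UTF-8 result: the encoding looks correct, but here too there is one additional 00 byte at the end, which should not be there.

Environment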

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"


Mozilla Firefox 61.0.1 (64-Bit)


java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
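
Questions

Am I doing something wrong in my code?

Can someone advise how to avoid this messed up content in the clipboard? Since the non-ASCII characters are lost, at least when copying from Firefox, I don't think we can repair this content.

Is this somehow a known issue? Can someone confirm the same behavior? If so, is there already a bug report about this in Firefox?

Or is this a problem that occurs only when Java code reads the clipboard content? It seems so. If I copy content from Firefox and paste it into LibreOffice Writer, the Unicode appears correctly. If I then copy the content from Writer into the clipboard and read it using my Java program, the UTF encodings are correct except for the additional 00 bytes at the end. So clipboard content copied from Writer behaves like content copied from the Chromium browser.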


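New insights

The bytes 0xFFFD seem to be the Unicode character REPLACEMENT CHARACTER (U+FFFD). So 0xFDFF is its little-endian representation and 0xEFBFBD is its UTF-8 encoding. So all results seem to be the outcome of wrongly decoding and re-encoding Unicode.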
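
The three byte patterns can be checked directly. The following is a minimal sketch, not part of the original program; the class name ReplacementCharEncodings and the helper hex are made up for illustration. It encodes U+FFFD in the three charsets that appear in the dumps.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
class ReplacementCharEncodings {

    // Formats bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String replacement = "\uFFFD"; // U+FFFD REPLACEMENT CHARACTER
        System.out.println("UTF-16BE: " + hex(replacement.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(replacement.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-8:    " + hex(replacement.getBytes(StandardCharsets.UTF_8)));
    }
}
```

This prints FF FD, FD FF and EF BF BD, which are the patterns seen in the UTF-16BE, UTF-16LE and UTF-8 result files.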

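It seems the clipboard content coming from Firefox is always UTF-16LE with BOM, but Java takes it as UTF-8. So the 2-byte BOM becomes two garbled characters, each replaced with 0xEFBFBD; every additional 0x00 byte becomes its own NUL character; and every byte sequence that is not valid UTF-8 becomes a garbled character, replaced with 0xEFBFBD. Then this pseudo UTF-8 gets re-encoded. Now the garbage is complete.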

Example:

The sequence aɛaüa in UTF-16LE with BOM:
0xFFFE 6100 5B02 6100 FC00 6100

Taken as UTF-8 (0xEFBFBD = not a valid UTF-8 byte sequence), this becomes:
0xEFBFBD 0xEFBFBD a NUL [ STX a NUL 0xEFBFBD NUL a NUL

This pseudo ASCII re-encoded to UTF-16LE will be:
0xFDFF FDFF 6100 0000 5B00 0200 6100 0000 FDFF 0000 6100 0000

This pseudo ASCII re-encoded to UTF-8 will be:
0xEFBF BDEF BFBD 6100 5B02 6100 EFBF BD00 6100

And this is exactly what happens.
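
The aɛaüa example can be reproduced without any clipboard. The sketch below only simulates the suspected failure (UTF-16LE bytes with BOM decoded as UTF-8 and then re-encoded); it does not touch the real clipboard code path. The class name MisdecodeRoundTrip and its helper methods are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of the suspected mis-decoding, only for illustration.
class MisdecodeRoundTrip {

    // Formats bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes text as UTF-16LE and prepends the BOM bytes FF FE.
    static byte[] utf16leWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Decodes the bytes as UTF-8 although they are UTF-16LE.
    // new String(byte[], Charset) replaces malformed input with U+FFFD.
    static String misdecodeAsUtf8(byte[] utf16leBytes) {
        return new String(utf16leBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] original = utf16leWithBom("a\u025Ba\u00FCa"); // aɛaüa
        String pseudo = misdecodeAsUtf8(original);
        System.out.println("original:    " + hex(original));
        System.out.println("as UTF-8:    " + hex(pseudo.getBytes(StandardCharsets.UTF_8)));
        System.out.println("as UTF-16LE: " + hex(pseudo.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

The second and third lines of the output should match the UTF-8 and UTF-16LE sequences listed in the example above, which supports the theory that the bytes are decoded with the wrong charset before Java hands them out.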

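Other examples:

Â = 0x00C2 = C200 in UTF-16LE = 0xEFBFBD00 in pseudo UTF-8

胂 = 0x80C2 = C280 in UTF-16LE = 0xC280 in pseudo UTF-8

So I think Firefox is not to blame for this, but Ubuntu or Java's runtime environment. And because copy/paste from Firefox to Writer works in Ubuntu, I think Java's runtime environment does not handle the Firefox data flavors in the Ubuntu clipboard correctly.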


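New insights:

I compared the flavormap.properties files of my Windows 10 and my Ubuntu and found one difference. In Ubuntu the native name for text/html is UTF8_STRING, while in Windows it is HTML Format. So I thought this might be the problem, and I added the line

HTML\ Format=text/html;charset=utf-8;eoln="\n";terminators=0

to my flavormap.properties file in Ubuntu.

After that: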

Map<DataFlavor,String> nativesForFlavors = SystemFlavorMap.getDefaultFlavorMap().getNativesForFlavors(
   new DataFlavor[]{
   new DataFlavor("text/html;charset=UTF-16LE")
   });

System.out.println(nativesForFlavors);

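But there was no change in the results of the Ubuntu clipboard content when read by Java.

After looking at this for quite a while, it looks like this is a long-standing problem (there are even older reports).

It looks like with X11, Java components expect clipboard data to always be UTF-8 encoded, while Firefox encodes the data as UTF-16. Because of this assumption, Java mangles the text by forcing the UTF-16 to be parsed as UTF-8. I tried, but could not find a way around it.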
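
Why the mangled bytes cannot simply be converted back can be shown with a small simulation. This is a sketch under the assumption that the damage is exactly "UTF-16LE decoded as UTF-8"; the class name LossyMisdecode and the method mangle are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration that the mis-decoding loses information.
class LossyMisdecode {

    // Simulates the assumed failure: UTF-16LE bytes are decoded as UTF-8
    // (malformed input becomes U+FFFD) and the result is re-encoded as UTF-8.
    static byte[] mangle(String text) {
        byte[] utf16le = text.getBytes(StandardCharsets.UTF_16LE);
        String pseudo = new String(utf16le, StandardCharsets.UTF_8);
        return pseudo.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withAUmlaut = mangle("x\u00E4y"); // xäy
        byte[] withUUmlaut = mangle("x\u00FCy"); // xüy
        // Both umlauts collapse to the same replacement character,
        // so the original letter cannot be recovered from these bytes.
        System.out.println("same mangled bytes: " + Arrays.equals(withAUmlaut, withUUmlaut));
    }
}
```

Because different characters end up as the same bytes, any repair has to take the lost characters from another source, for example a flavor that arrives intact.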
The flavor-map lookup shown earlier (the getNativesForFlavors call) prints:

{java.awt.datatransfer.DataFlavor[mimetype=text/html;representationclass=java.io.InputStream;charset=UTF-16LE]=HTML Format}

The following program works around the problem: it collects the non-ASCII characters from the text/plain flavor and substitutes them, in order, for the garbled characters in the text/html data:

import java.io.*;

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

import java.nio.charset.Charset;

public class WorkingWithClipboadDataBytesUTF8 {

 static byte[] repairUTF8HTMLDataBytes(byte[] plainDataBytes, byte[] htmlDataBytes) throws Exception {

  //get all the not ASCII characters from plainDataBytes
  //we need them for replacement later
  String plain = new String(plainDataBytes, Charset.forName("UTF-8"));
  char[] chars = plain.toCharArray();
  StringBuffer unicodeChars = new StringBuffer();
  for (int i = 0; i < chars.length; i++) {
   if (chars[i] > 127) unicodeChars.append(chars[i]);
  }
System.out.println(unicodeChars);

  //omit the first 6 bytes from htmlDataBytes which are the wrong BOM
  htmlDataBytes = java.util.Arrays.copyOfRange(htmlDataBytes, 6, htmlDataBytes.length);

  //The wrong UTF-8 encoded single bytes which are not replaced by `0xefbfbd` 
  //are coincidentally UTF-16LE if two bytes immediately following each other.
  //So we are "repairing" this accordingly. 
  //Goal: all garbage shall be the replacement character 0xFFFD.

  //replace parts of a surrogate pair with 0xFFFD
  //replace the wrong UTF-8 bytes 0xefbfbd for replacement character with 0xFFFD
  ByteArrayInputStream in = new ByteArrayInputStream(htmlDataBytes);
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  int b = -1;
  int[] btmp = new int[6];
  while ((b = in.read()) != -1) {
   btmp[0] = b;
   btmp[1] = in.read(); //there must always be two bytes because of the wrong encoding of 16 bit Unicode
   if (btmp[0] != 0xef && btmp[1] != 0xef) { // not a replacement character
    if (btmp[1] > 0xd7 && btmp[1] < 0xe0) { // part of a surrogate pair
     out.write(0xFD); out.write(0xFF);
    } else {
     out.write(btmp[0]); out.write(btmp[1]); //two default bytes
    }
   } else { // at least one must be the replacement 0xefbfbd
    btmp[2] = in.read(); btmp[3] = in.read(); //there must be at least two further bytes
    if (btmp[0] != 0xef && btmp[1] == 0xef && btmp[2] == 0xbf && btmp[3] == 0xbd ||
        btmp[0] == 0xef && btmp[1] == 0xbf && btmp[2] == 0xbd && btmp[3] != 0xef) {
     out.write(0xFD); out.write(0xFF);
    } else if (btmp[0] == 0xef && btmp[1] == 0xbf && btmp[2] == 0xbd && btmp[3] == 0xef) {
     btmp[4] = in.read(); btmp[5] = in.read();
     if (btmp[4] == 0xbf &&  btmp[5] == 0xbd) {
      out.write(0xFD); out.write(0xFF);
     } else {
      throw new Exception("Wrong byte sequence: "
      + String.format("%02X%02X%02X%02X%02X%02X", btmp[0], btmp[1], btmp[2], btmp[3], btmp[4], btmp[5]), 
      new Throwable().fillInStackTrace());
     }
    } else {
     throw new Exception("Wrong byte sequence: " 
      + String.format("%02X%02X%02X%02X%02X%02X", btmp[0], btmp[1], btmp[2], btmp[3], btmp[4], btmp[5]),
      new Throwable().fillInStackTrace());
    }
   }
  }

  htmlDataBytes = out.toByteArray();

  //now get this as UTF_16LE (2 byte for each character, little endian)
  String html = new String(htmlDataBytes, Charset.forName("UTF-16LE"));
System.out.println(html);

  //replace all of the wrongUnicode with the unicodeChars selected from plainDataBytes
  boolean insideTag = false;
  int unicodeCharCount = 0;
  char[] textChars = html.toCharArray();
  StringBuffer newHTML = new StringBuffer();
  for (int i = 0; i < textChars.length; i++) {
   if (textChars[i] == '<') insideTag = true;
   if (textChars[i] == '>') insideTag = false;
   if (!insideTag && textChars[i] > 127) {
    if (unicodeCharCount >= unicodeChars.length()) 
     throw new Exception("Unicode chars count don't match. " 
      + "We got from plain text " + unicodeChars.length() + " chars. Text until now:\n" + newHTML,
      new Throwable().fillInStackTrace());

    newHTML.append(unicodeChars.charAt(unicodeCharCount++));
   } else {
    newHTML.append(textChars[i]);
   }
  }

  html = newHTML.toString();
System.out.println(html);

  return html.getBytes("UTF-8");

 }

 static void doSomethingWithUTF8BytesFromClipboard(byte[] plainDataBytes, byte[] htmlDataBytes) throws Exception {

  if (plainDataBytes != null && htmlDataBytes != null) {

   String fileName; 
   OutputStream fileOut;

   fileName = "ResultPlainText.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(plainDataBytes, 0, plainDataBytes.length);
   fileOut.close();

   fileName = "ResultHTMLRaw.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(htmlDataBytes, 0, htmlDataBytes.length);
   fileOut.close();

   //do we have wrong encoded UTF-8 in htmlDataBytes?
   //(length is checked first so short clipboard content cannot cause an ArrayIndexOutOfBoundsException)
   if (htmlDataBytes.length >= 6
    && htmlDataBytes[0] == (byte)0xef && htmlDataBytes[1] == (byte)0xbf && htmlDataBytes[2] == (byte)0xbd 
    && htmlDataBytes[3] == (byte)0xef && htmlDataBytes[4] == (byte)0xbf && htmlDataBytes[5] == (byte)0xbd) {
    //try repair the UTF-8 HTML data bytes
    htmlDataBytes = repairUTF8HTMLDataBytes(plainDataBytes, htmlDataBytes);
   //do we have additional 0x00 byte at the end?
   } else if (htmlDataBytes.length > 0 && htmlDataBytes[htmlDataBytes.length-1] == (byte)0x00) {
    //do repair this
    htmlDataBytes = java.util.Arrays.copyOf(htmlDataBytes, htmlDataBytes.length-1);
   }

   fileName = "ResultHTML.txt";
   fileOut = new FileOutputStream(fileName);
   fileOut.write(htmlDataBytes, 0, htmlDataBytes.length);
   fileOut.close();

  }

 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  byte[] htmlDataBytes = null;
  byte[] plainDataBytes = null;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {

   String mimeType = dataFlavor.getHumanPresentableName();

   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("[B".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && "UTF-8".equalsIgnoreCase(paramCharset)) {

      htmlDataBytes = (byte[])clipboard.getData(dataFlavor);

     }
    } //else if("java.io.InputStream".equals(paramClass)) ...

   } else if ("text/plain".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("[B".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null  && "UTF-8".equalsIgnoreCase(paramCharset)) {

      plainDataBytes = (byte[])clipboard.getData(dataFlavor);

     }
    } //else if("java.io.InputStream".equals(paramClass)) ...
   }
  }

  doSomethingWithUTF8BytesFromClipboard(plainDataBytes, htmlDataBytes);

 }

}