Extra EFBFBD bytes when reading through Java Hadoop thriftfs


java python hadoop character-encoding


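Hadoop 0.20 ships a thriftfs contrib module that lets us access HDFS from other programming languages, and Hadoop provides an hdfs.py script as a demo. The problem is in its do_get and do_put methods. If we use get to download a UTF-8 text file, everything is fine, but when we get a file in any other encoding, we do not get the original file back: the downloaded file contains many extra "EFBFBD" bytes. I suspect this Java code in HadoopThriftServer is the cause: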

public String read(ThriftHandle tout, long offset,
                    int length) throws ThriftIOException {
   try {
     now = now();
     HadoopThriftHandler.LOG.debug("read: " + tout.id +
                                  " offset: " + offset +
                                  " length: " + length);
     FSDataInputStream in = (FSDataInputStream)lookup(tout.id);
     if (in.getPos() != offset) {
       in.seek(offset);
     }
     byte[] tmp = new byte[length];
     int numbytes = in.read(offset, tmp, 0, length);
     HadoopThriftHandler.LOG.debug("read done: " + tout.id);
     return new String(tmp, 0, numbytes, "UTF-8");
   } catch (IOException e) {
     throw new ThriftIOException(e.getMessage());
   }
 }

The Python code in hdfs.py is listed at the end of this post.

I hope someone can help me.

Thanks.

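I noticed two things: 1. new String(tmp, 0, numbytes, "UTF-8") only works as expected when the input bytes are UTF-8 encoded text, which is probably why only UTF-8 files work (the encoding is hard-coded to UTF-8 instead of using the appropriate one), and 2. EFBFBD is the UTF-8 encoding of the character substituted for invalid input. What are you doing with those strings? If you just want to copy a file, you should not convert it to a String at all.

Thanks. This code actually comes from the Hadoop 0.20 release. I found that the problem is a bug in the Hadoop Thrift contrib; the String needs to be changed to a ByteBuffer to fix it.

Hard-coded encoding. skip without checking the return value. Skips bytes, returns chars. No byte-oriented API. Where does the author of this code live, so I can hit him over the head?

To clarify Karol's answer: EFBFBD is the UTF-8 encoding of U+FFFD, the replacement character, i.e. the infamous square with a question mark (�).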
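As a rough illustration of the point above, here is a small Python analogue (not the server's Java code) of what happens when non-UTF-8 bytes are decoded as UTF-8 with replacement and then written back out. The sample text and the choice of GBK are assumptions made only for the demo, and the exact number of replacement characters can differ between decoders.

```python
# Sketch: a lossy decode/encode round trip turns non-UTF-8 bytes into EF BF BD.
REPLACEMENT = "\ufffd".encode("utf-8")  # U+FFFD REPLACEMENT CHARACTER as UTF-8


def lossy_round_trip(raw: bytes) -> bytes:
    """Decode as UTF-8, replacing invalid sequences, then encode back to UTF-8."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Assumed sample: two CJK characters stored in GBK, which is not valid UTF-8.
original = "\u4e2d\u6587".encode("gbk")
damaged = lossy_round_trip(original)

print(REPLACEMENT.hex())       # efbfbd
print(original.hex())          # the bytes that were stored in the file
print(damaged.hex())           # the bytes after the round trip
print(REPLACEMENT in damaged)  # True: the original bytes are gone
```

Text that is already valid UTF-8 survives the same round trip unchanged, which matches the observation that only UTF-8 files download correctly.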

# do_get from hdfs.py (the Python code referenced in the question)
output = open(local, 'wb')
path = Pathname()
path.pathname = hdfs
input = self.client.open(path)

# find size of hdfs file
filesize = self.client.stat(path).length

# read 1 MB at a time from hdfs
offset = 0
chunksize = 1024 * 1024
while True:
    chunk = self.client.read(input, offset, chunksize)
    if not chunk:
        break
    output.write(chunk)
    offset += chunksize
    if offset >= filesize:
        break

self.client.close(input)
output.close()
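Following the fix mentioned in the comments (returning raw bytes instead of a String), a byte-safe download loop might look roughly like the sketch below. FakeByteClient and its read_bytes method are hypothetical stand-ins for a Thrift client whose read call returns bytes; they are not part of Hadoop's thriftfs API. The loop also advances the offset by the number of bytes actually received rather than by the requested chunk size.

```python
import io


class FakeByteClient:
    """Hypothetical stand-in for a Thrift client whose read returns raw bytes."""

    def __init__(self, data: bytes):
        self._data = data

    def read_bytes(self, offset: int, length: int) -> bytes:
        # Returns at most `length` bytes; empty bytes at end of file.
        return self._data[offset:offset + length]


def copy_bytes(client, output, chunksize=1024 * 1024):
    """Copy chunks until an empty one is returned; return total bytes written."""
    offset = 0
    while True:
        chunk = client.read_bytes(offset, chunksize)
        if not chunk:
            break
        output.write(chunk)
        offset += len(chunk)  # advance by what was actually read
    return offset


# Assumed sample payload: GBK-encoded text, i.e. bytes that are not valid UTF-8.
payload = "\u4e2d\u6587".encode("gbk") * 1000
sink = io.BytesIO()
written = copy_bytes(FakeByteClient(payload), sink, chunksize=256)
print(written == len(payload), sink.getvalue() == payload)
```

Because nothing is decoded along the way, the copy is byte-for-byte identical regardless of the file's encoding.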