UTF-8: garbled extended characters with HttpClient

Tags: utf-8, httpclient, apache-httpclient-4.x

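I am using HttpClient to retrieve remote URLs and need to extract things such as the page title.

In some cases I get garbled extended characters, as with this URL.

I have tried various settings, but nothing helped. Any suggestions? My configuration is as follows: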

private CloseableHttpClient httpclient = RemotePageUtils.getThreadSafeClient();

public String processMethod(String url, OutputStream out) throws IOException, IllegalArgumentException{

    [...]

    BufferedReader in = null;
    HttpEntity entity = null;
    HttpGet httpget = null;

    CloseableHttpResponse resp = null;

    try {

        httpget = new HttpGet(url);

        resp = httpclient.execute(httpget);

        entity = resp.getEntity();

        String inLine;

        in = new BufferedReader(new InputStreamReader(entity.getContent(),"UTF-8"));

        while ((inLine = in.readLine()) != null) {

            out.write(inLine.getBytes("UTF-8"));
        }

    } finally {

        [...]

    }
    return null;
}

private static CloseableHttpClient getThreadSafeClient() {

    SocketConfig socketConfig = SocketConfig.custom()
            .setTcpNoDelay(true)
            .build();

    RequestConfig config = RequestConfig.custom()
            .setConnectTimeout(3000)
            .setSocketTimeout(7000)
            .setStaleConnectionCheckEnabled(false)
            .build();

    List<Header> headers = new ArrayList<Header>();
    headers.add(new BasicHeader("Accept-Charset","ISO-8859-1,US-ASCII,UTF-8,UTF-16;q=0.7,*;q=0.7"));
    //accept gzipped
    headers.add(new BasicHeader("Accept-Encoding","gzip,x-gzip,deflate,sdch"));


    CloseableHttpClient client = HttpClientBuilder.create()
            .setDefaultHeaders(headers)
            .setDefaultRequestConfig(config)
            .setDefaultSocketConfig(socketConfig)
            .build();

    return client;

}

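You are blindly interpreting every downloaded page as UTF-8, but the example link you gave is not UTF-8; it is ISO-8859-1.

In ISO-8859-1 an accented letter is a single byte >= 128. In UTF-8, bytes in that range must follow specific multi-byte patterns; otherwise they are treated as malformed.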
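
To make the mismatch concrete, here is a minimal sketch using only the JDK. The class name `CharsetMismatchDemo` and the sample word are made up for illustration; they are not taken from the question. It encodes a string as an ISO-8859-1 server would send it, then decodes those bytes both ways.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answer.
final class CharsetMismatchDemo {

    // Decode as UTF-8. Malformed byte sequences are replaced with U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode as ISO-8859-1. Every byte maps to exactly one character.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // In ISO-8859-1 the letter é (U+00E9) is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(latin1);   // é is lost, replaced by U+FFFD
        String right = decodeAsLatin1(latin1); // round-trips correctly

        System.out.println("contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("latin1 round trip ok: " + right.equals("caf\u00e9"));
    }
}
```

Once the replacement character is in the string, writing it back out with `getBytes("UTF-8")` cannot recover the original byte, which is why the output looks garbled.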

But why decode the downloaded bytes at all, only to write bytes to a file again?

Instead of:

 in = new BufferedReader(new InputStreamReader(entity.getContent(),"UTF-8"));
 while ((inLine = in.readLine()) != null) {
     out.write(inLine.getBytes("UTF-8"));
 }
which converts the bytes to a string and back again, just copy the bytes directly.

You can do this with Apache Commons IO:

import org.apache.commons.io.IOUtils;

IOUtils.copy(entity.getContent(), out);
Or manually in plain Java:

byte[] buf = new byte[16 * 1024];
int len = 0;
InputStream in = entity.getContent();
while ((len = in.read(buf)) >= 0) {
    out.write(buf, 0, len);
}

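The code was trimmed a little for readability; I actually read the lines in order to process them. I modified the code to guess the charset and adjust the InputStreamReader accordingly, and now it works. So I will accept your answer, since it pointed me in the right direction, even though it does not fully answer the question.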
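
The comment does not show how the charset is guessed, so the following is only one possible sketch, not the asker's actual code. It assumes the charset is taken from the `charset` parameter of the response's `Content-Type` header value, with ISO-8859-1 as an assumed fallback when none is declared. The class `CharsetGuess` and its methods are hypothetical names, and only the JDK is used. With HttpClient 4.x the header value would presumably come from `entity.getContentType()`, which may be null.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question or answer.
final class CharsetGuess {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback if the header
    // is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Wraps the response body in a reader that uses the declared charset.
    // The ISO-8859-1 fallback is an assumption made for this sketch.
    static BufferedReader readerFor(InputStream body, String contentType) {
        Charset cs = charsetFromContentType(contentType, StandardCharsets.ISO_8859_1);
        return new BufferedReader(new InputStreamReader(body, cs));
    }
}
```

Lines read from `readerFor(entity.getContent(), headerValue)` are then decoded with the page's own charset and can still be written out as UTF-8. Pages that declare their charset only in an HTML `<meta>` tag are not handled by this sketch.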