UTF-8: HttpClient returns garbled extended characters
I'm using HttpClient to retrieve remote URLs and need to extract things such as the title. In some cases I get garbled extended characters, as with this URL. I have tried various settings, but nothing helped. Any suggestions? My configuration is as follows:
private CloseableHttpClient httpclient = RemotePageUtils.getThreadSafeClient();

public String processMethod(String url, OutputStream out) throws IOException, IllegalArgumentException {
    [...]
    BufferedReader in = null;
    HttpEntity entity = null;
    HttpGet httpget = null;
    CloseableHttpResponse resp = null;
    try {
        httpget = new HttpGet(url);
        resp = httpclient.execute(httpget);
        entity = resp.getEntity();
        String inLine;
        in = new BufferedReader(new InputStreamReader(entity.getContent(), "UTF-8"));
        while ((inLine = in.readLine()) != null) {
            out.write(inLine.getBytes("UTF-8"));
        }
    } finally {
        [...]
    }
    return null;
}

private static CloseableHttpClient getThreadSafeClient() {
    SocketConfig socketConfig = SocketConfig.custom()
            .setTcpNoDelay(true)
            .build();
    RequestConfig config = RequestConfig.custom()
            .setConnectTimeout(3000)
            .setSocketTimeout(7000)
            .setStaleConnectionCheckEnabled(false)
            .build();
    List<Header> headers = new ArrayList<Header>();
    headers.add(new BasicHeader("Accept-Charset", "ISO-8859-1,US-ASCII,UTF-8,UTF-16;q=0.7,*;q=0.7"));
    // accept gzipped
    headers.add(new BasicHeader("Accept-Encoding", "gzip,x-gzip,deflate,sdch"));
    CloseableHttpClient client = HttpClientBuilder.create()
            .setDefaultHeaders(headers)
            .setDefaultRequestConfig(config)
            .setDefaultSocketConfig(socketConfig)
            .build();
    return client;
}
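You are blindly interpreting every downloaded page as UTF-8, but the example link you gave is not UTF-8; it is ISO-8859-1.
In ISO-8859-1 an accented letter is a single byte >= 128. In UTF-8, bytes >= 128 must follow specific multi-byte patterns; otherwise they are treated as malformed and the decoder replaces them, which is the garbling you see.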
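To make the mismatch concrete, here is a minimal, self-contained sketch using only the JDK. The class and method names (`CharsetMismatchDemo`, `decodeAsUtf8`, `decodeAsLatin1`) and the sample text are made up for illustration; they are not part of the question's code.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Decodes bytes as UTF-8; malformed sequences are replaced with U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes bytes as ISO-8859-1; every byte maps to exactly one character.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Sample text (an assumption): the accented letter is U+00E9.
        String original = "caf\u00e9!";
        // In ISO-8859-1 the accented letter is the single byte 0xE9.
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset restores the text.
        System.out.println(decodeAsLatin1(latin1Bytes).equals(original));
        // Decoding the same bytes as UTF-8 yields the replacement character,
        // because 0xE9 is not followed by valid continuation bytes.
        System.out.println(decodeAsUtf8(latin1Bytes).contains("\uFFFD"));
    }
}
```

Once the replacement character has been produced, re-encoding the string as UTF-8 cannot bring the original byte back, which is why the output stays garbled.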
But why decode the downloaded bytes at all, if you only write them out as bytes again?
Instead of:
in = new BufferedReader(new InputStreamReader(entity.getContent(),"UTF-8"));
while ((inLine = in.readLine()) != null) {
out.write(inLine.getBytes("UTF-8"));
}
which converts the bytes to a string and back again, just copy the bytes.
You can do this with Apache Commons IO:
import org.apache.commons.io.IOUtils;
IOUtils.copy(entity.getContent(), out);
Or manually in plain Java:
byte[] buf = new byte[16 * 1024];
int len = 0;
InputStream in = entity.getContent();
while ((len = in.read(buf)) >= 0) {
out.write(buf, 0, len);
}
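As a further option, and assuming Java 9 or later is available (the question does not say which version is in use), the JDK's own `InputStream.transferTo` does the same byte-for-byte copy. The wrapper class `ByteCopy` below is hypothetical and exists only to keep the sketch self-contained.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ByteCopy {
    // Copies all remaining bytes from in to out without decoding them,
    // and returns the number of bytes copied. Requires Java 9+.
    static long copy(InputStream in, OutputStream out) throws IOException {
        return in.transferTo(out);
    }
}
```

With the code from the question this would be called as `ByteCopy.copy(entity.getContent(), out)`. Because no charset is involved, bytes >= 128 arrive unchanged whatever encoding the page uses.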
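I had trimmed the code a bit for readability; I am actually reading the lines in order to process them. I modified the code to guess the charset and set up the InputStreamReader accordingly, and now it works. So I will accept your answer because it pointed me in the right direction, even though it does not completely answer the question.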
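The comment does not show how the charset was guessed. One common approach, sketched here purely as an assumption, is to read the `charset` parameter from the response's Content-Type header and fall back to a default when it is missing or unknown. The class `CharsetGuess` and the method `charsetFromContentType` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetGuess {
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value such as "text/html; charset=ISO-8859-1".
    // Returns the fallback if the header is null, has no charset parameter,
    // or names a charset this JVM does not know.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetFromContentType("text/html; charset=ISO-8859-1",
                StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

In the question's code the header value could come from `entity.getContentType()` (which may be null), and the resulting `Charset` would be passed to `new InputStreamReader(entity.getContent(), charset)`. Pages that declare their encoding only in an HTML meta tag would still need a different kind of guess.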