Java: symbols not copied correctly when fetching page source (Java, Apache HttpClient 4.x)

I am trying to fetch the source of a web page with the following code:

public static String getFile(String sUrl) throws ClientProtocolException, IOException {
    DefaultHttpClient httpclient = new DefaultHttpClient();
    StringBuilder b = new StringBuilder();

    // Prepare a request object
    HttpGet httpget = new HttpGet(sUrl);

    // Execute the request
    HttpResponse response = httpclient.execute(httpget);

    // Examine the response status
    System.out.println(response.getStatusLine());

    //status code should be 200
    if (response.getStatusLine().getStatusCode() != 200) {
        return null; 
    }

    // Get hold of the response entity
    HttpEntity entity = response.getEntity();

    // If the response does not enclose an entity, there is no need
    // to worry about connection release
    if (entity != null) {
        InputStream instream = entity.getContent();

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(instream));
            // do something useful with the response
            String s = reader.readLine();

            while (s != null) {
                b.append(s);
                b.append("\n");
                s = reader.readLine();
            }

        } catch (IOException ex) {
            // In case of an IOException the connection will be released
            // back to the connection manager automatically
            throw ex;

        } catch (RuntimeException ex) {
            // In case of an unexpected exception you may want to abort
            // the HTTP request in order to shut down the underlying
            // connection and release it back to the connection manager.
            httpget.abort();
            throw ex;

        } finally {
            // Closing the input stream will trigger connection release
            instream.close();
        }

        // When HttpClient instance is no longer needed,
        // shut down the connection manager to ensure
        // immediate deallocation of all system resources
        httpclient.getConnectionManager().shutdown();
    }

    return b.toString();
}

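It works fine, but some symbols (single quotes, among others) are not copied correctly. I tried saving the page source to Amazon S3 with content type text/html and then displaying it by opening the page saved on the S3 server. The symbols mentioned above are displayed as �.
Is there any solution?
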
You need to make sure you read the content using the page's encoding. Otherwise your system default encoding is used, which, judging by what you are seeing, is evidently not the right one.
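
As a rough illustration of reading with the page's encoding, here is a minimal sketch that uses only the JDK. `PageDecoder`, `charsetOf`, and `readAll` are hypothetical names introduced for this example and are not part of HttpClient. The sketch assumes the charset is announced in the `Content-Type` header and falls back to a caller-supplied default when it is not.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of HttpClient.
class PageDecoder {

    // Returns the charset named in a Content-Type header value such as
    // "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Reads the whole stream line by line, decoding with an explicit
    // charset instead of the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder b = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                b.append(line).append('\n');
            }
        }
        return b.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body that the server sent as UTF-8.
        byte[] body = "\u201Cquoted\u201D".getBytes(StandardCharsets.UTF_8);
        Charset cs = charsetOf("text/html; charset=UTF-8",
                               StandardCharsets.ISO_8859_1);
        System.out.print(readAll(new ByteArrayInputStream(body), cs));
    }
}
```

In the question's method, the header value would presumably come from `entity.getContentType()` (which can be null), and the stream from `entity.getContent()`. HttpClient 4.x also ships `EntityUtils.toString(entity)`, which decodes using the entity's declared charset and may be simpler than reading the stream by hand.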
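First, you need to specify the encoding used by the InputStreamReader. The constructor overload you are using takes the system default encoding.
The encoding may be passed in the response headers. It nominally defaults to ISO-8859-1, but in practice that means Windows-1252 (Windows Latin-1).
For HTML entities, Apache Commons Lang has:
String s = ...
s = StringEscapeUtils.unescapeHtml4(s);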
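
To illustrate the ISO-8859-1 versus Windows-1252 point, here is a small JDK-only sketch. `Latin1VsCp1252` and `decode` are hypothetical names for this example, and the bytes are made up. In Windows-1252 the bytes 0x93 and 0x94 are the curly double quotes, while in ISO-8859-1 the same bytes decode to invisible C1 control characters, which is one way "smart" quotes get lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class Latin1VsCp1252 {

    // Decodes raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // 0x93 ... 0x94: curly double quotes as encoded in Windows-1252.
        byte[] raw = {(byte) 0x93, 'h', 'i', (byte) 0x94};

        String asCp1252 = decode(raw, Charset.forName("windows-1252"));
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        // Windows-1252 yields U+201C and U+201D around "hi".
        System.out.println(asCp1252.equals("\u201Chi\u201D")); // true
        // ISO-8859-1 yields the control character U+0093 instead.
        System.out.println((int) asLatin1.charAt(0));           // 147
    }
}
```

So when a page declares ISO-8859-1 but contains curly quotes, decoding it as Windows-1252 is often the more forgiving choice.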

@Joop Eggen, I tried the following code: 'String enc = "Windows-1252"; ContentType contentType = ContentType.getOrDefault(entity); Charset charset = contentType.getCharset(); if (StringUtils.isNotEmpty(enc)) { enc = charset.toString(); }' and the content type I found was: text/html; charset=ISO-8859-1. I tried Windows-1252 and utf-8, without relying on the entity's content type. But the problem persists: the symbols from the original document are now displayed as ?. Thanks in advance.
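
One possible cause of the ? symptom, offered as an assumption because the comment does not show how the page is written out: the text is decoded correctly but then encoded for saving in a charset that cannot represent the character. Java's `String.getBytes(Charset)` silently substitutes the charset's replacement byte, which is `?` for ISO-8859-1. `UnmappableDemo` and `roundTrip` are hypothetical names for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class UnmappableDemo {

    // Encodes the text to bytes and decodes it again with the same charset,
    // mimicking a save-then-display cycle.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String text = "\u201Chi\u201D"; // "hi" wrapped in curly double quotes

        // ISO-8859-1 has no curly quotes, so each becomes '?'.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // ?hi?

        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

If this is what is happening, encoding the saved bytes as UTF-8 and declaring the same charset in the stored object's content type (for example text/html; charset=UTF-8) should keep the two sides consistent.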