Java: Why do I get Unicode escapes from HTML?

java parsing unicode

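I wrote a parser without any third-party libraries. It fetches the HTML code of a website, but some parts of that code contain Unicode escape sequences, for example: "\u003cbr/>Sign in to your TV service provider to access\u003cbr/>". I think something is wrong with the encoding. How can I fix it? Sorry for my English. Thanks, everyone.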

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.io.Reader;
    import java.net.URL;
    import java.net.URLConnection;

    public class Main {
        public static void main(String[] args) throws IOException {
            String commandLine = Scraper.readLineFromConsole();
            Reader reader = Scraper.getReader(commandLine);
            Scraper.writeInFileFromURL(reader);
        }

        public static class Scraper {
            // Copies everything from the given reader into newFile.txt.
            public static void writeInFileFromURL(Reader out) {
                BufferedReader br = new BufferedReader(out);

                try {
                    PrintWriter writer = new PrintWriter("newFile.txt");
                    String htmltext;
                    while (br.ready()) {
                        htmltext = br.readLine();
                        writer.write(htmltext);
                    }
                    writer.flush();
                    writer.close();

                } catch (IOException e) {
                    e.printStackTrace();
                }
            }

            // Reads a single line (a URL or a file name) from standard input.
            public static String readLineFromConsole() {
                BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
                String commandLine = null;
                try {
                    commandLine = reader.readLine();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                return commandLine;
            }

            public static Reader getReader(String url)
                    throws IOException {
                // Retrieve from Internet.
                if (url.startsWith("http:") || url.startsWith("https:")) {
                    URLConnection conn = new URL(url).openConnection();
                    return new InputStreamReader(conn.getInputStream());
                }
                // Retrieve from file.
                else {
                    return new FileReader(url);
                }
            }
        }
    }

Why are you reinventing the wheel? Have you tried telling the reader which charset you are using? For example, with

    new InputStreamReader(new FileInputStream(yourFileName), StandardCharsets.UTF_8)
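To make the charset suggestion concrete, here is a minimal sketch of reading a stream with an explicit charset. The class name `CharsetReadSketch` and the helper `readAll` are hypothetical names chosen for illustration, and UTF-8 is only an assumption about what the site actually sends. The loop runs until `readLine()` returns `null`, since `ready()` only reports whether a read would block, not whether the end of the stream has been reached.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetReadSketch {

    // Hypothetical helper: decodes the whole stream using the given charset
    // instead of the platform default, and returns the text line by line.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            // readLine() returns null at end of stream.
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for conn.getInputStream(): a few UTF-8 encoded bytes.
        byte[] bytes = "<p>h\u00e9llo</p>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```

With the scraper from the question, the same idea would mean passing a charset as the second argument of `new InputStreamReader(conn.getInputStream(), ...)`. Note that this only affects how bytes are decoded; it will not change literal backslash-u sequences that are already present in the page text.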

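It looks like the source is written that way on purpose, as a way of escaping XML/XHTML inside JS code. I would expect the HTML generated after the script runs not to contain these escapes.

If you are asking about the "while" loop, it's the most basic MSRD0,

I tried your approach, and it didn't help =/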
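If the escapes really are part of the page source, as the comment above suggests, one option is to decode them after downloading. Below is a rough sketch, assuming the text only contains plain `\uXXXX` sequences with four hex digits; `UnicodeUnescapeSketch` and `unescape` are hypothetical names, not part of any library. It is deliberately simplistic: it is not a JavaScript string parser and does not treat an escaped backslash in front of the `u` specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescapeSketch {

    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape sequence with the character it encodes.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' from being treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes make this the same literal text the scraper receives.
        String raw = "\\u003cbr/>Sign in\\u003cbr/>";
        System.out.println(unescape(raw)); // prints <br/>Sign in<br/>
    }
}
```

In the scraper from the question, a call like this could be applied to each line before it is written to the file.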