Java: Cleaning HTML input with HtmlCleaner

I'm writing a Java project that uses the HtmlCleaner library and saves the output as an XML file. This is the code I wrote:

URL urlSB = new URL("http://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "google.com");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());

// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
        tagNodeRoot , "cleaned.xml", "utf-8"
);

The problem is that after running the project, the cleaned.xml file is empty.

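The problem is that the page you are trying to access is configured to redirect to HTTPS. Java's URLConnection follows redirects only within the same protocol, so the HTTP-to-HTTPS redirect is not followed and the input stream is empty. If you change the URL to HTTPS, it works fine: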

URL urlSB = new URL("https://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:5.0) Gecko/20100101 Firefox/25.0");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());
new PrettyXmlSerializer(props).writeToFile(tagNodeRoot, "cleaned.xml", "utf-8");
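
If you would rather not hard-code the HTTPS URL, one option is to follow the redirect yourself. The following is only a minimal sketch and not part of the original answer: the class name RedirectFollower, its method names, and the redirect limit are all hypothetical, and it assumes the server announces the new address in a standard Location header. It uses only the JDK, so the returned stream can be passed to cleaner.clean(...) as before.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper (not from the original post): follows redirects by hand,
// including HTTP -> HTTPS hops that URLConnection does not follow on its own.
final class RedirectFollower {

    // Resolves a Location header value (absolute or relative) against the current URL.
    static String resolveRedirect(String currentUrl, String location) {
        return URI.create(currentUrl).resolve(location).toString();
    }

    // True for the HTTP status codes that signal a redirect with a Location header.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Opens startUrl, following at most maxRedirects redirects, and returns the body stream.
    static InputStream openFollowingRedirects(String startUrl, String userAgent, int maxRedirects)
            throws IOException {
        String current = startUrl;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            // Disable automatic redirects so every hop is handled by the loop below.
            conn.setInstanceFollowRedirects(false);
            conn.setRequestProperty("User-Agent", userAgent);

            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn.getInputStream();
            }

            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = resolveRedirect(current, location);
        }
        throw new IOException("Too many redirects starting at " + startUrl);
    }
}
```

With this helper, the cleaning step would become something like cleaner.clean(RedirectFollower.openFollowingRedirects("http://www.groupon.com/browse/chicago?z=skip", "Mozilla/5.0", 5)). The limit of 5 is an arbitrary choice to guard against redirect loops.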