Java XML parser hangs when building from an HTTP input stream
java, html, xml, parsing, well-formed
I'm trying to open an HTTP connection to a website and parse the HTML into an org.w3c.dom.Document. I can open the HTTP connection and print the web page to the console, but if I pass the InputStream object to the XML parser, it hangs for about a minute and then outputs this error:
[Fatal Error] :108:55: Open quote is expected for attribute "{1}" associated with an element type "onload".
Code:
private static Document getInputStream(String url) throws IOException, SAXException, ParserConfigurationException
{
    System.out.println(url);
    URL webUrl = new URL(url);
    URLConnection connection = webUrl.openConnection();
    connection.setConnectTimeout(60 * 1000);
    connection.setReadTimeout(60 * 1000);
    InputStream stream = connection.getInputStream();
    DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
    domFactory.setNamespaceAware(true);
    DocumentBuilder builder = domFactory.newDocumentBuilder();
    Document doc = builder.parse(stream); // This line is hanging
    return doc;
}
Stack trace when paused:
Thread [main] (Suspended)
SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]
SocketInputStream.read(byte[], int, int) line: not available
BufferedInputStream.fill() line: not available
BufferedInputStream.read1(byte[], int, int) line: not available
BufferedInputStream.read(byte[], int, int) line: not available
HttpClient.parseHTTPHeader(MessageHeader, ProgressSource, HttpURLConnection) line: not available
HttpClient.parseHTTP(MessageHeader, ProgressSource, HttpURLConnection) line: not available
HttpURLConnection.getInputStream() line: not available
XMLEntityManager.setupCurrentEntity(String, XMLInputSource, boolean, boolean) line: not available
XMLEntityManager.startEntity(String, XMLInputSource, boolean, boolean) line: not available
XMLEntityManager.startDTDEntity(XMLInputSource) line: not available
XMLDTDScannerImpl.setInputSource(XMLInputSource) line: not available
XMLDocumentScannerImpl$DTDDriver.dispatch(boolean) line: not available
XMLDocumentScannerImpl$DTDDriver.next() line: not available
XMLDocumentScannerImpl$PrologDriver.next() line: not available
XMLNSDocumentScannerImpl(XMLDocumentScannerImpl).next() line: not available
XMLNSDocumentScannerImpl.next() line: not available
XMLNSDocumentScannerImpl(XMLDocumentFragmentScannerImpl).scanDocument(boolean) line: not available
XIncludeAwareParserConfiguration(XML11Configuration).parse(boolean) line: not available
XIncludeAwareParserConfiguration(XML11Configuration).parse(XMLInputSource) line: not available
DOMParser(XMLParser).parse(XMLInputSource) line: not available
DOMParser.parse(InputSource) line: not available
DocumentBuilderImpl.parse(InputSource) line: not available
DocumentBuilderImpl(DocumentBuilder).parse(InputStream) line: not available
MSCommunicator.getInputStream(String) line: 45
MSCommunicator.getGamePageFromForum(int, int, int) line: 70
MSCommunicator.getGamePageFromForum(int, int) line: 57
Game.<init>(int, int) line: 21
MSCommunicator.main(String[]) line: 26
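You can't expect HTML to parse into an XML DOM tree. It isn't necessarily valid XML, so you will probably need to clean it up first. See the answers to this question:
Even if the HTML page you get is correct, well-formed HTML, it may not be well-formed XML. For example, this is valid in HTML 4: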
<p class=myclass>Paragraph<br>Next line</p>
whereas in XML (XHTML) this is the form that is considered valid:
<p class="myclass">Paragraph<br/>Next line</p>
Note the closed <br/> tag and the quotes around the class attribute of the <p> tag.
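To make the difference concrete, here is a minimal sketch that feeds both snippets above to the same kind of DocumentBuilder the question uses. The class name WellFormedCheck and the helper isWellFormedXml are made up for illustration; only the standard JAXP calls are real. The parser may still print a "[Fatal Error]" line to stderr for the first snippet.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

public class WellFormedCheck {

    /** Hypothetical helper: true if the markup parses as well-formed XML. */
    static boolean isWellFormedXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // SAXParseException (a SAXException) signals a well-formedness error.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // HTML 4 style: unquoted attribute value, unclosed <br>.
        System.out.println(isWellFormedXml("<p class=myclass>Paragraph<br>Next line</p>"));
        // XHTML style: quoted attribute value, self-closed <br/>.
        System.out.println(isWellFormedXml("<p class=\"myclass\">Paragraph<br/>Next line</p>"));
    }
}
```

The first call is rejected for the same reason as the page in the question: an attribute value without an opening quote.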
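Besides, the web is a wild place, so real-world content is unlikely to be well-formed. That is why you need a catch-all approach: even when a page looks well-formed, run it through an HTML tidier first, such as ... or ...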
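As for the hang itself: the stack trace in the question shows the parser blocked in HttpURLConnection.getInputStream(), called from XMLEntityManager.startDTDEntity. In other words, it is waiting on a download of the DTD named in the page's DOCTYPE, not on the page itself. The sketch below assumes the JDK's built-in Xerces-based parser, which recognizes the load-external-dtd feature, and input that is already well-formed XHTML. The class and method names are hypothetical. It does not make malformed HTML parseable; it only avoids the DTD fetch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class NoExternalDtd {

    // Xerces feature; assumed to be supported by the parser in use.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Hypothetical helper: parse without downloading the DOCTYPE's DTD. */
    static Document parseWithoutDtdFetch(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Throws ParserConfigurationException if the parser does not know this feature.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(in);
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
              + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
              + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
              + "<body><p class=\"myclass\">Paragraph<br/>Next line</p></body></html>";
        Document doc = parseWithoutDtdFetch(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(doc.getDocumentElement().getLocalName());
    }
}
```

One trade-off: with the DTD skipped, named entities that the DTD would have declared (for example &nbsp;) are no longer defined, so pages that use them will still fail to parse.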