Java XML parser hangs when built from an HTTP input stream

java html xml parsing

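I am trying to open an HTTP connection to a website and parse the HTML into an org.w3c.dom.Document object. I can open the HTTP connection and print the web page to the console, but if I pass the InputStream object to the XML parser, it hangs for about a minute and then outputs this error: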

[Fatal Error] :108:55: Open quote is expected for attribute "{1}" associated with an  element type  "onload".

Code:

private static Document getInputStream(String url) throws IOException, SAXException, ParserConfigurationException
{
  System.out.println(url);
  URL webUrl = new URL(url);
  URLConnection connection = webUrl.openConnection();
  connection.setConnectTimeout(60 * 1000);
  connection.setReadTimeout(60 * 1000);

  InputStream stream = connection.getInputStream();

  DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
  domFactory.setNamespaceAware(true);
  DocumentBuilder builder = domFactory.newDocumentBuilder();
  Document doc = builder.parse(stream); // This line is hanging
  return doc;
}

Stack trace while suspended:

Thread [main] (Suspended)   
    SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]    
    SocketInputStream.read(byte[], int, int) line: not available    
    BufferedInputStream.fill() line: not available  
    BufferedInputStream.read1(byte[], int, int) line: not available 
    BufferedInputStream.read(byte[], int, int) line: not available  
    HttpClient.parseHTTPHeader(MessageHeader, ProgressSource, HttpURLConnection) line: not available    
    HttpClient.parseHTTP(MessageHeader, ProgressSource, HttpURLConnection) line: not available  
    HttpURLConnection.getInputStream() line: not available  
    XMLEntityManager.setupCurrentEntity(String, XMLInputSource, boolean, boolean) line: not available   
    XMLEntityManager.startEntity(String, XMLInputSource, boolean, boolean) line: not available  
    XMLEntityManager.startDTDEntity(XMLInputSource) line: not available 
    XMLDTDScannerImpl.setInputSource(XMLInputSource) line: not available    
    XMLDocumentScannerImpl$DTDDriver.dispatch(boolean) line: not available  
    XMLDocumentScannerImpl$DTDDriver.next() line: not available 
    XMLDocumentScannerImpl$PrologDriver.next() line: not available  
    XMLNSDocumentScannerImpl(XMLDocumentScannerImpl).next() line: not available 
    XMLNSDocumentScannerImpl.next() line: not available 
    XMLNSDocumentScannerImpl(XMLDocumentFragmentScannerImpl).scanDocument(boolean) line: not available  
    XIncludeAwareParserConfiguration(XML11Configuration).parse(boolean) line: not available 
    XIncludeAwareParserConfiguration(XML11Configuration).parse(XMLInputSource) line: not available  
    DOMParser(XMLParser).parse(XMLInputSource) line: not available  
    DOMParser.parse(InputSource) line: not available    
    DocumentBuilderImpl.parse(InputSource) line: not available  
    DocumentBuilderImpl(DocumentBuilder).parse(InputStream) line: not available 
    MSCommunicator.getInputStream(String) line: 45  
    MSCommunicator.getGamePageFromForum(int, int, int) line: 70 
    MSCommunicator.getGamePageFromForum(int, int) line: 57  
    Game.<init>(int, int) line: 21  
    MSCommunicator.main(String[]) line: 26

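You can't expect HTML to parse as an XML DOM tree. It is not necessarily valid XML. You will probably need to clean it up first. See the answers to this question: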

Even if the HTML page you get is correct, well-formed HTML, it may not be well-formed XML. For example, this is valid in HTML4:

<p class=myclass>Paragraph<br>Next line</p>

whereas in XML (XHTML) this is the form that is considered valid:

<p class="myclass">Paragraph<br/>Next line</p>

Note the closed <br/> tag and the quotes around the class attribute of the <p> tag.
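The difference between the two snippets above can be checked directly with the same DocumentBuilder API used in the question. This is only a minimal sketch: the class name WellFormedCheck and the helper isWellFormedXml are made up for illustration and are not part of the original code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WellFormedCheck {

    // Hypothetical helper: returns true if the markup parses as well-formed XML.
    static boolean isWellFormedXml(String markup)
            throws ParserConfigurationException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser reports a fatal error for markup that is not well-formed XML.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // HTML4 style: unquoted attribute value and unclosed <br>.
        System.out.println(isWellFormedXml("<p class=myclass>Paragraph<br>Next line</p>"));      // false
        // XHTML style: quoted attribute value and self-closed <br/>.
        System.out.println(isWellFormedXml("<p class=\"myclass\">Paragraph<br/>Next line</p>")); // true
    }
}
```

The default error handler also prints a "[Fatal Error]" line to standard error for the first snippet, in the same format as the message quoted in the question.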

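Besides, the Internet is a crazy place, so the content is unlikely to be well-formed. That is why you need to clean up the HTML, even if it looks well-formed, with an HTML tidier such as ... or ...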
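As for the hang itself: in the stack trace from the question, the blocking socketRead0 call sits underneath XMLEntityManager.startDTDEntity and HttpURLConnection.getInputStream. That suggests the parser is waiting on a second HTTP request for the external DTD named in the page's DOCTYPE, before it ever reaches the markup error on line 108. Below is a minimal sketch that uses the standard EntityResolver hook to answer every external entity lookup with an empty document, so nothing is downloaded. The class name NoDtdFetch and the helper parseWithoutExternalDtd are made up for illustration, and the approach assumes the page does not rely on entities declared in the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NoDtdFetch {

    // Hypothetical helper: parses the source without fetching any external DTD or entity.
    static Document parseWithoutExternalDtd(InputSource source) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Resolve every external entity (including the DOCTYPE's DTD) to empty content
        // instead of letting the parser open a network connection for it.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(source);
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body><p>Hi</p></body></html>";
        Document doc = parseWithoutExternalDtd(new InputSource(new StringReader(xhtml)));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

This only removes the wait. A page that is not well-formed XML will still fail with the same fatal error, and because the DTD is replaced by an empty one, named entities that it would have declared (for example &nbsp;) will be reported as undeclared.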