How to find the offending line when using XmlSlurper on HTML

html groovy xerces xmlslurper

I am using XmlSlurper to parse a dirty HTML page and get the following error:

ERROR org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        ...
[Fatal Error] :1157:22: Element type "scr" must be followed by either attribute specifications, ">" or "/>".

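Now, I have the HTML that I feed to it, and I print it out before doing so. If I open that output and go to the line mentioned in the error, 1157, there is no "src" in it (although the file contains hundreds of such strings). So I guess something extra gets inserted (perhaps … or something similar) that shifts the line numbers.

Is there a good way to find exactly the offending line or fragment of HTML?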

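Which SAXParser are you using? HTML is not strict XML, so using XmlSlurper with the default parser is likely to keep producing errors.

A cursory Google search for "Groovy html slurper" turned up a suitable SAX parser.

Give it a whirl and see whether it parses the dirty page.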
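The comments on this question mention TagSoup, a SAX parser that tolerates malformed HTML. Assuming that is the kind of parser meant above, a minimal sketch of plugging it into XmlSlurper might look like the following. The Maven coordinates, the version and the sample markup are assumptions made for illustration; none of them come from the question.

```groovy
// Assumption: TagSoup 1.2.1 is fetched from Maven Central by Grape.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
// Groovy 3+; on Groovy 2.x remove this import (groovy.util.XmlSlurper is imported by default).
import groovy.xml.XmlSlurper

// Deliberately sloppy markup: neither <p> nor <br> is ever closed.
def dirty = '<html><body><p>first<br><p>second</body></html>'

// XmlSlurper accepts any SAX XMLReader, and TagSoup's Parser is one.
def page = new XmlSlurper(new Parser()).parseText(dirty)

// TagSoup repairs the tree, so both paragraphs are reachable through GPath.
page.body.p.each { println it.text() }
```

Because the parser repairs the markup instead of rejecting it, this avoids the SAXParseException altogether, but it does not tell you which line was broken.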

You can add an attribute holding the line number (called _srmLineNum in the code below) to every element and then use that attribute.

import org.xml.sax.Attributes
import org.xml.sax.Locator
import org.xml.sax.SAXException
import org.xml.sax.ext.Attributes2Impl
import javax.xml.parsers.ParserConfigurationException
// On Groovy 3+ also add: import groovy.xml.XmlSlurper

class MySlurper extends XmlSlurper {
    public static final String LINE_NUM_ATTR = "_srmLineNum"
    Locator locator

    public MySlurper() throws ParserConfigurationException, SAXException {
        super()
    }

    // The parser hands over a Locator that tracks the current position in the input.
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator
    }

    // Copy the element's attributes and append the current line number as one more attribute.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
        Attributes2Impl newAttrs = new Attributes2Impl(attrs)
        newAttrs.addAttribute(uri, LINE_NUM_ATTR, LINE_NUM_ATTR, "CDATA", "" + locator.getLineNumber())
        super.startElement(uri, localName, qName, newAttrs)
    }
}

def text = '''
<root>
  <a>one!</a>
  <a>two!</a>
</root>'''

def root = new MySlurper().parseText(text)

// Print the line number recorded for each <a> element.
root.a.each { println it.@_srmLineNum }

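The code above adds the line number attribute. You could probably also set your own error handler, which can read the line number from the locator.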
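One way to read that suggestion is sketched below. It registers an org.xml.sax.ErrorHandler on XmlSlurper, records the line and column carried by each SAXParseException, and then prints the surrounding lines of the very string that was handed to the parser, so the numbers cannot drift from a separately saved copy. The class CollectingErrorHandler, the helper contextAround and the sample markup are hypothetical names and data introduced only for this illustration.

```groovy
// Groovy 3+; on Groovy 2.x remove this import (groovy.util.XmlSlurper is imported by default).
import groovy.xml.XmlSlurper
import org.xml.sax.ErrorHandler
import org.xml.sax.SAXParseException

// Hypothetical handler: records every problem the parser reports, with its position.
class CollectingErrorHandler implements ErrorHandler {
    List<String> problems = []

    private void record(String level, SAXParseException e) {
        problems << "${level} at line ${e.lineNumber}, column ${e.columnNumber}: ${e.message}".toString()
    }

    void warning(SAXParseException e) { record('warning', e) }
    void error(SAXParseException e) { record('error', e) }
    void fatalError(SAXParseException e) {
        record('fatal', e)
        throw e   // a fatal error ends the parse, so pass it on to the caller
    }
}

// Hypothetical helper: the lines around a 1-based line number, with the reported line marked.
List<String> contextAround(String text, int lineNumber, int radius = 2) {
    def lines = text.readLines()
    if (lineNumber < 1 || lineNumber > lines.size()) {
        return []   // position unknown or outside the text
    }
    int from = Math.max(1, lineNumber - radius)
    int to = Math.min(lines.size(), lineNumber + radius)
    (from..to).collect { n ->
        (n == lineNumber ? '>> ' : '   ') + n + ': ' + lines[n - 1]
    }
}

// Sample input with a malformed tag on its third line.
def dirty = '''<root>
  <a>one!</a>
  <scr"ipt>two!</a>
</root>'''

def handler = new CollectingErrorHandler()
def slurper = new XmlSlurper()
slurper.setErrorHandler(handler)

SAXParseException failure = null
try {
    slurper.parseText(dirty)
} catch (SAXParseException e) {
    failure = e
}

handler.problems.each { println it }
if (failure != null) {
    // Show the context from the same string the parser saw.
    contextAround(dirty, failure.lineNumber).each { println it }
}
```

If the page is fetched and then modified before parsing, pass the final string to both parseText and contextAround; comparing against an earlier printout is one way the reported line can appear not to match.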

The error mentions "scr", yet you say you cannot find "src". Is that a typo, or are you searching the document for the wrong thing?

I was also using TagSoup before I found NekoHTML. I can't remember exactly why, but TagSoup just didn't work. You can see an example of how to use NekoHTML here -.

Thanks, I had already tried TagSoup but got nowhere. Until a few days ago, when something changed in the pages I receive, my code worked fine using XmlSlurper with the default parser. I work around the problems myself with code that runs before XmlSlurper; the trouble is that now I can't find where the problem is...

I accept this, even though it is not an answer to my question. But I gave TagSoup another try, and this time it worked well.