Java: Convert an iterator to a for loop with an index in order to skip objects (Java, HTML Parsing, Jericho HTML Parser)

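I am using Jericho HTML Parser to parse some malformed HTML. In particular, I am trying to get all text nodes, process the text, and then replace it.

I want to skip specific elements during processing. For example, I want to skip all <a> elements, as well as any element that has the attribute class="noProcess". So if a div has class="noProcess", I want to skip that div and all of its children. However, I do want these skipped elements to appear unchanged in the output after processing.

Jericho provides an iterator over all nodes, but I am not sure how to skip complete elements from within the iterator. Here is my code: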

private String doProcessHtml(String html) {
    Source source = new Source(html);
    OutputDocument outputDocument = new OutputDocument(source);

    for (Segment segment : source) {
        if (segment instanceof Tag) {
            Tag tag = (Tag) segment;
            System.out.println("FOUND TAG: " + tag.getName());

            // DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"

        } else if (segment instanceof CharacterReference) {
            CharacterReference characterReference = (CharacterReference) segment;
            System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
        } else {
            System.out.println("FOUND PLAIN TEXT: " + segment.toString());
            outputDocument.replace(segment, doProcessText(segment.toString()));
        }
    }

    return outputDocument.toString();
}

Using the ignoreWhenParsing() method does not work for me, because the parser then simply treats the "ignored" elements as text.

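I was thinking that if I could convert the iterator loop into a for (int i = 0; ...) loop, I could probably skip an element and all of its children by setting i to point at the EndTag and then continuing the loop, but I am not sure.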
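The indexed-loop idea from the question can be sketched without Jericho. This is only an illustration under stated assumptions: the document is represented as a flat list of string tokens (a hypothetical stand-in for Jericho's segments), the markup is well nested, and upper-casing stands in for doProcessText. The names IndexedSkipSketch, matchingEndIndex and process are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndexedSkipSketch {
    static boolean isStartTag(String t) { return t.startsWith("<") && !t.startsWith("</"); }
    static boolean isEndTag(String t) { return t.startsWith("</"); }

    // Index of the end tag that closes the start tag at openIndex.
    // Falls back to the last index if the element is never closed.
    static int matchingEndIndex(List<String> tokens, int openIndex) {
        int depth = 0;
        for (int j = openIndex; j < tokens.size(); j++) {
            String t = tokens.get(j);
            if (isStartTag(t)) {
                depth++;
            } else if (isEndTag(t) && --depth == 0) {
                return j;
            }
        }
        return tokens.size() - 1;
    }

    // Upper-cases text tokens; tags and everything inside <a>...</a> stay untouched.
    static List<String> process(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("<a>")) {
                i = matchingEndIndex(tokens, i); // jump to the end tag; i++ then moves past it
            } else if (!isStartTag(t) && !isEndTag(t)) {
                out.set(i, t.toUpperCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<p>", "hello", "<a>", "link", "<b>", "bold", "</b>", "</a>", "world", "</p>");
        System.out.println(process(tokens));
    }
}
```

The same jump would work on any random-access list of nodes, which is why an index is more convenient here than a plain iterator.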

This should work:

String skipTag = null; // name of the element being skipped, or null when not skipping
for (Segment segment : source) {
    if (skipTag != null) { // is skipping ON?
        if (segment instanceof EndTag // if EndTag found for the
                && skipTag.equals(((EndTag) segment).getName())) { // tag we're skipping
            skipTag = null; // set skipping OFF
        }
        continue; // continue skipping (or skip the EndTag)
    } else if (segment instanceof Tag) { // is tag?
        Tag tag = (Tag) segment;
        System.out.println("FOUND TAG: " + tag.getName());
        if (tag instanceof StartTag) { // only a start tag can open an element to skip
            if (HTMLElementName.A.equals(tag.getName()) // if <a> ?
                    || "noProcess".equals( // or <tag class="noProcess" ..> ?
                            ((StartTag) tag).getAttributeValue("class"))) {
                skipTag = tag.getName(); // set
                continue; // skipping ON
            }
        }
    } // ...
}
// Note: skipping ends at the first end tag with the same name,
// so a nested element of the same name ends it early.
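One limitation of tracking only the tag name is that skipping stops at the first end tag with that name, so a skipped div containing another div would end too early. A possible refinement is a depth counter. The sketch below illustrates that idea on plain string tokens rather than Jericho segments; the token format, the class NestedSkipSketch and its helpers are assumptions made for this example, and void elements such as <br> are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NestedSkipSketch {
    // Element name of a tag token such as "<div class=\"x\">" or "</div>"; null for text.
    static String tagName(String token) {
        if (!token.startsWith("<")) return null;
        String body = token.replaceAll("[</>]", "").trim();
        return body.split("\\s+")[0].toLowerCase();
    }

    static boolean shouldSkip(String startTag) {
        return "a".equals(tagName(startTag)) || startTag.contains("class=\"noProcess\"");
    }

    // Upper-cases text outside skipped elements; skipped elements are copied unchanged.
    static List<String> process(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String skipTag = null; // name of the element being skipped
        int depth = 0;         // open elements named skipTag, including the outermost
        for (String token : tokens) {
            String name = tagName(token);
            boolean isEnd = token.startsWith("</");
            if (skipTag != null) {
                if (skipTag.equals(name)) {
                    depth += isEnd ? -1 : 1;
                    if (depth == 0) skipTag = null; // outermost element closed
                }
                out.add(token);
                continue;
            }
            if (name != null && !isEnd && shouldSkip(token)) {
                skipTag = name;
                depth = 1;
                out.add(token);
            } else if (name == null) {
                out.add(token.toUpperCase());
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList(
                "<div class=\"noProcess\">", "a", "<div>", "b", "</div>", "c", "</div>", "d")));
    }
}
```

In the example, the text "c" sits after the inner </div> but is still left alone, because the counter only reaches zero at the outer </div>.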

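I think you might want to consider redesigning how your segments are built. Is there a way to parse the HTML so that each segment is a parent element containing a nested list of its child elements? That way you could do something like this: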
for (Segment segment : source) {
    if (segment instanceof Tag) {
        Tag tag = (Tag) segment;
        System.out.println("FOUND TAG: " + tag.getName());

        // DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
        continue;

    } else if (segment instanceof CharacterReference) {
        CharacterReference characterReference = (CharacterReference) segment;
        System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
        // childNodes() is a hypothetical accessor for the nested child list described above
        for (Segment child : segment.childNodes()) {
            // Use recursion to process child elements.
            // You will want to put your for loop in a separate method so it can be called recursively.
        }
    } else {
        System.out.println("FOUND PLAIN TEXT: " + segment.toString());
        outputDocument.replace(segment, doProcessText(segment.toString()));
    }
}

Without more code to examine, it is hard to tell whether restructuring the segment elements is feasible or worth the effort.
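To make the nested-structure suggestion concrete, here is a minimal sketch of the recursion it describes. It assumes a hypothetical Node type with a child list, which is not a Jericho class; whether Jericho's own element hierarchy can be used this way is not shown here. Upper-casing stands in for doProcessText, and render omits attributes to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TreeSkipSketch {
    // Hypothetical stand-in node: an element with children, or a text leaf.
    static class Node {
        final String name;     // element name, or null for a text leaf
        final String cssClass; // value of the class attribute, or null
        String text;           // text content of a leaf
        final List<Node> children = new ArrayList<>();

        Node(String name, String cssClass, Node... kids) {
            this.name = name;
            this.cssClass = cssClass;
            children.addAll(Arrays.asList(kids));
        }

        static Node text(String t) {
            Node n = new Node(null, null);
            n.text = t;
            return n;
        }
    }

    // Processes text leaves recursively; returning early skips a whole subtree.
    static void process(Node node) {
        if (node.name == null) {
            node.text = node.text.toUpperCase();
            return;
        }
        if (node.name.equals("a") || "noProcess".equals(node.cssClass)) {
            return; // children are never visited, so they stay unchanged
        }
        for (Node child : node.children) {
            process(child);
        }
    }

    // Serializes the tree back to markup (attributes omitted for brevity).
    static String render(Node node) {
        if (node.name == null) return node.text;
        StringBuilder sb = new StringBuilder("<" + node.name + ">");
        for (Node child : node.children) sb.append(render(child));
        return sb.append("</").append(node.name).append(">").toString();
    }

    public static void main(String[] args) {
        Node root = new Node("p", null,
                Node.text("hi"),
                new Node("a", null, Node.text("link")),
                Node.text("bye"));
        process(root);
        System.out.println(render(root));
    }
}
```

With a tree, skipping an element needs no bookkeeping at all: not descending into it is enough.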
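
I found a working solution by using the getEnd() method of the tag's Element object. The idea is to skip every segment that begins before a position you set. So you find the end position of the element you want to exclude, and you do not process anything else that starts before that position: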
// Requires java.util.Arrays, java.util.List and net.htmlparser.jericho.*
final List<String> excludeTags = Arrays.asList("head", "script", "a");
final List<String> excludeClasses = Arrays.asList("noProcess");

Source.LegacyIteratorCompatabilityMode = true;
Source source = new Source(htmlToProcess);
OutputDocument outputDocument = new OutputDocument(source);

int skipToPos = 0; // segments that begin before this offset are left untouched
for (Segment segment : source) {
    if (segment.getBegin() >= skipToPos) {
        if (segment instanceof Tag) {
            // only a start tag opens an element, so end tags need no check
            if (segment instanceof StartTag) {
                StartTag tag = (StartTag) segment;
                Element element = tag.getElement();

                // check excludeTags
                if (excludeTags.contains(tag.getName().toLowerCase())) {
                    skipToPos = element.getEnd();
                }

                // check excludeClasses (compared as written; lower-casing would never match "noProcess")
                String classes = element.getAttributeValue("class");
                if (classes != null) {
                    for (String theClass : classes.trim().split("\\s+")) {
                        if (excludeClasses.contains(theClass)) {
                            skipToPos = element.getEnd();
                        }
                    }
                }
            }
        } else if (segment instanceof CharacterReference) {
            // character references are left as they are (branch kept for future use)
        } else {
            outputDocument.replace(segment, doProcessText(segment.toString()));
        }
    }
}

return outputDocument.toString();
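The position-based skip can be tried in isolation from the parser. The sketch below is an assumption-laden illustration: Span is a hypothetical stand-in for a segment with begin and end offsets, spans are supplied by hand in document order, and upper-casing stands in for doProcessText (it keeps the text length unchanged for this ASCII input, so the offsets stay valid).

```java
import java.util.Arrays;
import java.util.List;

class PositionSkipSketch {
    // Hypothetical stand-in for a parsed segment: the range [begin, end) of the source.
    static class Span {
        final int begin;
        final int end;
        final boolean excluded; // true for an element that must be skipped entirely

        Span(int begin, int end, boolean excluded) {
            this.begin = begin;
            this.end = end;
            this.excluded = excluded;
        }
    }

    // Spans must be in document order. Non-excluded spans are treated as text
    // and upper-cased unless they begin before skipToPos.
    static String process(String source, List<Span> spans) {
        StringBuilder out = new StringBuilder(source);
        int skipToPos = 0;
        for (Span s : spans) {
            if (s.begin < skipToPos) {
                continue; // still inside an excluded element
            }
            if (s.excluded) {
                skipToPos = s.end; // ignore everything up to the element's end
            } else {
                out.replace(s.begin, s.end, source.substring(s.begin, s.end).toUpperCase());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "ab<a>cd</a>ef";
        List<Span> spans = Arrays.asList(
                new Span(0, 2, false),   // "ab"
                new Span(2, 11, true),   // the whole <a>cd</a> element
                new Span(5, 7, false),   // "cd", inside the excluded element
                new Span(11, 13, false)  // "ef"
        );
        System.out.println(process(source, spans)); // prints "AB<a>cd</a>EF"
    }
}
```

Because the check compares offsets rather than tag names, nested elements of the same name need no special handling.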
