Java: extracting tagged entities from inside a <p> element

My dataset has the following structure:

<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>

Neither of the two approaches I tried worked.

The following works, but it is inelegant:

            // people
            Elements contents_person = doc.getElementsByTag("p").select("PERSON");

            for (Element content : contents_person)
            {
                // print the text inside each <PERSON> tag
                System.out.println(content.text());
            }

            // places
            Elements contents_place = doc.getElementsByTag("p").select("LOCATION");

            for (Element content : contents_place)
            {
                // print the text inside each <LOCATION> tag
                System.out.println(content.text());
            }

            // things
            Elements contents_things = doc.getElementsByTag("p").select("ORGANIZATION");

            for (Element content : contents_things)
            {
                // print the text inside each <ORGANIZATION> tag
                System.out.println(content.text());
            }

You can simply use a CSS selector:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Foo {
    public static void main(String... args) {
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
        Document doc = Jsoup.parse(xml);

        // "p > X" matches X elements that are direct children of a <p>;
        // the comma combines both selectors into a single query
        for (Element e: doc.select("p > ORGANIZATION, p > PERSON")) {
            System.out.printf("-> %s: %s\n", e.tagName(), e.text());
        }
    }
}
Edit: if you want to strip these tags but keep their content, you can replace each element with its text content while iterating, like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class Foo {
    public static void main(String... args) {
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
        Document doc = Jsoup.parse(xml);

        for (Element e: doc.select("p > ORGANIZATION, p > PERSON")) {
            System.out.printf("-> %s: %s\n", e.tagName(), e.text());
            // Swap the element for a plain text node holding its text.
            // Single-argument constructor: jsoup 1.11+; older releases
            // use new TextNode(text, baseUri) instead.
            e.replaceWith(new TextNode(e.text()));
        }

        System.out.println("\nFiltered out:\n" + doc.select("p").html());
    }
}

The first example (selector only) prints:

-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward

The second example (selector plus replacement) prints:

-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward

Filtered out:
The Peter Hall Company's production of ''Blithe Spirit,'' directed by Thea Sharrock, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for Penelope Keith's startlingly brisk and no-nonsense interpretation of the madcap medium Madame Arcati, Ms. Sharrock's take on Coward's 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.
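
The output above lists entities in document order. If one list per entity type is wanted instead, as in the question's three loops, the text can be grouped by tag name in a single pass. The sketch below is only an illustration under stated assumptions: it uses the JDK's built-in XML parser instead of jsoup so that it runs without extra dependencies, which only works when each fragment is well-formed XML (true for the sample, but not something jsoup requires). The class name `EntityGrouper`, the method `groupByTag`, and the abbreviated sample string are hypothetical. Unlike jsoup's HTML parser, the XML parser keeps the original tag case, so the keys are `PERSON` rather than `person`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original answer.
final class EntityGrouper {

    /**
     * Groups the text of every element that is a direct child of the root
     * (here: the <p>) by its tag name, keeping first-seen order of the tags.
     * Assumes the input is a single well-formed XML fragment.
     */
    static Map<String, List<String>> groupByTag(String xml) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        Map<String, List<String>> entities = new LinkedHashMap<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip the plain text between the tagged entities
            }
            entities.computeIfAbsent(child.getNodeName(), k -> new ArrayList<>())
                    .add(child.getTextContent().trim());
        }
        return entities;
    }

    public static void main(String[] args) {
        // Abbreviated version of the sample paragraph, for illustration only.
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of "
                + "''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>. Most notable for "
                + "<PERSON>Penelope Keith</PERSON>'s interpretation.</p>";
        groupByTag(xml).forEach((tag, names) -> System.out.println(tag + ": " + names));
    }
}
```

With the abbreviated sample, `main` prints `ORGANIZATION: [Peter Hall Company]` followed by `PERSON: [Thea Sharrock, Penelope Keith]`. The same grouping idea should carry over to jsoup by iterating the result of `doc.select(...)` and keying the map on `e.tagName()`.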