Java: extracting tagged entities from inside a <p> element
My dataset has the following structure:
<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>
Neither of these two approaches worked. The following does work, but it is inelegant:
// people
Elements contents_person = doc.getElementsByTag("p").select("PERSON");
for (Element content : contents_person) {
    //String PERSON = content.attr("PERSON");
    String linkText = content.text();
    // print
    //System.out.println(PERSON);
    System.out.println(linkText);
}

// places
Elements contents_place = doc.getElementsByTag("p").select("LOCATION");
for (Element content : contents_place) {
    //String PERSON = content.attr("PERSON");
    String linkText = content.text();
    // print
    //System.out.println(PERSON);
    System.out.println(linkText);
}

// things
Elements contents_things = doc.getElementsByTag("p").select("ORGANIZATION");
for (Element content : contents_things) {
    //String PERSON = content.attr("PERSON");
    String linkText = content.text();
    // print
    //System.out.println(PERSON);
    System.out.println(linkText);
}
You can simply use a CSS selector:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Foo {
    public static void main(String... args) {
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
        Document doc = Jsoup.parse(xml);
        // One selector matches every entity tag that is a direct child of <p>.
        for (Element e : doc.select("p > ORGANIZATION, p > PERSON")) {
            System.out.printf("-> %s: %s\n", e.tagName(), e.text());
        }
    }
}
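If the entities need to be kept apart by type rather than just printed, the same selector can feed a map. The following is a minimal sketch, assuming jsoup is on the classpath; the class name `EntityGrouper`, the helper `groupByTag`, and the shortened sample string are illustrative and not part of the original answer. `LOCATION` is added to the selector because the question's code also looks for it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EntityGrouper {

    // Hypothetical helper: collects the text of each entity element, keyed by its tag name.
    static Map<String, List<String>> groupByTag(String html) {
        Document doc = Jsoup.parse(html);
        // LinkedHashMap keeps the tags in the order they first appear in the text.
        Map<String, List<String>> entities = new LinkedHashMap<>();
        for (Element e : doc.select("p > ORGANIZATION, p > PERSON, p > LOCATION")) {
            entities.computeIfAbsent(e.tagName(), k -> new ArrayList<>()).add(e.text());
        }
        return entities;
    }

    public static void main(String... args) {
        // Shortened version of the sample paragraph, for illustration only.
        String html = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production, "
                + "directed by <PERSON>Thea Sharrock</PERSON>, with <PERSON>Penelope Keith</PERSON>.</p>";
        groupByTag(html).forEach((tag, names) -> System.out.println(tag + ": " + names));
    }
}
```

As in the output shown in this thread, jsoup's HTML parser reports tag names in lowercase, so the map keys are `organization`, `person`, and `location` rather than the uppercase spellings used in the data.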
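Edit: if you want to filter out these tags but keep their content, you can replace each element with its text content while iterating, like this: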
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class Foo {
    public static void main(String... args) {
        String xml = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production of ''Blithe Spirit,'' directed by <PERSON>Thea Sharrock</PERSON>, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for <PERSON>Penelope Keith</PERSON>'s startlingly brisk and no-nonsense interpretation of the madcap medium <ORGANIZATION>Madame Arcati</ORGANIZATION>, Ms. <PERSON>Sharrock</PERSON>'s take on <PERSON>Coward</PERSON>'s 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.</p>";
        Document doc = Jsoup.parse(xml);
        for (Element e : doc.select("p > ORGANIZATION, p > PERSON")) {
            System.out.printf("-> %s: %s\n", e.tagName(), e.text());
            // Swap the element for a plain text node holding its content.
            // Recent jsoup releases take only the text; older ones used new TextNode(e.text(), "").
            e.replaceWith(new TextNode(e.text()));
        }
        System.out.println("\nFiltered out:\n" + doc.select("p").html());
    }
}
The first example prints:
-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward
The second example prints:
-> organization: Peter Hall Company
-> person: Thea Sharrock
-> person: Penelope Keith
-> organization: Madame Arcati
-> person: Sharrock
-> person: Coward
Filtered out:
The Peter Hall Company's production of ''Blithe Spirit,'' directed by Thea Sharrock, is one of those attractively and unimaginatively upholstered productions of brittle classics that become must-have middlebrow tickets every few years. Most notable for Penelope Keith's startlingly brisk and no-nonsense interpretation of the madcap medium Madame Arcati, Ms. Sharrock's take on Coward's 1941 comedy of a man visited by his dead wife's impish spirit delivers bright badinage, dazed double takes and marital melees at the same efficient clip.
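The tags can also be stripped without building a `TextNode` by hand: jsoup's `Elements.unwrap()` removes the matched elements and moves their children up into the parent. The following is a minimal sketch of that variant, assuming jsoup is on the classpath; the class name `EntityStripper`, the helper `stripEntityTags`, and the shortened sample string are illustrative and not part of the original answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class EntityStripper {

    // Hypothetical helper: drops the entity tags but leaves their text in place inside <p>.
    static String stripEntityTags(String html) {
        Document doc = Jsoup.parse(html);
        // unwrap() removes each matched element and keeps its child nodes where the element was.
        doc.select("p > ORGANIZATION, p > PERSON, p > LOCATION").unwrap();
        return doc.select("p").html();
    }

    public static void main(String... args) {
        // Shortened version of the sample paragraph, for illustration only.
        String html = "<p>The <ORGANIZATION>Peter Hall Company</ORGANIZATION>'s production, "
                + "directed by <PERSON>Thea Sharrock</PERSON>.</p>";
        System.out.println(stripEntityTags(html));
    }
}
```

This only matters when other markup inside the paragraph should survive. If plain text is all that is needed, calling `text()` on the `<p>` element already returns the content with every tag dropped.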