Java jsoup: parsing the data of the p tags between each h2 tag


For the past 3 days I have been trying to parse some information with jsoup in Java. Here is my code:

Document document = Jsoup.connect(urlofpage).get();
Elements links = document.select(".contentBox");
for (Element link : links) {
    // String name = link.text();
    String title = link.select("h2").text();
    int h2length = link.select("h2").size();

    for (int i = 0; i <= h2length - 1; i++) {
        String s = link.select("h2").get(i).text();
        boolean desc1 = Pattern.compile("What is").matcher(s).find();
        boolean desc2 = Pattern.compile("Uses for").matcher(s).find();

        if (desc1 == true || desc2 == true) {
            String descritop = "";
            int plength = link.select("p ~ h2 ~ p").size() - link.select("h2 ~ p").size();
            System.out.println(h2length);
            String ssv = link.select("h2 ~ p").get(1).text();
        }
    }
}
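Is there any solution?


The URL being parsed is

My idea is simple: take the first p element after an h2 element and add it to an ArrayList, then check whether the next element is also a p and add it as well. For example: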

ArrayList<ArrayList<String>> textInsidePList = new ArrayList<ArrayList<String>>();
for (Element link : links) {
    // every p that has a preceding h2 sibling
    Elements headings2 = link.select("h2 ~ p");
    for (int i = 0; i < headings2.size(); i++) {
        ArrayList<String> textInsideP = new ArrayList<String>();
        textInsideP.add(headings2.get(i).text());
        Element nextPar = headings2.get(i).nextElementSibling();
        // nextElementSibling() returns null after the last element,
        // and strings must be compared with equals(), not ==
        if (nextPar != null && "p".equals(nextPar.nodeName())) {
            textInsideP.add(nextPar.text());
        }
        textInsidePList.add(textInsideP);
    }
}
Output in the console:

    h2:
    first h2
    first h2 content 1
    first h2 content 2
    first h2 content 3
    first h2 content 4
    h2:
    second h2
    second h2 content 1
    second h2 content 2
h2:
p:Vitamins A, D, and E topical (for the skin) is a skin protectant. It works by moisturizing and sealing the skin, and aids in skin healing.
p:This medication is used to treat diaper rash, dry or chafed skin, and minor cuts or burns.
p:Vitamins A, D, and E may also be used for purposes not listed in this medication guide.
h2:
p:You should not use this medication if your child is allergic to it. Do not apply vitamins A, D, and E topical without a rubber glove or finger cot if you are allergic this medication.
p:Ask a doctor or pharmacist if it is safe for you to use this medication on your child if the child is allergic to any medicines or skin products, including soaps, oils, lotions, or creams.
p:Stop using the medication and call your doctor at once if your child has a serious side effect such as warmth, redness, oozing, or severe irritation where the medicine is applied.
p:Keep the baby's diaper area as dry as possible. Change wet or soiled diapers immediately to keep wetness and bacteria from irritating the baby's skin. Always put on a new diaper when the baby first wakes up in the morning, and also just before putting the baby to bed each night.
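Recursion is used because we do not know how many p nodes we will meet before the next h2. An ArrayList is used instead of an array because elements can be added to it dynamically, without fixing the size in advance.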
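The same grouping can also be written as a plain loop. The sketch below is deliberately independent of jsoup: each sibling is modelled as a `{tagName, text}` pair, and the class name `SiblingGrouping` and the sample data are made up for illustration. It only shows how a new inner list is opened at every h2 and grows while p entries follow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SiblingGrouping {

    // Each sibling is modelled as {tagName, text}; this stands in for a jsoup Element.
    static List<List<String>> groupParagraphs(List<String[]> siblings) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = null;             // paragraphs of the h2 currently being read
        for (String[] node : siblings) {
            String tag = node[0];
            if ("h2".equals(tag)) {
                current = new ArrayList<>();     // every h2 opens a new inner list
                groups.add(current);
            } else if ("p".equals(tag) && current != null) {
                current.add(node[1]);            // the list grows as paragraphs arrive
            }
            // any other tag, or a p before the first h2, is skipped
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> siblings = Arrays.asList(
                new String[] {"h2", "first h2"},
                new String[] {"p", "first h2 content 1"},
                new String[] {"p", "first h2 content 2"},
                new String[] {"h2", "second h2"},
                new String[] {"p", "second h2 content 1"});
        // prints [[first h2 content 1, first h2 content 2], [second h2 content 1]]
        System.out.println(groupParagraphs(siblings));
    }
}
```

Because the inner lists are created on demand, the number of h2 sections and the number of paragraphs per section never have to be known up front.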

Edit 2, because the question was changed:

import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The original snippet omitted the class declaration; the name Main is a placeholder.
public class Main {

    public static void main(String[] args) throws IOException {
        // The original post does not give the URL; here it is taken from the command line.
        String pathToYoursCusromUrl = args[0];
        Document document = Jsoup.connect(pathToYoursCusromUrl).get();
        Elements links = document.select(".contentBox");
        for (Element link : links) {
            /* creating first order ArrayList */
            ArrayList<ArrayList<String>> textInsidePList = new ArrayList<ArrayList<String>>();
            // search inside the current .contentBox, not the whole document
            Elements headings2 = link.select("h2");
            for (Element heading2 : headings2) {
                /* creating second order ArrayList and adding data */
                ArrayList<String> textInsideP = new ArrayList<String>();
                parsingRecursion(heading2, textInsideP);
                textInsidePList.add(textInsideP);
            }

            /* iterating through ArrayList */
            for (ArrayList<String> firstH2 : textInsidePList) {
                System.out.println("h2:");
                for (String parsInsideH2 : firstH2) {
                    System.out.println("p:" + parsInsideH2);
                }
            }
        }
    }

    /* recursive function: collects the text of every p after heading2 until the next h2 */
    private static void parsingRecursion(Element heading2, ArrayList<String> textInsideP) {
        Element nextPar = heading2.nextElementSibling();
        // stop at the end of the siblings or at the next h2; compare strings with equals(), not ==
        if (nextPar == null || "h2".equals(nextPar.nodeName())) {
            return;
        }
        if ("p".equals(nextPar.nodeName())) {
            textInsideP.add(nextPar.text());
        }
        // elements that are neither p nor h2 are skipped, but the walk continues past them
        parsingRecursion(nextPar, textInsideP);
    }
}

And so on...
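The question only needs the sections whose h2 contains "What is" or "Uses for". A possible way to combine that filter with the sibling walk, keeping each heading together with its paragraphs, is sketched below. It assumes jsoup is on the classpath; the class name `SectionFilter`, the helper names, and the inline HTML are invented for illustration and are not taken from the page in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SectionFilter {

    // Made-up fragment standing in for the real page.
    static final String HTML =
            "<div class=\"contentBox\">"
            + "<h2>What is X?</h2><p>a1</p><div>ad</div><p>a2</p>"
            + "<h2>Side effects</h2><p>b1</p>"
            + "<h2>Uses for X</h2><p>c1</p>"
            + "</div>";

    // Maps each matching h2 title to the texts of the p elements in its section.
    static Map<String, List<String>> sections(Document doc, String... keywords) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Element h2 : doc.select(".contentBox h2")) {
            String title = h2.text();
            if (!containsAny(title, keywords)) {
                continue;
            }
            List<String> paragraphs = new ArrayList<>();
            // Walk forward until the next h2 (or the last sibling), keeping only p elements.
            for (Element el = h2.nextElementSibling();
                    el != null && !"h2".equals(el.tagName());
                    el = el.nextElementSibling()) {
                if ("p".equals(el.tagName())) {
                    paragraphs.add(el.text());
                }
            }
            result.put(title, paragraphs);
        }
        return result;
    }

    static boolean containsAny(String text, String... keywords) {
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // prints {What is X?=[a1, a2], Uses for X=[c1]}
        System.out.println(sections(doc, "What is", "Uses for"));
    }
}
```

A `LinkedHashMap` keeps the sections in page order. Two headings with identical text would overwrite each other in this sketch, so a list of pairs may be a better fit if titles can repeat.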

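Hello, thank you for your answer. There is a problem, though, when you write Elements headings2 = link.select("h2 ~ p"); it will fetch all the p elements in the page, so the for loop will simply add all the p data to one array; it will not get the data between two h2 tags. Thanks for the answer anyway, I think nextElementSibling() will solve my problem.

Hmm, really? It has to create a new ArrayList on each new p node after an h2: ArrayList<String> textInsideP = new ArrayList<String>(); Maybe I really do not know the Jsoup selectors...

Yes, your understanding is correct, it will select the p nodes after an h2, but... it has to stop when the next h2 comes, and then repeat the same operation again.

It creates a new ArrayList on each p node after an h2. This part of the code, textInsidePList.add(textInsideP); appends that ArrayList to another ArrayList. When there is no p, a new ArrayList is created. We end up with an ArrayList of ArrayLists.

Can you confirm that you tried the code and that all the p elements ended up in a single ArrayList inside the outer ArrayList?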
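The point debated in these comments can be checked offline. The sketch below assumes jsoup is on the classpath; the class name `SelectorCheck` and the inline HTML are made up. It compares what the selector `h2 ~ p` matches with what a walk over `nextElementSibling()` returns for a single h2. Unlike the recursive version in the answer, this walk stops at the first sibling that is not a p.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorCheck {

    // Made-up fragment: two h2 sections with two and one paragraphs.
    static final String HTML =
            "<div class=\"contentBox\">"
            + "<h2>first h2</h2><p>a1</p><p>a2</p>"
            + "<h2>second h2</h2><p>b1</p>"
            + "</div>";

    // Number of elements matched by the general sibling selector "h2 ~ p".
    static int countGeneralSiblingMatches(Document doc) {
        return doc.select("h2 ~ p").size();
    }

    // Texts of the p elements directly following the given h2,
    // stopping at the first sibling that is not a p.
    static List<String> paragraphsAfter(Element h2) {
        List<String> texts = new ArrayList<>();
        Element next = h2.nextElementSibling();
        while (next != null && "p".equals(next.tagName())) {
            texts.add(next.text());
            next = next.nextElementSibling();
        }
        return texts;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // "h2 ~ p" matches every p that has an earlier h2 sibling: all three here
        System.out.println(countGeneralSiblingMatches(doc));
        for (Element h2 : doc.select(".contentBox h2")) {
            System.out.println(h2.text() + " -> " + paragraphsAfter(h2));
        }
    }
}
```

On this fragment the selector alone returns all three paragraphs without any grouping, while the sibling walk returns two paragraphs for the first heading and one for the second, which is the per-section split the question asks for.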