Java/JSoup plain-text extraction and storage

java html regex string jsoup

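I am struggling with the following problem.

Suppose I have an HTML file with the following contents:

This gets every individual p tag, but I want the p tags to stay inside their respective divs, and I want to store each div in a String / String array variable.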

You need to traverse the document as body -> div -> p rather than body -> p.

Elements divs = htmlFile.select("body div");
//initialize div map here
for(Element div : divs) {
    Elements paras = div.getElementsByTag("p");
    for(Element para : paras) {
       String text = para.text();
    }
}

While traversing, you can store the text in whatever data structure you need. Hope this helps.
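As a rough sketch of the "store it while traversing" step, the loop above could collect one list of paragraph texts per div. The class name `DivParagraphs`, the helper `paragraphsByDiv`, and the inline sample markup are made up for illustration. Note that `getElementsByTag("p")` also returns paragraphs in nested divs, so deeply nested markup may produce repeated entries.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DivParagraphs {

    // Hypothetical helper: one inner list per div that holds at least one non-empty <p>.
    static List<List<String>> paragraphsByDiv(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> result = new ArrayList<>();

        // body -> div
        for (Element div : doc.select("body div")) {
            List<String> texts = new ArrayList<>();

            // div -> p
            for (Element para : div.getElementsByTag("p")) {
                String text = para.text();
                if (!text.isEmpty()) {
                    texts.add(text);
                }
            }

            // Skip divs without any paragraph text.
            if (!texts.isEmpty()) {
                result.add(texts);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample markup, not the asker's real file.
        String html = "<body>"
                + "<div class=a><p>some text here</p></div>"
                + "<div class=b></div>"
                + "<div class=c><p>even more text here</p><p>and here</p></div>"
                + "</body>";

        // Prints [[some text here], [even more text here, and here]]
        System.out.println(paragraphsByDiv(html));
    }
}
```

A `Map` keyed by the div's class name would work just as well if you need to look a div up later. A list of lists simply keeps document order.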

This puts the div tags that contain p tags into a list of strings.

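import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
  public static void main(String[] args) throws IOException {
    File html = new File("src/main/resources/markup.html");
    Document doc = Jsoup.parse(html, "UTF-8");
    //all div tags wrapping a p tag
    Elements divs = doc.select("div:has(p)");
    //put the outer HTML of each div into a list
    List<String> list = new ArrayList<String>();
    for (Element div : divs) {
      list.add(div.toString());
      System.out.println(div + "\n");
    }
  }
}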

markup.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8" />
  <title>whatever</title>
</head>

<body>
  <div class=nameCouldBeAnything0>
    <p>some text here</p>
  </div>

  <div class=nameCouldBeAnything1></div>

  <div class=nameCouldBeAnything2>
    <p>some more text here</p>
  </div>

  <div class=nameCouldBeAnything3>
    <p>even more text here</p>
    <p>and here</p>
    <p>and here</p>
    <p>and here</p>
    <p>and here</p>
  </div>

  <div class=nameCouldBeAnything4>
    <span>even more text here</span>
  </div>
</body>
</html>


Output

<div class="nameCouldBeAnything0"> 
  <p>some text here</p> 
</div>

<div class="nameCouldBeAnything2"> 
  <p>some more text here</p> 
</div>

<div class="nameCouldBeAnything3"> 
  <p>even more text here</p> 
  <p>and here</p> 
  <p>and here</p> 
  <p>and here</p> 
  <p>and here</p> 
</div>


I was able to solve my dilemma.

Here is the code I used; I hope it helps anyone who needs it.

Thanks to everyone who posted.

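public static ArrayList<String> proc(Document htmlFile)
{
    Elements body = htmlFile.select("body");
    ArrayList<String> HTMLPlainText = new ArrayList<String>();

    // The first entry is the page title.
    HTMLPlainText.add(htmlFile.title());

    for(Iterator<Element> it = body.iterator(); it.hasNext();)
    {
        Element pBody = it.next();
        // Elements with a p tag as a direct child, so the p tags stay grouped
        // under their parent. This selector stands in for the posted
        // getElementsByTag("p") / parents() / empty() calls, which did not compile.
        Elements pTag = pBody.select("*:has(> p)");

        for(int pTagCount = 0; pTagCount < pTag.size(); pTagCount++)
        {
            Element p = pTag.get(pTagCount);
            // Combined text of all p tags (and other children) of this parent.
            String pt = p.text();

            if(pt.length() != 0)
            {
                HTMLPlainText.add(pt);
            }
        }
    }

    return HTMLPlainText;
}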

Note: there may be some syntax errors, since I typed this in by hand.
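Cheers for the reply, I'll let you know how it goes tomorrow!

For some reason, :has(p) does not return only the divs that contain a p. It grabs every div in the HTML file.

Thanks for the suggestion.

It works for me; I've updated the answer with more information. Isn't this the output you asked for?

Thanks for the input, I was able to solve my problem. See my post.

There is an incompatible types error in the second for loop. Should this read Element rather than Elements?

You're right. In the second for loop you need to use Element rather than Elements; it's updated now.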
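One possible explanation for :has(p) matching more divs than expected, assuming the real file wraps its content in outer divs: :has(p) matches any div with a p anywhere beneath it, while :has(> p) only matches divs whose direct child is a p. The class `HasSelectorDemo`, the helper `count`, and the wrapper markup below are invented for illustration; the asker's actual file is not shown in the thread.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class HasSelectorDemo {

    // Hypothetical helper: number of elements matched by a CSS query.
    static int count(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Assumed markup: an outer wrapper div around the content divs.
        String html = "<body><div id=wrapper>"
                + "<div><p>some text here</p></div>"
                + "<div></div>"
                + "<div><span>no paragraph</span></div>"
                + "</div></body>";

        // Matches the wrapper and the first inner div: prints 2
        System.out.println(count(html, "div:has(p)"));

        // Matches only the div whose direct child is a p: prints 1
        System.out.println(count(html, "div:has(> p)"));
    }
}
```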