Splitting jSoup scrape results in Java

I'm using the jSoup library in Java for scraping. My source code works fine; I'd like to ask how to split apart each element I get. Here is my source:
package javaapplication1;

import java.io.IOException;
import java.sql.SQLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class coba {
    public static void main(String[] args) throws SQLException {
        MasukDB db = new MasukDB();
        try {
            Document doc = null;
            for (int page = 1; page < 2; page++) {
                doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                System.out.println("title : " + doc.select(".entry-title>a").text() + "\n");
                System.out.println("link : " + doc.select(".entry-title>a").attr("href") + "\n");
                System.out.println("body : " + doc.select(".entry-content p").text() + "\n");
                System.out.println("date : " + doc.select(".entry-date>a").text() + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In the result, every page of the site comes out as a single line; how do I split it apart? And as for getting each article's link, I think my CSS selector on the link side is still wrong.

Thanks, mate
doc.select(".entry-title>a").text()
This searches the entire document and returns a list of links, from which you are scraping the text nodes all at once. What you most likely want instead is to pull the related data out of each individual article:
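To see why selecting on the whole document flattens everything into one line, here is a small self-contained sketch. The markup is a hypothetical stand-in for the listing page, not the real Hackaday HTML:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectDemo {
    public static void main(String[] args) {
        // Two fake "articles" sharing the same title class (hypothetical markup)
        String html = "<h1 class='entry-title'><a href='/a'>First</a></h1>"
                    + "<h1 class='entry-title'><a href='/b'>Second</a></h1>";
        Document doc = Jsoup.parse(html);

        // Calling text() on the whole match set merges every title into one string
        System.out.println(doc.select(".entry-title > a").text()); // First Second

        // Iterating the Elements keeps each match separate
        for (Element a : doc.select(".entry-title > a")) {
            System.out.println(a.text() + " -> " + a.attr("href"));
        }
    }
}
```

The same idea applied to the real page means iterating one `article` element at a time, as in the loop below.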
import java.text.MessageFormat;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc;
for (int page = 1; page < 2; page++) {
    doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
    // get a list of articles on the page
    Elements articles = doc.select("main#main article");
    // iterate over the article list
    for (Element article : articles) {
        // find the article header, which includes the title and date
        Element header = article.select("header.entry-header").first();
        // find and scrape the title/link from the header
        Element headerTitle = header.select("h1.entry-title > a").first();
        String title = headerTitle.text();
        String link = headerTitle.attr("href");
        // find and scrape the date from the header
        String date = header.select("div.entry-meta > span.entry-date > a").text();
        // find and scrape every paragraph in the article content;
        // you will probably want to refine the logic here further,
        // since there may be paragraphs you don't want to include
        String body = article.select("div.entry-content p").text();
        // view the results
        System.out.println(MessageFormat.format(
                "title={0} link={1} date={2} body={3}",
                title, link, date, body));
    }
}
For more examples of how to scrape data like this, see:
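Since the question also mentioned the link selector looking wrong: `attr("href")` returns the href exactly as written in the HTML, which may be relative. jsoup's `absUrl("href")` resolves it against the document's base URI. A minimal sketch, again with hypothetical stand-in markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlDemo {
    public static void main(String[] args) {
        // The second argument to parse() is the base URI used to resolve
        // relative links (Jsoup.connect(...).get() sets it automatically)
        String html = "<h1 class='entry-title'><a href='/2016/01/01/post/'>Post</a></h1>";
        Document doc = Jsoup.parse(html, "http://hackaday.com/page/1");

        Element link = doc.select("h1.entry-title > a").first();
        System.out.println(link.attr("href"));   // relative, as written in the HTML
        System.out.println(link.absUrl("href")); // resolved to an absolute URL
    }
}
```

So if the scraped links only print a path like `/2016/01/01/post/`, swapping `attr("href")` for `absUrl("href")` in the loop above gives the full article URL.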
Thanks so much mate, your script works great; it's almost the same as my Scrapy setup in Python :D Thanks again