Splitting jSoup scrape results in Java

I'm using the jSoup library in Java for scraping. My source code works fine; I'd like to ask how to split apart each element I get. Here is my source:
package javaapplication1;

import java.io.IOException;
import java.sql.SQLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class coba {
    public static void main(String[] args) throws SQLException {
        MasukDB db = new MasukDB();
        try {
            Document doc = null;
            for (int page = 1; page < 2; page++) {
                doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                System.out.println("title : " + doc.select(".entry-title>a").text() + "\n");
                System.out.println("link : " + doc.select(".entry-title>a").attr("href") + "\n");
                System.out.println("body : " + doc.select(".entry-content p").text() + "\n");
                System.out.println("date : " + doc.select(".entry-date>a").text() + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In the result, every page of the site comes out as a single line; how do I split it apart? And as for getting each article's link, I think my CSS selector on the link side is still wrong.

Thanks, mate
doc.select(".entry-title>a").text()
This searches the entire document and returns a list of links, from which you are scraping the text nodes all at once. What you most likely want instead is to pull the related data out of each individual article:
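To see why selecting on the whole document flattens everything into one line, here is a small self-contained sketch. The markup is a hypothetical stand-in for the listing page, not the real Hackaday HTML:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectDemo {
    public static void main(String[] args) {
        // Two fake "articles" sharing the same title class (hypothetical markup)
        String html = "<h1 class='entry-title'><a href='/a'>First</a></h1>"
                    + "<h1 class='entry-title'><a href='/b'>Second</a></h1>";
        Document doc = Jsoup.parse(html);

        // Calling text() on the whole match set merges every title into one string
        System.out.println(doc.select(".entry-title > a").text()); // First Second

        // Iterating the Elements keeps each match separate
        for (Element a : doc.select(".entry-title > a")) {
            System.out.println(a.text() + " -> " + a.attr("href"));
        }
    }
}
```

The same idea applied to the real page means iterating one `article` element at a time, as in the loop below.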
import java.text.MessageFormat;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc;
for (int page = 1; page < 2; page++) {
    doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
    // get a list of articles on the page
    Elements articles = doc.select("main#main article");
    // iterate over the article list
    for (Element article : articles) {
        // find the article header, which includes the title and date
        Element header = article.select("header.entry-header").first();
        // find and scrape the title/link from the header
        Element headerTitle = header.select("h1.entry-title > a").first();
        String title = headerTitle.text();
        String link = headerTitle.attr("href");
        // find and scrape the date from the header
        String date = header.select("div.entry-meta > span.entry-date > a").text();
        // find and scrape every paragraph in the article content;
        // you will probably want to refine the logic here further,
        // since there may be paragraphs you don't want to include
        String body = article.select("div.entry-content p").text();
        // view the results
        System.out.println(MessageFormat.format(
                "title={0} link={1} date={2} body={3}",
                title, link, date, body));
    }
}
For more examples of how to scrape data like this, see:
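Since the question also mentioned the link selector looking wrong: `attr("href")` returns the href exactly as written in the HTML, which may be relative. jsoup's `absUrl("href")` resolves it against the document's base URI. A minimal sketch, again with hypothetical stand-in markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlDemo {
    public static void main(String[] args) {
        // The second argument to parse() is the base URI used to resolve
        // relative links (Jsoup.connect(...).get() sets it automatically)
        String html = "<h1 class='entry-title'><a href='/2016/01/01/post/'>Post</a></h1>";
        Document doc = Jsoup.parse(html, "http://hackaday.com/page/1");

        Element link = doc.select("h1.entry-title > a").first();
        System.out.println(link.attr("href"));   // relative, as written in the HTML
        System.out.println(link.absUrl("href")); // resolved to an absolute URL
    }
}
```

So if the scraped links only print a path like `/2016/01/01/post/`, swapping `attr("href")` for `absUrl("href")` in the loop above gives the full article URL.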
Thanks so much mate, your script works great; it's almost the same as my Scrapy setup in Python :D Thanks again