Java Jsoup: parse/crawl links that contain a specific string
I want to get only the links that contain a specific string. This is my program:
package com.dcvsolution.crawler.songkhoedotvn;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.dcvsolution.commons.CommonUtil;

public class GetCategories {

    public static void main(String[] args) throws IOException {
        // Bait link.
        String baitLink = "http://songkhoe.vn/";

        // If we prefer the mobile site (the default for "songkhoe.vn" when
        // using Jsoup):
        // Document doc = Jsoup.connect(baitLink).get();

        // If we prefer the full desktop page:
        // trick the site into thinking we are browsing on a PC and arriving from Google.
        Document doc = Jsoup.connect(baitLink)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0")
                .referrer("http://www.google.com").get();

        // Select every anchor that has an href attribute.
        Elements links = doc.select("a[href]");

        // Push data to a set for unique result.
        Set<Element> uniqueSet = new HashSet<Element>(links);

        // Iterate through above set.
        for (Element link : uniqueSet) {
            CommonUtil.printByFoundry("%s", link.attr("abs:href"), CommonUtil.trim(link.text(), 3500));
            // Posts post = new Posts();
            // post.setTitle(link.attr("abs:href"));
            // post.setContent(link.attr("abs:href"));
            // post.setCreateTimeDb(new Date());
            // Session session =
            // HibernateUtil.getSessionFactory().openSession();
            // session.beginTransaction();
            // session.save(post);
            // session.getTransaction().commit();
        }
    }
}
Please help me select only the links whose URL contains the string "/chuyen-muc-".
I tried:
Elements links = doc.select("a[href]:contains(chuyen-muc-)");
But it does not work.

Use this CSS query instead:
a[href*=chuyen-muc-]
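The attempt in the question most likely failed because :contains(...) tests an element's visible text, while [href*=...] tests for a substring of the attribute value. Here is a minimal sketch of the difference. It parses an inline HTML snippet instead of fetching the live site; the sample markup and the class name HrefSubstringDemo are made up for illustration, since the real page structure is not shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HrefSubstringDemo {

    // Hypothetical sample markup (assumption): two category links and one other link.
    static final String HTML =
            "<a href=\"/chuyen-muc-dinh-duong-s1-0.html\">Dinh duong</a>"
          + "<a href=\"/tin-tuc.html\">Tin tuc</a>"
          + "<a href=\"/chuyen-muc-lam-dep-s2-0.html\">Lam dep</a>";

    // Runs a CSS query against the sample markup and returns the absolute hrefs.
    static List<String> select(String cssQuery) {
        // The base URI lets "abs:href" resolve the relative links.
        Document doc = Jsoup.parse(HTML, "http://songkhoe.vn/");
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select(cssQuery)) {
            hrefs.add(link.attr("abs:href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Looks at the link text, so nothing matches in this sample.
        System.out.println(select("a[href]:contains(chuyen-muc-)"));
        // Looks at the href value, so both category links match.
        System.out.println(select("a[href*=chuyen-muc-]"));
    }
}
```

In the original program, the same selector would replace "a[href]" in the doc.select(...) call.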
You can be more specific with a regular expression:
a[href~=chuyen-muc-(dinh|lam)]
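The CSS query above returns only the links whose href contains chuyen-muc-dinh or chuyen-muc-lam.

Could you help me exclude some specific links, for example: http://songkhoe.vn/chuyen-muc-video-cuoi-s27-0.html

An XPath expression is not a CSS query.

@dovy You can use a regular expression. If you get stuck on the regex, please post another question.

@dovy I have updated my answer with a regex example.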
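For the exclusion asked about in the comments, one possible approach is a negative lookahead in the regular expression. The sketch below uses plain java.util.regex on a list of URLs so it runs without jsoup. The class name CategoryLinkFilter and the two extra sample URLs are assumptions; only the video-cuoi URL comes from the comment. jsoup's [href~=regex] selector matches when the pattern is found anywhere in the attribute value, so the same pattern could be tried as a[href~=chuyen-muc-(?!video-cuoi)], though that form is untested here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CategoryLinkFilter {

    // Matches "chuyen-muc-" unless it is immediately followed by "video-cuoi".
    static final Pattern CATEGORY = Pattern.compile("chuyen-muc-(?!video-cuoi)");

    // Keeps only the URLs in which the pattern is found somewhere.
    static List<String> filter(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (CATEGORY.matcher(url).find()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://songkhoe.vn/chuyen-muc-video-cuoi-s27-0.html", // from the comment
                "http://songkhoe.vn/chuyen-muc-dinh-duong-s1-0.html",  // hypothetical
                "http://songkhoe.vn/tin-tuc.html");                    // hypothetical
        // Only the dinh-duong category link should remain.
        System.out.println(filter(urls));
    }
}
```

If the list of links to exclude grows, it may be simpler to collect all chuyen-muc- links first and then drop the unwanted ones in Java code rather than packing every case into one pattern.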