Java: how to extract the paragraphs of a web page that contain specific words such as "bad" or "not worth it"

java web-crawler jsoup html-parsing

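I am trying to build a small web crawler that picks out negative product reviews from a web page. I have written code that searches a page for a specific set of words and reports whether those words are present. However, I need to select the entire content of each review that contains these words. I am using jsoup to fetch the page content. My code is below. Please suggest how I can retrieve the full text of a particular review, and how I can generalize this to any web page to collect negative review data.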

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupWordCount {

    public static void main(String[] args) throws IOException {
        long time = System.currentTimeMillis();

        List<String> contain = new ArrayList<String>();

        contain.add("bad");
        contain.add("horrible");
        contain.add("not satisfied");

        System.out.println("Downloading page...");
        Document doc = Jsoup
                .connect("http://www.amazon.in/Moto-Plus-4th-Gen-Black/product-reviews/B01DDP7GZK/ref=dpx_acr_txt?showViewpoints=1").get();

        // Get the actual text from the page, excluding the HTML
        String text = doc.body().text();

        System.out.println("Analyzing text...");
        // Create BufferedReader so the words can be counted
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] words = line.split("[^A-ZÅÄÖa-zåäö]+");
            List<String> words1 = Arrays.asList(words);

            // removeAll returns true only if at least one keyword was removed from the list
            if (contain.removeAll(words1)) {
                System.out.println("The word is present in the document");
            } else {
                System.out.println("Noooooooo!");
            }
        }

        reader.close();
        time = System.currentTimeMillis() - time;

        System.out.println("Finished in " + time + " ms");
    }

}


I guess you forgot to add the code to your question... :-)
Hi Jones, I have added the code now... :)
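
One possible direction, offered as a minimal sketch and not a definitive answer: the posted code flattens the whole page with `doc.body().text()` and then splits it into single words, so the review boundaries are lost and a phrase like "not satisfied" can never match. If the text of each review is kept as a separate string, the reviews that contain a keyword can be filtered as a whole. The class name `NegativeReviewFilter`, the method `filter`, and the sample review strings below are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class NegativeReviewFilter {

    // Returns the review texts that contain at least one keyword or phrase.
    // Matching is case-insensitive and collapses runs of whitespace to one space,
    // so a multi-word phrase such as "not satisfied" can match.
    public static List<String> filter(List<String> reviews, List<String> keywords) {
        List<String> matches = new ArrayList<>();
        for (String review : reviews) {
            String normalized = review.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
            for (String keyword : keywords) {
                if (normalized.contains(keyword.toLowerCase(Locale.ROOT))) {
                    matches.add(review); // keep the whole review, not just the word
                    break;               // one hit is enough for this review
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("bad", "horrible", "not satisfied");
        // Hypothetical review texts standing in for text extracted from the page
        List<String> reviews = Arrays.asList(
                "Battery life is BAD and the phone heats up.",
                "Great camera, totally worth it.",
                "I am not   satisfied with the delivery.");
        for (String hit : filter(reviews, keywords)) {
            System.out.println(hit);
        }
    }
}
```

With jsoup, the input list might come from something like `doc.select("div.review-text").eachText()`, where `div.review-text` is a placeholder selector and not the real markup of the page in the question. The actual selector depends on each site's HTML, which is also why a single selector is unlikely to generalize to any web page. Note that this sketch uses plain substring matching, so "bad" would also match "badge"; word-boundary matching would need a regular expression.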