Java Android: Parsing nested tables with Jsoup

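I'm trying to parse an HTML page online and use Jsoup to retrieve data from a table. The page I want to parse contains multiple tables.

How can I do this?

Here is a sample page I want to parse:

The data I want to extract is the model name and the URL of the details page.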

Edit:

Here is some code I use to pull data from the details page:

            try {
                /**
                 * Works to iterate through the items at the following website
                 * https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html
                 */
                URL url = new URL("https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html");
                
                Document doc = Jsoup.parse(url, 3000);
                
                // spec_table is the name of the class associated with the table
                Elements table = doc.select("table.spec_table");
                Elements rows = table.select("tr");
                
                Iterator<Element> rowIterator = rows.iterator();
                rowIterator.next();
                boolean wasMatch = false;
                
                // Loop through all items in list
                while (rowIterator.hasNext()) {
                    Element row = rowIterator.next();
                    Elements cols = row.select("td");
                    String rowName = cols.get(0).text();
                }
            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }

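I've been reading tutorials and documentation, but I can't figure out how to navigate the page to extract the data I'm looking for. I understand HTML and CSS, but I'm only just learning Jsoup.

(I tagged this as Android because that is where I'm using the Java code. I guess it wasn't necessary to be that specific.)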

This looks like what you're after:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.io.IOException;
    import java.net.URL;

    public class CpuWorld {

        public static void main(String[] args) throws IOException {
            URL url = null;
            try {
                /**
                 * Works to iterate through the items at the following website
                 * https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html
                 */
                url = new URL("https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html");
            } catch (IOException e) {
                e.printStackTrace();
            }

            Document doc = Jsoup.parse(url, 3000);

            // spec_table is the name of the class associated with the table
            // Pick the row whose cell contains "Model Number", then the link inside it
            String modelNumber = doc.select("table tr:has(td:contains(Model Number)) td b a").text();
            String modelUrl = doc.select("table tr:has(td:contains(Model Number)) td b a").attr("href");

            System.out.println(modelNumber + ": " + modelUrl);
        }
    }

Let me know if that's not what you wanted.

Edit: Result:

    A4-3300: https://www.cpu-world.com/CPUs/K10/AMD-A4-Series%20A4-3300.html
    Process finished with exit code 0
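The selector in the answer above combines two Jsoup pseudo-selectors: `:contains(text)` matches an element whose text includes the given string (case-insensitively), and `:has(selector)` matches an element that has a matching descendant. A minimal sketch of that technique follows, run against inline HTML so that it needs no network access. The markup in `SAMPLE_HTML`, the class name `ModelLinkSketch`, and the helper `extractModelLink` are all assumptions made for illustration; the real page layout may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModelLinkSketch {

    // Hypothetical, simplified markup (an assumption, not copied from the real site).
    public static final String SAMPLE_HTML =
            "<table class=\"spec_table\">"
          + "<tr><td>Family</td><td><b>"
          + "<a href=\"/info/AMD/AMD_A4-Series.html\">AMD A4-Series</a></b></td></tr>"
          + "<tr><td>Model number</td><td><b>"
          + "<a href=\"/CPUs/K10/AMD-A4-Series%20A4-3300.html\">A4-3300</a></b></td></tr>"
          + "</table>";

    /**
     * Returns {link text, absolute URL} for the row labelled "Model number",
     * or null when no such row or link exists.
     */
    public static String[] extractModelLink(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative href values later on.
        Document doc = Jsoup.parse(html, baseUri);

        // tr:has(td:contains(...)) keeps only rows with a cell holding the label;
        // "td b a" then descends to the link inside that row.
        Element link = doc
                .select("table.spec_table tr:has(td:contains(Model number)) td b a")
                .first();
        if (link == null) {
            return null;
        }
        return new String[] { link.text(), link.absUrl("href") };
    }

    public static void main(String[] args) {
        String[] result = extractModelLink(SAMPLE_HTML, "https://www.cpu-world.com/");
        if (result == null) {
            System.out.println("No matching row found");
        } else {
            System.out.println(result[0] + ": " + result[1]);
        }
    }
}
```

Using `first()` with a null check avoids an exception when the label is missing, and `absUrl("href")` returns a full URL even when the page uses relative links.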

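Edit:

This is madder than a box of frogs, but here we go... I'll let you put 2 and 2 together and iterate through the URLs to get the individual details you're after: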

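    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.springframework.web.client.RestTemplate;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;
    import java.util.stream.Collectors;

    public class CpuWorld {

        public static final String CPU_WORLD_COM_URL = "https://www.cpu-world.com/info/AMD/AMD_A4-Series.html";
        public static final String SCRAMBLED_DATA_HEADER = "

Yes, this is possible. Show us what you have tried so far and add some code. There are plenty of tutorials online. By the way, why did you tag this question with android?

Did you just click the View Source button in Chrome or Edge? This page is an AJAX page; specifically, the table is loaded by JavaScript, so JSoup is not going to get the data you are requesting. Do you want the URL and model number of each product in the table? Is that your question? I can work out a solution, but it won't use the JSoup library.

I pressed F12 in Chrome and looked at the code there. It never occurred to me that this might be the case (AJAX), only that I have been able to get data from many other websites without problems. How do you know it is AJAX, so that I know what to look for next time? Yes, the URL and model number are what I'm looking for.

OK... Question 1, how do you know it is JavaScript, AJAX (or Angular JS, TypeScript, React JS): print out the HTML that is downloaded when you poll the web server... The View Source button sometimes helps, but sometimes it confuses you, because the HTML it shows is not always the HTML that was downloaded when the server was first polled. Question 2, is there a way to get the information from a script-loaded page? Yes, sort of... there are script-execution packages, but I'm sick and need to lie down, so I can't write it up right now... :) Sorry for the "free comments" (no answer)... I need to sleep first.

I copied the code you posted directly... I get: HTTP-403: [Forbidden], category: client error. I run Java on Google Cloud Platform, so unless CPU World bans GCP, I still don't understand... (I don't use JSoup, but I have the JAR and ran this class, and received an HTTP 403.)

Interesting, but why are you running it in GCP? I suspect there may be an IP range filter, so it may be pro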
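The check suggested in the comments above (print the HTML that the server actually returns, then see whether the data you saw in the browser is in it) can be sketched with the JDK's built-in HTTP client (Java 11+). The class name `RawHtmlCheck`, the helper names, and the marker string `"A4-3300"` are assumptions chosen for illustration. Whether the site answers with 200 or with 403, as one commenter reported, depends on the server and is not something this sketch can predict.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class RawHtmlCheck {

    /**
     * Returns true when the marker (for example a model name seen in the
     * browser) occurs in the raw HTML, ignoring case.
     */
    public static boolean isInRawHtml(String rawHtml, String marker) {
        return rawHtml != null && marker != null
                && rawHtml.toLowerCase().contains(marker.toLowerCase());
    }

    /** Downloads the page exactly as the server sends it; no JavaScript is executed. */
    public static HttpResponse<String> fetch(String pageUrl)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(pageUrl))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString());
    }

    public static void main(String[] args) throws Exception {
        String pageUrl = "https://www.cpu-world.com/info/AMD/AMD_A4-Series.html";
        String marker = "A4-3300"; // assumed: a value visible in the rendered table

        HttpResponse<String> response = fetch(pageUrl);
        // A 403 here means the server refused the request outright.
        System.out.println("HTTP status: " + response.statusCode());
        System.out.println(response.body());
        // false suggests the table is filled in later by JavaScript.
        System.out.println("Marker present in raw HTML: "
                + isInRawHtml(response.body(), marker));
    }
}
```

If the marker is absent from the raw response but visible in the browser, the data is most likely injected by a script after the page loads, which is the situation the comments describe.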