Java Jsoup: extract p, ul, h3 and img

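Basically, I'm trying to extract the p, ul, h3 and img elements from the URL shown in the code below.

The problem I'm facing now is getting all the content to display one after another, similar to the website's layout.

I tried a for loop to generate absolute img links, but doing so breaks the layout.

Here is the code I'm using: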

   String url = "http://www.hardwarezone.com.sg/review-sony-playstation-4-does-greatness-await";
   Document doc = Jsoup.connect(url).get();
   Elements content = doc.select("#content p, #content table ul, #content h3");
   Elements img = doc.select("#content [src]"); 

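After extracting all the elements you want, loop over them and replace the src of every img element with the image's absolute URL. You can retrieve that URL with a function of the Node class in Jsoup: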

// Print the absolute URL of each element that has a src attribute
for (Element bb : img)
    System.out.println(bb.attr("abs:src"));
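The approach described above, rewriting each src in place and then reading the elements back with a single combined selector, can be sketched as follows. This is a minimal sketch under assumptions, not the answerer's exact code: the class name, the extract helper and the inline HTML are made up for illustration, and the fragment is parsed from a string with a base URI instead of being fetched. It relies on Jsoup returning the matches of a comma-separated selector in document order, which is what keeps the layout intact.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteSrcSketch {

    // Hypothetical helper (not from the original post): returns the outer HTML
    // of each matched element, in document order, with absolute src values.
    static List<String> extract(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative URLs. When a page is fetched
        // with Jsoup.connect(url).get(), the base URI is set from that URL.
        Document doc = Jsoup.parse(html, baseUri);

        // Pass 1: rewrite every src under #content to its absolute form, in place.
        for (Element e : doc.select("#content [src]")) {
            e.attr("src", e.absUrl("src"));
        }

        // Pass 2: one combined selector, so the matches keep the page order.
        List<String> out = new ArrayList<>();
        for (Element e : doc.select(
                "#content p, #content table ul, #content h3, #content img")) {
            out.add(e.outerHtml());
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative fragment with a relative image path.
        String html = "<div id=\"content\">"
                + "<h3>Rear ports</h3>"
                + "<img src=\"/files/img/2013/12/rearports.jpg\">"
                + "<p>The entire rear side is covered in cooling vents.</p>"
                + "</div>";
        for (String block : extract(html, "http://www.hardwarezone.com.sg/")) {
            System.out.println(block);
        }
    }
}
```

Because the src attributes are rewritten before anything is printed, an image nested inside a p also comes out with its absolute URL.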

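Update:

Add the loop at the end of this post to remove the img elements from inside the p elements while keeping the p elements themselves. It addresses output like the following, in which the same image appears twice, once inside its p and once on its own: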

<p class="rtecenter"><img src="http://www.hardwarezone.com.sg/files/img/2013/12/rearports.jpg" width="700" height="232" title="The entire rear side is covered in cooling vents. This is also the first Playstation to ditch all analog connectors." alt="" /></p>
<img src="http://www.hardwarezone.com.sg/files/img/2013/12/rearports.jpg" width="700" height="232" title="The entire rear side is covered in cooling vents. This is also the first Playstation to ditch all analog connectors." alt="" />
<p>Rather frustratingly, especially for a next-gen console that is expected to last at least the next five years, the PS4 doesn't support the Wireless 802.11ac standard, instead utilizing the older 802.11b/g/n network, and even then 5GHz bands are not supported! So you're stuck with&nbsp;2.4 GHz speeds. This makes a wired connection almost mandatory, as downloading games or even large update files over wireless can be extremely sluggish.</p>
<h3 class="page_title">&nbsp;</h3>  
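The duplication in the output above arises when one selector matches both a p and the img nested inside it, so printing every match emits the image twice. A minimal sketch that reproduces this on an inline fragment (the class name, the matchedTags helper and the shortened HTML are illustrative assumptions, not part of the original post):

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DoubleImageSketch {

    // Hypothetical helper: lists the tag name of every element matched by a
    // selector that covers both paragraphs and anything with a src attribute.
    static List<String> matchedTags(String html) {
        Document doc = Jsoup.parse(html);
        List<String> tags = new ArrayList<>();
        for (Element e : doc.select("#content p, #content [src]")) {
            tags.add(e.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\">"
                + "<p class=\"rtecenter\"><img src=\"rearports.jpg\"></p>"
                + "<p>Rather frustratingly, ...</p>"
                + "</div>";
        // The first p already contains the img, and the img is matched again
        // on its own, so the image would be printed twice.
        System.out.println(matchedTags(html));
    }
}
```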

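Is omitting the braces of the for loop intentional?

In this example, yes.

What do you mean by "doing so, the layout breaks"?

Previously I did not declare img separately; #content [src] was on the same line as #content p, #content table ul, #content h3. The printed layout was similar to the site I was trying to get the content from. However, I could not get absolute links that way, so some images refused to display. By creating a separate for loop I can get the absolute links, but I cannot keep the same layout as with the earlier code, because either the loop runs first and outputs the image links, or the other content is extracted first.

I tried the code. It partly worked. Now some of the old non-absolute links are still there. This causes a "double image" problem: there are two identical images, but only the one with the absolute URL can be viewed :( I really appreciate your effort.

@Wilson Does this also happen with the same URL you provided in the original question? If I loop over and print all the img elements after replacing src with the absolute URL, I find 17 images on that page, and they all have absolute URLs.

Strange. Hmm, I found the problem. Apparently some of the p elements are also pulling in the img elements.

@Wilson Yes, that makes sense. See my updated answer for another loop that removes the img elements from the p elements while still keeping all the text in the p.

Great! But another problem has come up :( Now, using e.owntext, all the href links are removed.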

Loop from the updated answer:

for (Element e : content) {
    if (e.nodeName().equals("p")) {
        for (Element child : e.children()) {
            if (child.nodeName().equals("img")) child.remove();
        }
    }
}
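
To see the effect of the loop above, and why the follow-up about e.owntext loses links, here is a hedged sketch on an inline fragment. The class name, the demo helper and the HTML are illustrative assumptions. It uses ownText(), text() and html() from Jsoup's Element class: ownText() returns only the text that sits directly inside the element, so text wrapped in a child element such as a is dropped, while html() keeps the child markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class RemoveNestedImgSketch {

    // Hypothetical helper: removes img children from each p, then returns
    // { ownText, text, html } of the first p so the three can be compared.
    static String[] demo(String html) {
        Document doc = Jsoup.parse(html);
        Elements content = doc.select("p");

        // Same loop as in the updated answer.
        for (Element e : content) {
            if (e.nodeName().equals("p")) {
                // children() returns a separate list, so removing a child
                // while iterating over it is safe.
                for (Element child : e.children()) {
                    if (child.nodeName().equals("img")) child.remove();
                }
            }
        }

        Element p = content.first();
        return new String[] { p.ownText(), p.text(), p.html() };
    }

    public static void main(String[] args) {
        String html = "<p>See the <a href=\"http://example.com/review\">full review</a>"
                + " here.<img src=\"rearports.jpg\"></p>";
        String[] result = demo(html);
        System.out.println("ownText(): " + result[0]); // link text is missing
        System.out.println("text():    " + result[1]); // link text kept, markup gone
        System.out.println("html():    " + result[2]); // a element and href kept
    }
}
```

If the links need to survive, printing html() or outerHtml() of the cleaned p is one option, since both keep the a elements and their href attributes.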