Java: How to get the image source and description from HTML data using Jsoup

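I am trying to parse an Atom feed, using the ROME API to extract the entries. The Atom feed gives me a content property that contains the article's image and description. Here is the URL of the Atom feed: [link missing]. Now I want to extract the image and the description from the content part.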

 <entry>
<id>tag:news.google.com,2005:cluster=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222</id>
<title type="html">'Not Just GST Stuck In Parliament. Matter of Sorrow': PM Narendra Modi - NDTV</title>
<updated>2015-12-10T06:03:54Z</updated>
<link rel="alternate" type="text/html" href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=in&amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52779006372283&amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222" hreflang="en"/>
<content type="html">&lt;table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;">&lt;tr>&lt;td width="80" align="center" valign="top">&lt;font style="font-size:85%;font-family:arial,sans-serif">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222">&lt;img src="//t3.gstatic.com/images?q=tbn:ANd9GcSNi4SJFo9q9PXKPOjJkiUlfk2GFRzRoBlwK6UsiSQ8np66JDvgQiYTdN4Fknntb7bVjdR-NuM" alt="" border="1" width="80" height="80">&lt;br>&lt;font size="-2">NDTV&lt;/font>&lt;/a>&lt;/font>&lt;/td>&lt;td valign="top" class="j">&lt;font style="font-size:85%;font-family:arial,sans-serif">&lt;br>&lt;div style="padding-top:0.8em;">&lt;img alt="" height="1" width="1">&lt;/div>&lt;div class="lh">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222">&lt;b>&amp;#39;Not Just GST Stuck In Parliament. Matter of Sorrow&amp;#39;: PM &lt;b>Narendra Modi&lt;/b>&lt;/b>&lt;/a>&lt;br>&lt;font size="-1">&lt;b>&lt;font color="#6f6f6f">NDTV&lt;/font>&lt;/b>&lt;/font>&lt;br>&lt;font size="-1">With repeated disruptions stalling legislation including the GST or Goods and Services Tax, Prime Minister &lt;b>Narendra Modi&lt;/b> today said it was a &amp;quot;matter of sorrow&amp;quot; that Parliament was not running. 
&amp;quot;It is not only GST, but many pro-poor steps are stuck in&amp;nbsp;...&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNEVhO7UtISsITzRIFwxTVFwK8BTDQ&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.india.com/news/india/narendra-modis-stern-message-to-congress-democracy-cannot-run-on-whims-of-some-773082/">&lt;b>Narendra Modi&amp;#39;s&lt;/b> stern message to Congress: Democracy cannot run on whims of some&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>India.com&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNGkBqqpn2OhEI6w68lLCIXMDppu-Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.mid-day.com/articles/jagran-forum-catch-pm-narendra-modi-other-leaders-live/16757192">Jagran Forum: Catch PM &lt;b>Narendra Modi&lt;/b>, other leaders live&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>Mid-Day&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNHPkB8Wy_-cDqqZrdfcn1cVUKP-Kg&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.oneindia.com/india/democracy-cant-be-restricted-to-elections-only-narendra-modi-1951641.html">Democracy can&amp;#39;t be restricted to elections only, says &lt;b>Narendra Modi&lt;/b>&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>Oneindia&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1" class="p">&lt;a 
href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNFhxDKEsImpQqu0GccMt4MCiPydVw&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.abplive.in/india-news/everyone-must-feel-he-or-she-is-working-for-indias-progress-says-narendra-modi-258229">&lt;nobr>ABP Live&lt;/nobr>&lt;/a>&lt;/font>&lt;br>&lt;font class="p" size="-1">&lt;a class="p" href="http://news.google.com/news/more?ncl=dac7xEJd70rfdkM8gcjOwSJn8BK9M&amp;amp;authuser=0&amp;amp;ned=in">&lt;nobr>&lt;b>all 29 news articles&amp;nbsp;&amp;raquo;&lt;/b>&lt;/nobr>&lt;/a>&lt;/font>&lt;/div>&lt;/font>&lt;/td>&lt;/tr>&lt;/table></content>
</entry>
But it returns nothing. Also, I don't know how to go on to extract the description:

&lt;br>&lt;font size="-1">NEW DELHI: Putting the Ufa process back on track India and Pakistan on Wednesday signaled process of reducing tensions by announcing Comprehensive Bilateral Dialogue to be led by Foreign Secretaries and prepared the ground for a visit by Prime&amp;nbsp;...&lt;/font>

Please help me solve this problem.

Try this code. Note that the feed is fetched directly through Jsoup.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the Atom feed directly with Jsoup.
Document news = Jsoup.connect("http://news.google.com/news/section?output=atom&ned=in&q=narendra%20modi").get();

int i = 0;
for (Element entryContent : news.select("entry > content")) {
    System.out.format("\n## ENTRY %d\n", ++i);
    // text() yields the unescaped HTML of the entry, which is parsed a second time.
    // The selector matches every image with a src attribute and the description font element.
    for (Element el : Jsoup.parse(entryContent.text()).select("img[src], tr td.j font[size]:nth-of-type(2)")) {

        String elementTagName = el.tagName();

        if (elementTagName.equalsIgnoreCase("img")) {
            System.out.println("src attribute is : " + el.attr("src"));
        } else if (elementTagName.equalsIgnoreCase("font")) {
            System.out.println("description is : " + el.text());
        } else {
            System.out.println("Unexpected element >> " + el.html());
        }
    }
}
Sample output

## ENTRY 1
src attribute is : //t0.gstatic.com/images?q=tbn:ANd9Gc...
description is : With repeated disruptions stalling legislation including the GST or Goods and Services Tax, Prime Minister Narendra Modi today said it was a "matter of sorrow" that Parliament was not running. "It is not only GST, but many pro-poor steps are stuck in ...

## ENTRY 2
src attribute is : //t1.gstatic.com/images?q=tbn:ANd9Gc...
description is : PM Narendra Modi tops Facebook's most-popular list
(...)

Tested on Jsoup 1.8.3.
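The src values in the sample output are protocol-relative (they start with `//`), so they usually need a scheme before they can be downloaded. The following is a minimal sketch of one way to do that with the JDK's `java.net.URI`; the class name `SrcResolver`, the method `toAbsolute`, and the choice of the feed address as the base URL are assumptions made for illustration, not part of the original answer.

```java
import java.net.URI;

// Hypothetical helper: turns a protocol-relative or relative src value
// into an absolute URL by resolving it against a base URL.
class SrcResolver {

    // baseUrl is assumed to be the address the feed was fetched from.
    static String toAbsolute(String baseUrl, String src) {
        URI base = URI.create(baseUrl);
        // resolve() keeps the base scheme when src starts with "//",
        // and returns src unchanged when it is already absolute.
        return base.resolve(src).toString();
    }

    public static void main(String[] args) {
        String base = "http://news.google.com/news/section";
        System.out.println(toAbsolute(base, "//t3.gstatic.com/images?q=tbn:example"));
    }
}
```

When the entry HTML is parsed with `Jsoup.parse(html, baseUri)` instead of `Jsoup.parse(html)`, `el.absUrl("src")` should give a comparable result without a separate helper.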

You can use ROME to extract the entries (or Jackson with the XML plugin). Then take the HTML content of each entry and parse it with Jsoup (unescaping the entities beforehand). Then take the returned HTML elements and search for the img tags to extract the src attribute.

When I run the code, I get the following exception:

Exception in thread "main" org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/atom+xml, URL=
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:472)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)

@GarimaTripathi Which version of Jsoup are you using?

@GarimaTripathi Alternatively, you can call ignoreContentType(true) to work around this problem. The news can then be downloaded like this:

Document news = Jsoup.connect("http://news.google.com/news/section?output=atom&ned=in&q=narendra%20modi").ignoreContentType(true).get();
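The route suggested in the first comment (read the entries with an XML-aware tool, then hand the unescaped HTML to Jsoup) can be sketched roughly as follows. This sketch uses the JDK's built-in DOM parser as a stand-in for ROME or Jackson, which is a substitution and not what the commenter proposed; the class name `AtomContentReader` and the method `contentHtml` are hypothetical. An XML parser resolves entities such as `&lt;` on its own, so the returned strings are plain HTML that could be passed to `Jsoup.parse`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the unescaped HTML of every Atom <content> element.
class AtomContentReader {

    private static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    static List<String> contentHtml(String atomXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // Atom elements live in a namespace
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(atomXml)));

            NodeList nodes = doc.getElementsByTagNameNS(ATOM_NS, "content");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() returns the text with XML entities already resolved.
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the Atom XML", e);
        }
    }
}
```

Each returned string can then be queried with the same selector as in the answer, for example `Jsoup.parse(html).select("img[src]")`.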