Android: problem parsing HTML with jsoup


I am trying to parse an RSS feed using jsoup.

My code is:

    doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get();

    Elements items = doc.select("item");
    Log.d(TAG, "Items size : " + items.size());
    for (Element item : items) {
        Log.d(TAG, "in for loop of items");

        Element titleElement = item.select("title").first();
        mTitle = titleElement.text();
        Log.d(TAG, "title is : " + mTitle);

        Element linkElement = item.select("link").first();
        mLink = linkElement.text();
        Log.d(TAG, "link is : " + mLink);

        Element descElement = item.select("description").first();
        mDesc = descElement.text();
        Log.d(TAG, "description is : " + mDesc);
    }

I get the following output (the link is empty and the description still contains HTML tags):

in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : 
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.<div class="feedflare"> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo" border="0"></img></a> </div><img src="http://feeds.feedburner.com/~r/reuters/audio/newsmakerus/rss/mp3/~4/NX3AY96GfGk" height="1" width="1"/>

What should I change in my code?

How can I achieve my goal? Please help me.

Thanks in advance.

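There are two problems with the rss content you are fetching:

1. The link text is not inside the <link> tag but outside it.
2. The description tag contains some escaped html content.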
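The first problem can be reproduced offline. The sketch below is only an illustration: the `SAMPLE` string is a hand-written stand-in for the real feed, and the class and method names are made up for this example. It shows that jsoup's default HTML parser leaves `<link>` empty and attaches the URL to `<item>` as text. As an alternative that the original answer does not use, it also parses the same string with jsoup's XML parser, which keeps the URL inside `<link>`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

final class RssLinkDemo {
    // Hypothetical, hand-written item standing in for the real feed.
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio.mp3</link>"
        + "</item></channel></rss>";

    /** Text of <link> when the feed is parsed as HTML (jsoup's default). */
    static String linkWithHtmlParser(String feed) {
        Document doc = Jsoup.parse(feed);
        Element link = doc.select("item > link").first();
        return link == null ? "" : link.text();
    }

    /** Text that belongs directly to <item> after HTML parsing. */
    static String itemOwnText(String feed) {
        Document doc = Jsoup.parse(feed);
        Element item = doc.select("item").first();
        return item == null ? "" : item.ownText();
    }

    /** Text of <link> when the same feed is parsed as XML. */
    static String linkWithXmlParser(String feed) {
        Document doc = Jsoup.parse(feed, "", Parser.xmlParser());
        Element link = doc.select("item > link").first();
        return link == null ? "" : link.text();
    }

    public static void main(String[] args) {
        System.out.println("HTML parser, <link> text : [" + linkWithHtmlParser(SAMPLE) + "]");
        System.out.println("HTML parser, item ownText: [" + itemOwnText(SAMPLE) + "]");
        System.out.println("XML parser,  <link> text : [" + linkWithXmlParser(SAMPLE) + "]");
    }
}
```

For a live feed, the same XML parser can be selected on the connection with `Jsoup.connect(url).parser(Parser.xmlParser()).get()`. Whether that suits the real feed better than reading `item.ownText()` is a judgment call.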

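Please find the modified code below.

When viewing the URL in a browser, I also found some clean html content which, when parsed, makes it easy to extract the required fields. You can get that content by setting the userAgent in Jsoup to that of a browser. But how you fetch the content is up to you.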
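Setting a browser-like user agent could look roughly like the sketch below. The user-agent string, the 10-second timeout, and the class and method names are illustrative assumptions, not values from the original answer. The method only builds the connection; no request is sent until `get()` is called.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

final class UserAgentDemo {
    // Assumed example value; any current browser user-agent string could be used.
    static final String BROWSER_UA =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    /** Prepares (but does not execute) a connection that identifies as a browser. */
    static Connection browserLikeConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(BROWSER_UA)
                .timeout(10_000); // milliseconds; an assumed, finite timeout
    }

    public static void main(String[] args) {
        Connection conn = browserLikeConnection(
            "http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/");
        // Inspect the header that would be sent, without touching the network.
        System.out.println(conn.request().header("User-Agent"));
    }
}
```

Calling `browserLikeConnection(url).get()` would then fetch and parse the page. Which markup the server returns for a given user agent depends on the server.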

    // Uses org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
    // org.jsoup.parser.Parser and org.jsoup.select.Elements.
    // timeout(0) means "no timeout": the request waits indefinitely.
    Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();
    System.out.println(doc.html());
    System.out.println("================================");
    Elements items = doc.select("item");
    for (Element item : items) {

        Element titleElement = item.select("title").first();
        String mTitle = titleElement.text();
        System.out.println("title is : " + mTitle);

        /*
         * After parsing, the link in the rss looks as follows:
         *  <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
         * jsoup's HTML parser treats <link> as an empty element, so the URL
         * ends up as a text node of <item> rather than inside <link>.
         */
        String mLink = item.ownText();
        System.out.println("link is : " + mLink);

        Element descElement = item.select("description").first();
        /*
         * Unescape the html content, parse it to a doc, and then fetch only the
         * text, leaving behind all the html tags in the content.
         * "/" is a dummy baseURI, as we don't care about resolving the links
         * within the parsed content.
         */
        String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
        System.out.println("description is : " + mDesc);

    }
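The description step can be tried without a network connection. In the sketch below, the `ESCAPED` string is a shortened, hand-written stand-in for a feed description, and the class and method names are made up for this example. It unescapes the entities and lets jsoup drop the tags.

```java
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;

final class DescriptionTextDemo {
    // Hypothetical escaped description, shortened from the kind seen in the feed.
    static final String ESCAPED =
        "Fair share of benefits."
        + "&lt;div class=\"feedflare\"&gt;"
        + "&lt;a href=\"http://example.com/a\"&gt;"
        + "&lt;img src=\"http://example.com/i\" border=\"0\"&gt;"
        + "&lt;/a&gt;&lt;/div&gt;";

    /** Unescapes entities, parses the result as HTML, and keeps only the text. */
    static String plainText(String escapedHtml) {
        String html = Parser.unescapeEntities(escapedHtml, false);
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        System.out.println(plainText(ESCAPED));
    }
}
```

`Jsoup.parse(html)` is used here instead of `Parser.parse(html, "/")` only because no base URI is needed; both return a parsed document.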


Could you please take a look at this link?
