Python RSS web scraping: selecting the correct elements

python xml rss

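I posted a question asking for help with the output format of data pulled from an RSS feed.

The answer I got was exactly what I needed, and the output is now formatted as required.

The updated code is as follows: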

import urllib2
from urllib2 import urlopen
import re
import cookielib
from cookielib import CookieJar
import time

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent','Mozilla/5.0')]

def main():
    try:
        page = 'http://feeds.link.co.uk/thelink/rss.xml'
        sourceCode = opener.open(page).read()

        try:
            titles = re.findall(r'<title>(.*?)</title>',sourceCode)
            desc = re.findall(r'<description>(.*?)</description>',sourceCode)
            links = re.findall(r'<link>(.*?)</link>',sourceCode)
            pub = re.findall(r'<pubDate>(.*?)</pubDate>',sourceCode)

            for i in range(len(titles)):
                print titles[i]
                print desc[i]
                print links[i]
                print pub[i]
                print ""

        except Exception, e:
            print str(e)

    except Exception, e:
        print str(e)

main()

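This runs as required and prints to the console, but when it finishes I get a "list index out of range" error because the element counts do not match.

The XML I am pulling data from uses some of the same elements in its header (the channel section), which puts the titles, descriptions, and links out of order and causes the error.

The XML looks like this: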

<rss>  
  <channel> 
    <title>Title1</title>  #USING THIS WOULD BE OK, BUT **
    <link>http://link.co.uk</link>  
    <description>The descriptor</description>  
    <language>en-gb</language>  
    <lastBuildDate>Sat, 18 Jan 2014 06:32:19 GMT</lastBuildDate>  
    <copyright>Usable</copyright>  
    <image> #**THIS IS THE AREA I WANT TO EXCLUDE!!
      <url>http://link.co.uk.1gif</url>  
      <title>Title2</title> #DONT WANT THIS ELEMENT!! 
      <link>http://link.co.uk/info</link>  
      <width>120</width>  
      <height>60</height> 
    </image>  #**THIS IS THE AREA I WANT TO EXCLUDE!!
    <ttl>15</ttl>  
    <atom:link href="http://thelink" rel="self" type="application/rss+xml"/>  ###
    <item> #I WANT TO START THE SCRAPE FROM HERE!!
      <title>Title3</title>  
      <description>This will be the first decription.</description>  
      <link>http://www.thelink3.co.uk</link>  
      <guid isPermaLink="false">http://www.thelink.co.uk/5790820</guid>  
      <pubDate>Sat, 18 Jan 2014 09:53:10 GMT</pubDate>  
    </item>  
    <item> 
      <title>Title4</title>  
      <description>This will be the second description.</description>  
      <link>http://www.thelink3.co.uk/second link</link>  
      <guid isPermaLink="false">http://www.thelink.co.uk/5790635</guid>  
      <pubDate>Sat, 18 Jan 2014 09:56:14 GMT</pubDate>   
    </item>  #I WANT THE SCRAPE TO END HERE
</rss>
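
To make the mismatch concrete, here is a minimal sketch (written for Python 3, unlike the Python 2 code above) that counts how many matches each of the four regular expressions finds. `SAMPLE_FEED` and `count_matches` are hypothetical names, and the feed is a trimmed copy of the sample above, not the real one.

```python
import re

# Hypothetical, trimmed-down feed modelled on the sample XML above.
SAMPLE_FEED = """<rss>
  <channel>
    <title>Title1</title>
    <link>http://link.co.uk</link>
    <description>The descriptor</description>
    <image>
      <url>http://link.co.uk.1gif</url>
      <title>Title2</title>
      <link>http://link.co.uk/info</link>
    </image>
    <item>
      <title>Title3</title>
      <description>This will be the first decription.</description>
      <link>http://www.thelink3.co.uk</link>
      <pubDate>Sat, 18 Jan 2014 09:53:10 GMT</pubDate>
    </item>
    <item>
      <title>Title4</title>
      <description>This will be the second description.</description>
      <link>http://www.thelink3.co.uk/second link</link>
      <pubDate>Sat, 18 Jan 2014 09:56:14 GMT</pubDate>
    </item>
  </channel>
</rss>"""

TAGS = ["title", "description", "link", "pubDate"]


def count_matches(source):
    """Count the matches each whole-document regex finds, one entry per tag."""
    return {
        tag: len(re.findall(r"<%s>(.*?)</%s>" % (tag, tag), source))
        for tag in TAGS
    }


counts = count_matches(SAMPLE_FEED)
for tag in TAGS:
    print(tag, counts[tag])
```

With this sample there are four titles and four links but only three descriptions and two pubDates, so a loop over `range(len(titles))` runs past the end of the shorter lists, and the entries it does print are paired with the wrong neighbours.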


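Is there a way to change the Python code so that it ignores the header elements and only uses the regular elements below them?

I have checked several RSS feeds and they are built the same way, so I would use this code and just change the URL to scrape several RSS feeds for display on a Raspberry Pi console.

Many thanks for your help.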

Have you tried using BeautifulSoup4? It would make finding the elements you want much easier.

The code would look like this:

title = soup.find('title')
if title:
    print title.text

Also, to avoid the "list index out of range" error, you can first check that the list has enough elements:

if i >= len(titles):  # the list doesn't have index i
    return
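
A bounds check stops the exception, but the lists can still be out of step with each other. One possible alternative that keeps the regex approach is to isolate each `<item>` block first and then search inside it. The sketch below is written for Python 3 and makes a few assumptions: `SAMPLE_FEED`, `first_match`, and `extract_items` are hypothetical names, the feed is a cut-down stand-in for the real one, and `<item>` tags are assumed to carry no attributes.

```python
import re

# Hypothetical sample modelled on the feed in the question (not the real feed).
SAMPLE_FEED = """<rss><channel>
<title>Title1</title>
<link>http://link.co.uk</link>
<description>The descriptor</description>
<image><title>Title2</title><link>http://link.co.uk/info</link></image>
<item>
  <title>Title3</title>
  <description>This will be the first decription.</description>
  <link>http://www.thelink3.co.uk</link>
  <pubDate>Sat, 18 Jan 2014 09:53:10 GMT</pubDate>
</item>
<item>
  <title>Title4</title>
  <description>This will be the second description.</description>
  <link>http://www.thelink3.co.uk/second link</link>
  <pubDate>Sat, 18 Jan 2014 09:56:14 GMT</pubDate>
</item>
</channel></rss>"""

FIELDS = ("title", "description", "link", "pubDate")
# re.DOTALL lets '.' cross line breaks, since an item spans several lines.
ITEM_RE = re.compile(r"<item>(.*?)</item>", re.DOTALL)


def first_match(tag, block):
    """Return the text of the first <tag>...</tag> in block, or '' if absent."""
    match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), block, re.DOTALL)
    return match.group(1).strip() if match else ""


def extract_items(source):
    """Return one dict per <item> block, so the fields stay paired per item."""
    return [
        {tag: first_match(tag, block) for tag in FIELDS}
        for block in ITEM_RE.findall(source)
    ]


for entry in extract_items(SAMPLE_FEED):
    for tag in FIELDS:
        print(entry[tag])
    print("")
```

Because every field is looked up inside its own item, the channel header and the `<image>` section are never matched, and a missing field yields an empty string rather than an index error. In Python 3 the bytes returned by `read()` would need decoding to text before being passed in.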

I hope this helps :)

You should use a proper XML parser rather than regular expressions, for example:
from bs4 import BeautifulSoup

data = sourceCode # your sourceCode variable from your main() function

soup = BeautifulSoup(data)
for item in soup.find_all('item'):
    for tag in ['title', 'description', 'link', 'pubdate']:
        print(tag.upper(), item.find(tag).text)
    print()

Output:
TITLE Title3
DESCRIPTION This will be the first decription.
LINK 
PUBDATE Sat, 18 Jan 2014 09:53:10 GMT

TITLE Title4
DESCRIPTION This will be the second description.
LINK 
PUBDATE Sat, 18 Jan 2014 09:56:14 GMT
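
Note that the LINK values in the output above are empty, which is what one would expect if the parser in use treats `<link>` as an empty HTML element. As a possible alternative that needs no third-party package, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, written for Python 3. `SAMPLE_FEED` and `parse_items` are hypothetical names, and the sample is a cleaned-up, well-formed stand-in for the feed in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modelled on the feed in the question.
SAMPLE_FEED = """<rss>
  <channel>
    <title>Title1</title>
    <link>http://link.co.uk</link>
    <description>The descriptor</description>
    <image>
      <title>Title2</title>
      <link>http://link.co.uk/info</link>
    </image>
    <item>
      <title>Title3</title>
      <description>This will be the first decription.</description>
      <link>http://www.thelink3.co.uk</link>
      <pubDate>Sat, 18 Jan 2014 09:53:10 GMT</pubDate>
    </item>
    <item>
      <title>Title4</title>
      <description>This will be the second description.</description>
      <link>http://www.thelink3.co.uk/second link</link>
      <pubDate>Sat, 18 Jan 2014 09:56:14 GMT</pubDate>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "description", "link", "pubDate")


def parse_items(xml_text):
    """Return one dict per <item>; fields missing from an item become ''."""
    root = ET.fromstring(xml_text)
    return [
        {tag: (item.findtext(tag) or "").strip() for tag in FIELDS}
        for item in root.iter("item")
    ]


for entry in parse_items(SAMPLE_FEED):
    for tag in FIELDS:
        print(tag.upper(), entry[tag])
    print()
```

Only elements inside `<item>` are visited, so the channel header and the `<image>` block are skipped, and the link text is preserved. A real feed must be well-formed XML for this to work; for instance, a prefix such as `atom:` has to be declared in the document.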

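Well, what can I say... BeautifulSoup could have saved me a lot of typing :)

Is BeautifulSoup available on Windows? Sorry, I should have mentioned that I am writing the code in Python for Windows and will port it to Linux once it is ready.

@塞曼: Yes, it is available for Windows. To install it with pip, run pip install beautifulsoup4. I'm not used to using Python on Windows, but if you have pip, I'd say you can install BeautifulSoup on Windows; here is a related post: installing and running BeautifulSoup4 on Windows.

Now back to coding. Thanks for your help.

Yes, I could use BeautifulSoup, but my code almost does what I need. I just need to select the correct elements. It also needed a time.sleep(10) call added, and with that it works a treat. Thanks everyone for your help.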
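
The last comment mentions adding a `time.sleep(10)` call. One way such a pause might be used is to re-read the feed at a fixed interval. The sketch below is only an illustration under that assumption: `poll`, `fetch`, and `handle` are hypothetical names, and the demo uses a stand-in fetch function and a very short interval instead of a real network request.

```python
import time


def poll(fetch, handle, interval_seconds=10, rounds=3, sleep=time.sleep):
    """Call handle(fetch()) `rounds` times, pausing between consecutive rounds."""
    for round_number in range(rounds):
        handle(fetch())
        # No pause is needed after the final round.
        if round_number < rounds - 1:
            sleep(interval_seconds)


# Demo with a stand-in fetch function and a very short pause.
seen = []
poll(lambda: "<rss></rss>", seen.append, interval_seconds=0.01, rounds=2)
print(len(seen))
```

In real use `fetch` would download the feed and `handle` would parse and print the items; passing `sleep` as a parameter keeps the function easy to test without waiting.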