Python: extracting titles and links from a homepage

python rss

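I want to build my own RSS feed with Python.

Is it possible to extract the titles and the download links ("Uploaded") from hd-area.org?

Here is one

This is what I have done so far: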

import urllib2
from BeautifulSoup import BeautifulSoup
import re

page = urllib2.urlopen("http://hd-area.org").read()
soup = BeautifulSoup(page)

for title in soup.findAll("div", {"class" : "title"}):
    print (title.getText())
for a in soup.findAll('a'):
  if 'Uploaded.net' in a:
    print a['href']

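It already extracts the titles.

But I am stuck at the point where it should extract the links.

It does extract them, but in the wrong order.

Any suggestions on how I can make sure the script first checks that both the "title" and the "link" are inside this div class?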

Edit

I am done now.

This is the final code.

Thanks, everyone, you pushed me in the right direction.

import urllib2
from BeautifulSoup import BeautifulSoup 
import datetime
import PyRSS2Gen

print "top_rls"
page = urllib2.urlopen("http://hd-area.org/index.php?s=Cinedubs").read()
soup = BeautifulSoup(page)
movieTit = []
movieLink = []
for title in soup.findAll("div", {"class" : "title"}):
    movieTit.append(title.getText())

for span in soup.findAll('span', attrs={"style":"display:inline;"},recursive=True):
    for a in span.findAll('a'):            
        if 'ploaded' in a.getText():
            movieLink.append(a['href'])
        elif 'cloudzer' in a.getText():
            movieLink.append(a['href'])

for i in range(len(movieTit)):
    print movieTit[i]
    print movieLink[i]

rss = PyRSS2Gen.RSS2(
    title = "HD-Area Cinedubs",
    link = "http://hd-area.org/index.php?s=Cinedubs",
    description = " "
                  " ",

    lastBuildDate = datetime.datetime.now(),
    # Build one RSS item per (title, link) pair, capped at the first ten entries.
    # zip() stops at the shorter list, so a missing link no longer raises IndexError.
    items = [
        PyRSS2Gen.RSSItem(title=title, link=link)
        for title, link in zip(movieTit, movieLink)[:10]
    ])

rss.write_xml(open("cinedubs.xml", "w"))
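
For readers without PyRSS2Gen, the feed-writing step can also be sketched with the standard library alone. This is a minimal Python 3 sketch (unlike the Python 2 code above), not the method used in the thread: `build_rss` is a hypothetical helper, the entries are made-up stand-ins for the scraped titles and links, and the channel description simply reuses the feed title.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_rss(feed_title, feed_link, entries):
    """Return an RSS 2.0 document (as a string) for (title, link) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_title
    # RSS dates use the RFC 822 format, which formatdate() produces
    ET.SubElement(channel, "lastBuildDate").text = formatdate(usegmt=True)
    for title, link in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


# Made-up entries standing in for the scraped titles and links
entries = [("Movie A", "http://example.com/a"), ("Movie B", "http://example.com/b")]
xml_text = build_rss("HD-Area Cinedubs", "http://hd-area.org/index.php?s=Cinedubs", entries)
print(xml_text)
```

Because the items are generated in a loop, the feed grows or shrinks with the number of scraped entries instead of being fixed at ten.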

Something like this, then:

movieTit = []
movieLink = []

for title in soup.findAll("div", {"class" : "title"}):
    movieTit.append(title.getText())
for a in soup.findAll('a'):
    if 'ploaded' in a.getText():
        movieLink.append(a['href'])

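# Walk both lists two entries at a time, stopping before either list runs out
for i in range(0, min(len(movieTit), len(movieLink)) - 1, 2):
    print movieTit[i]
    print movieTit[i+1]
    print movieLink[i]
    print movieLink[i+1]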
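
The index arithmetic behind this kind of pairing can be sketched on its own, without any scraping. The following is a minimal Python 3 sketch under a stated assumption: every title is followed by exactly `links_per_title` links in page order, optionally after `skip` unrelated leading links. The helper `pair_titles_with_links` and the sample data are hypothetical.

```python
def pair_titles_with_links(titles, links, links_per_title=2, skip=0):
    """Group a flat list of links under their titles.

    Assumes the page lists exactly `links_per_title` links per title,
    in order, after `skip` unrelated leading links.
    """
    usable = links[skip:]
    pairs = []
    for index, title in enumerate(titles):
        start = index * links_per_title
        group = usable[start:start + links_per_title]
        if not group:
            break  # ran out of links; stop rather than pair wrongly
        pairs.append((title, group))
    return pairs


# Made-up sample data, not taken from the real site
titles = ["Movie A", "Movie B"]
links = [
    "http://example.com/nav",
    "http://example.com/a1", "http://example.com/a2",
    "http://example.com/b1", "http://example.com/b2",
]

for title, group in pair_titles_with_links(titles, links, skip=1):
    print(title, group[0])
```

This only works while the page keeps a fixed number of links per entry; grouping by the enclosing container is more robust when that is not guaranteed.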

One suggestion: first find all the

<div class="topbox">

elements, if there is more than one on the page. You can use the find_all function or the find function, like this:

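soup = BeautifulSoup(page)
# if you want to find all of them
for item in soup.find_all('div', class_='topbox'):
    # in this line you have to check for the title, or other things
    # check whether the tag exists
    if item.span is not None:
        title = item.span.text
    # same for this one
    if item.a is not None:
        link = item.a['href']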

I could not find the div you want on the page. If you need anything else, please tell me exactly what you want.
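
The idea of reading the title and the link from the same container, so that they can never get out of step, can be sketched as follows. To stay dependency-free this Python 3 sketch swaps BeautifulSoup for the standard library's `html.parser`; the class `TopboxParser`, the sample markup, and the assumption that each entry sits in a `div` with class `topbox` containing a `div` with class `title` are all hypothetical and may not match the real page.

```python
from html.parser import HTMLParser


class TopboxParser(HTMLParser):
    """Collect one (title, link) pair per <div class="topbox"> block."""

    def __init__(self):
        super().__init__()
        self.entries = []      # finished (title, link) pairs
        self.depth = 0         # <div> nesting depth inside the current topbox
        self.in_title = False  # True while reading <div class="title">
        self.title = None
        self.link = None
        self.href = None       # href of the <a> currently being read
        self.text = []         # text fragments of the current title or link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self.depth:
                self.depth += 1
                if "title" in classes:
                    self.in_title, self.text = True, []
            elif "topbox" in classes:
                # A new entry starts: reset the per-entry state
                self.depth, self.title, self.link = 1, None, None
        elif tag == "a" and self.depth:
            self.href, self.text = attrs.get("href"), []

    def handle_data(self, data):
        if self.in_title or self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            label = "".join(self.text).lower()
            # Keep only the first link whose text mentions "ploaded"
            if self.link is None and "ploaded" in label:
                self.link = self.href
            self.href = None
        elif tag == "div" and self.depth:
            if self.in_title:
                self.title, self.in_title = "".join(self.text).strip(), False
            self.depth -= 1
            # Leaving the topbox: store the pair only if both parts were found
            if self.depth == 0 and self.title and self.link:
                self.entries.append((self.title, self.link))


# Made-up markup for illustration only
SAMPLE = """
<div class="topbox">
  <div class="title">Movie A</div>
  <span style="display:inline;"><a href="http://example.com/a">Uploaded.net</a></span>
  <span style="display:inline;"><a href="http://example.com/a2">Cloudzer.net</a></span>
</div>
<div class="topbox">
  <div class="title">Movie B</div>
  <span style="display:inline;"><a href="http://example.com/b">uploaded.to</a></span>
</div>
"""

parser = TopboxParser()
parser.feed(SAMPLE)
for title, link in parser.entries:
    print(title, link)
```

Since each pair is assembled inside one container, an entry with a missing link is simply skipped instead of shifting every later link by one position.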

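What do you mean: wrong order?

Yes. I think that is what I was trying to say with my bad English :)

Oh, I meant: what do you mean by "the order is wrong"?

When you visit hd-area.org, each movie has 2 download links. Every entry I scrape should produce 1 title + 1 download link, and so on, alternating. Right now it does not do that. First it scrapes all the titles and then all the download links.

Assuming you have 2 titles and 2 links per movie, I rewrote this.

Because the loop looks like it works... I do not know why, but the first two links do not belong to the movie. Is it possible to start from the third link for the first movie?

Got it, +2. Easy as pie ;)

I just noticed something. Sometimes they change the domain extension. How can I ignore that? uploaded*???

Change the if to: if 'ploaded' in a.getText(), so that it also does not matter whether the U is capitalized or not.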
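
The last two comments, about ignoring both the capital letter and the changing domain extension, can also be handled with a case-insensitive regular expression instead of the 'ploaded' substring trick. This is a minimal Python 3 sketch; the pattern and the helper `is_uploaded_link` are illustrative assumptions, not something taken from the site.

```python
import re

# Matches "uploaded" followed by any alphabetic domain extension, ignoring case.
# The pattern is an illustrative assumption about how the link text looks.
UPLOADED = re.compile(r"\buploaded\.[a-z]{2,}\b", re.IGNORECASE)


def is_uploaded_link(label):
    """Return True if the link text names an Uploaded host."""
    return bool(UPLOADED.search(label))


for label in ["Uploaded.net", "uploaded.to", "UPLOADED.NET", "Cloudzer.net"]:
    print(label, is_uploaded_link(label))
```

Compared with the substring check, the pattern is stricter: it requires the full word and a dot-separated extension, so unrelated text that merely contains "ploaded" is not matched.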