Python: using a regular expression to find RSS links in a web page

python regex rss

I am trying to find the RSS links on a website. But my code also returns img src and CSS links, because their URLs contain the string "rss".

Here is my code:

import urllib2
import re

website = urllib2.urlopen("http://www.apple.com/rss")
html = website.read()
links = re.findall('"((http)s?://.*rss.*)"',html)
for link in links:
    print link

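I don't recommend parsing HTML with regular expressions. There are better tools for finding the links on a web page. My favorite is ...

The above lets you iterate over every link. You then need to work out whether a given link refers to an RSS feed. Here are some ways you can do that: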

- Look for RSS-related keywords in the URL.
- Make a request and check the response type (`application/rss+xml`).
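As a rough illustration of the first approach, here is a minimal sketch that collects the `href` of every `<a>` tag and keeps the ones whose URL contains an RSS-related keyword. The answer's preferred tool is not named in this text, so the sketch falls back on the standard library's `html.parser`; the class name `LinkCollector`, the function `candidate_feed_links`, the keyword list, and the sample markup are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchors count as links here; <img src> and <link href> are skipped.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def candidate_feed_links(html, base_url, keywords=("rss", "feed", "atom")):
    """Return links whose URL contains a keyword. This is a heuristic only."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links
            if any(word in url.lower() for word in keywords)]


# Hypothetical markup: the stylesheet and the image both contain "rss"
# in their paths, but neither is an <a> tag, so neither is returned.
sample = """
<link rel="stylesheet" href="/css/rss-page.css">
<img src="/images/rss_icon.png">
<a href="/feeds/news.rss">News</a>
<a href="/support/">Support</a>
"""
print(candidate_feed_links(sample, "http://www.example.com/rss"))
```

Because only anchor tags are inspected, the img and CSS URLs from the question never reach the keyword filter. A keyword match is still only a guess about what the URL serves.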

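Without actually checking the server response, you cannot know whether something is RSS. Something like

http://www.example.com/f

could be an RSS feed. You can only be sure after checking.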

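import re

# Remove everything above the main container.
# The "container" and "callout" markers are taken from the target page's markup.
# re.DOTALL lets "." match newlines, so the pattern can span several lines.
html = re.sub(r'.*?<div id="container">', "", html, count=1, flags=re.DOTALL)

# Remove everything from the callout block to the end of the page.
html = re.sub(r'<div class="callout">.*', "", html, flags=re.DOTALL)

# Then match the links inside the list items.
links = re.findall(r'<li[^>]*>\s*<a href="(https?://[^"]*)"', html, re.IGNORECASE)
# You can put the text "rss" inside the pattern if you want.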
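The closing remark about putting the text "rss" inside the pattern could look roughly like the sketch below. The sample markup and URLs are hypothetical and exist only to exercise the pattern.

```python
import re

# Hypothetical markup, used only to exercise the pattern.
html = '''
<link rel="stylesheet" href="http://www.example.com/css/rss.css">
<img src="http://www.example.com/img/rss.png">
<ul>
  <li class="feed"> <a href="http://www.example.com/feeds/news.rss">News</a></li>
  <li><a href="http://www.example.com/about">About</a></li>
</ul>
'''

# [^"]* cannot cross a closing quote, so each match stays inside one href value.
# The literal "rss" in the middle restricts matches to URLs that contain it.
pattern = re.compile(r'<li[^>]*>\s*<a href="(https?://[^"]*rss[^"]*)"', re.IGNORECASE)
rss_links = pattern.findall(html)
print(rss_links)
```

The pattern in the question used `.*`, which is greedy and can run past the closing quote into later attributes on the same line. The `[^"]*` form avoids that, and anchoring on `<li ...><a href=` keeps stylesheet and image URLs out of the result.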

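But what if the sites are not all the same? I am doing this for many web pages.

I do this by hand, and I don't know of a better solution. If you would rather not do it manually, you can parse every href link on the page (RSS and non-RSS alike), send a HEAD request to each one, and check the server response. If it carries a header such as "Content-Type: application/xml", the link is probably an RSS feed, although a generic XML type does not prove it. This approach is much slower and also costs bandwidth.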
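The HEAD-request check described here might be sketched as follows, using only the standard library. The function names, the timeout, and the set of media types treated as "probably a feed" are assumptions for the example; only `application/rss+xml` and `application/xml` are mentioned in this thread.

```python
import urllib.request

# Media types treated as "probably a feed". This set is an assumption;
# the Atom and text/xml entries are not mentioned in the thread.
FEED_TYPES = {"application/rss+xml", "application/atom+xml",
              "application/xml", "text/xml"}


def is_feed_content_type(header_value):
    """Return True if a Content-Type header value names a feed-like media type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in FEED_TYPES


def head_content_type(url, timeout=10):
    """Issue a HEAD request and return the Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def filter_feeds(urls, get_content_type=head_content_type):
    """Keep the URLs whose response looks like a feed; skip requests that fail."""
    feeds = []
    for url in urls:
        try:
            if is_feed_content_type(get_content_type(url)):
                feeds.append(url)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            continue
    return feeds
```

`filter_feeds` takes the lookup function as a parameter so that the header logic can be tried without network access, for example by passing a dictionary's `get` method in place of `head_content_type`. Some servers answer HEAD differently from GET, so a result from this check is best treated as a strong hint.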