Python: automatically extracting feed links (Atom, RSS, etc.) from web pages


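I have a huge list of URLs, and my task is to feed them to a Python script that should output the feed URLs for each page, if it has any. Is there an API, library, or existing code that can help?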

I don't know of any existing library, but Atom or RSS feeds are usually indicated by <link> tags in the <head> section, like this:

<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed">
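As a rough illustration of picking out tags like these, here is a minimal sketch that uses only the standard library's `html.parser`. The names `FeedLinkParser`, `find_feed_links` and `FEED_TYPES` are made up for this example, and the set of MIME types is an assumption that may need extending for real pages.

```python
from html.parser import HTMLParser

# MIME types treated as feeds here; this set is an assumption, extend as needed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects (href, type) pairs from <link rel="alternate"> feed tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens
        rel_tokens = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower().strip()
        if "alternate" in rel_tokens and mime in FEED_TYPES and attr.get("href"):
            self.feeds.append((attr["href"], mime))


def find_feed_links(html_text):
    """Return a list of (href, mime_type) tuples found in the HTML text."""
    parser = FeedLinkParser()
    parser.feed(html_text)
    return parser.feeds
```

Calling `find_feed_links(page_source)` on the downloaded text of a page returns the declared feeds in document order. The href values are returned as written, so they may still be relative.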

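The straightforward way is to download each URL, parse it with an HTML parser, and read the href attribute of the relevant link tags.

Depending on how well-formed the data is (for example: are all links of the form http://.../? Do you know whether they will all be in href attributes or link tags? Will every link point to another feed?), I'd recommend anything from a simple regular expression to a full parsing module to extract the links.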
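For the "simple regex" end of that range, the following is a minimal sketch, assuming the markup is regular enough that no attribute value contains a `>` character and that every attribute value is quoted. The function name `feed_hrefs_by_regex` and both patterns are illustrative, not taken from any library, and a real parser is the safer choice for messy HTML.

```python
import re

# Matches a whole <link ...> tag; attribute order inside it is not assumed.
LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
# Matches name="value" or name='value' pairs inside a tag.
ATTR = re.compile(r"""([\w-]+)\s*=\s*(["'])(.*?)\2""", re.DOTALL)

FEED_MIME_TYPES = ("application/rss+xml", "application/atom+xml")


def feed_hrefs_by_regex(html_text):
    """Return href values of <link rel="alternate"> tags with a feed type."""
    hrefs = []
    for tag in LINK_TAG.findall(html_text):
        # Build a dict of the tag's attributes, with lowercased names
        attrs = {name.lower(): value for name, _quote, value in ATTR.findall(tag)}
        is_alternate = "alternate" in attrs.get("rel", "").lower().split()
        is_feed = attrs.get("type", "").lower() in FEED_MIME_TYPES
        if is_alternate and is_feed and attrs.get("href"):
            hrefs.append(attrs["href"])
    return hrefs
```

Unquoted attribute values (such as `rel=alternate`) are silently skipped by this sketch, which is one of the reasons a parser tends to be more robust.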

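As far as parsing modules go, I can only recommend Beautiful Soup. Even the best parser will only get you so far, though. Especially in the case mentioned above, where you can't guarantee that every link in the data points to another feed, you will have to do some additional crawling and probing of your own.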
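One possible form of that extra probing is to fetch a candidate link and check whether the response is actually a feed. The sketch below assumes the document body has already been downloaded and looks only at the XML root element. The function name `looks_like_feed` is hypothetical, and for untrusted input a hardened XML parser may be preferable to the standard library one used here.

```python
import xml.etree.ElementTree as ET


def looks_like_feed(document):
    """Return 'rss', 'atom', 'rdf' or None based on the XML root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # Not well-formed XML, so most likely an ordinary HTML page
        return None
    # Strip a "{namespace}" prefix, e.g. "{http://www.w3.org/2005/Atom}feed"
    local_name = root.tag.rsplit("}", 1)[-1].lower()
    # RSS 2.0 uses <rss>, Atom uses <feed>, RSS 1.0 uses <rdf:RDF>
    return {"rss": "rss", "feed": "atom", "rdf": "rdf"}.get(local_name)
```

A link whose target returns `None` here can be dropped, or crawled further if it turns out to be another HTML page.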

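I second waffle paradox's recommendation of BeautifulSoup for parsing the HTML and then getting the link tags in which the feeds are referenced. The code I usually use: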

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

def detect_feeds_in_HTML(input_stream):
    """ examines an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of tuples ``(url, feed_type)``
    :rtype: ``list(tuple(str, str))``
    """
    # check if really an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # parse the textual data (the HTML) from the input stream
    html = BeautifulSoup(input_stream.read(), "html.parser")
    # find all link tags whose rel attribute is "alternate"
    feed_urls = html.find_all("link", rel="alternate")
    # extract URL and type
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # if a valid URL is there, keep it together with its declared type
        if url:
            result.append((url, feed_link.get("type")))
    return result
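Note that href values are taken from the page as written, so they can be relative (for example `/feed.xml`). A small sketch, assuming you know the URL each page was fetched from, resolves them with the standard library. The helper name `absolutize` is made up for this example.

```python
from urllib.parse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    # urljoin leaves already-absolute URLs unchanged
    return [urljoin(page_url, href) for href in hrefs]
```

For a page at `http://example.com/blog/post.html`, an href of `/feed.xml` resolves against the site root, while `atom.xml` resolves against the `/blog/` directory.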

There's feedfinder:

feedfinder is no longer maintained, but a successor is now available.

>>> import feedfinder
>>>
>>> feedfinder.feed('scripting.com')
'http://scripting.com/rss.xml'
>>>
>>> feedfinder.feeds('scripting.com')
['http://delong.typepad.com/sdj/atom.xml', 
 'http://delong.typepad.com/sdj/index.rdf', 
 'http://delong.typepad.com/sdj/rss.xml']
>>>