Python: Is there a way to filter a feed by content in a Calibre recipe?


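I am using Calibre to download feeds from various news sources and send them to my Kindle. I would like to know whether a custom recipe can download only the articles that have a "magic" keyword in the title or in the content. For the title this is quite simple if you use a custom recipe and override the parse_feeds method: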

from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1425579653(BasicNewsRecipe):
    title          = 'MY_TITLE'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True
    feeds          = [
        ('MY_TITLE', 'MY_FEED_URL'),
    ]

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            # Iterate over a copy so articles can be removed from the original list
            for article in feed.articles[:]:
                # The keyword must be written in upper case to match the upper-cased title
                if 'MY_MAGIC_KEYWORD' not in article.title.upper():
                    feed.articles.remove(article)
        return feeds
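
For reference, the title test can be tried outside Calibre. This is only a minimal sketch: `filter_articles_by_title` is a hypothetical helper, not part of the Calibre API, and the `SimpleNamespace` objects are stand-ins that model nothing but the `title` attribute used above. It lowercases both sides, so the keyword does not have to be written in upper case, and it tolerates a missing title.

```python
from types import SimpleNamespace


def filter_articles_by_title(articles, keyword):
    """Return the articles whose title contains keyword, ignoring case.

    Hypothetical helper for illustration; it is not a Calibre function.
    """
    needle = keyword.lower()
    # Treat a missing title (None) as an empty string so it never matches
    return [a for a in articles if needle in (a.title or '').lower()]


# Stand-in article objects: only the .title attribute is modelled here
articles = [
    SimpleNamespace(title='Magic in the news'),
    SimpleNamespace(title='Weather report'),
    SimpleNamespace(title=None),
]
kept = filter_articles_by_title(articles, 'MAGIC')
print([a.title for a in kept])
```

Inside parse_feeds, something like `feed.articles = filter_articles_by_title(feed.articles, 'magic')` should give the same result as the remove loop, assuming feed.articles is an ordinary list as the code above suggests.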

However, since I cannot access feed.content in the parse_feeds method, I would like to know whether there is another way to do this for the article content.

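I found a solution, thanks to Kovid Goyal, the maintainer of Calibre. The idea is to override preprocess_html: if the article content does not meet your criteria, you can return None. In my case the logic looks like this: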

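import re  # goes at the top of the recipe; needed for re.compile below

def preprocess_html(self, soup):
    # Keep the article if the keyword appears in the page title
    if 'MY_MAGIC_KEYWORD' in soup.html.head.title.string.upper():
        return soup
    # Keep the article if the keyword appears anywhere in the text, ignoring case
    if len(soup.findAll(text=re.compile('my_magic_keyword', re.IGNORECASE))) > 0:
        return soup
    # No match: return None so the article is dropped
    return None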

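You can also override preprocess_raw_html to achieve the same effect. The difference is that in preprocess_raw_html you have to work with the HTML as a string, whereas in preprocess_html the HTML has already been parsed into a Beautiful Soup object.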
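
For the string-based route, here is a rough sketch of how the keyword test might look when all you have is raw HTML. It uses only the standard library; `raw_html_mentions` and `_TextCollector` are hypothetical names made up for this example, not Calibre functions. How the result is then used inside preprocess_raw_html (for instance returning None when there is no match, as with preprocess_html above) is an assumption based on the answer, not something verified here.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element
        if not self._skip_depth:
            self.parts.append(data)


def raw_html_mentions(raw_html, keyword):
    """Return True if keyword occurs in the title or body text, ignoring case.

    Hypothetical helper for illustration; tag names and attribute values
    are not searched, only text nodes.
    """
    collector = _TextCollector()
    collector.feed(raw_html)
    collector.close()
    return keyword.lower() in ' '.join(collector.parts).lower()


page = '<html><head><title>News</title></head><body><p>Some MAGIC here</p></body></html>'
print(raw_html_mentions(page, 'magic'))
```

Searching only the text nodes avoids false positives from markup, such as a CSS class that happens to be named after the keyword.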