Python: scrape a URL only if it is newer than a given lastmod date (Scrapy, sitemap)

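Hi, I only want to scrape pages whose lastmod date is newer than a specific date. For example: scrape a URL only if its lastmod is 2017-09-14 or later.

I use this code to scrape all pages, but I can't restrict it based on the lastmod date: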

import requests
from scrapy.spiders import SitemapSpider
from urllib.parse import urljoin


class MySpider(SitemapSpider):
    name = 'sitemap_spider'
    robots_url = 'http://www.example.org/robots.txt'

    sitemap_urls = [robots_url]
    sitemap_follow = ['products-eg-ar']

    def parse(self, response):
        print(response.url)

This is my robots.txt:

sitemap: /sitemap-products-eg-ar-index-1-local.xml

sitemap-products-eg-ar-index-1-local.xml contains:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
     <loc>/sitemap-products-eg-ar-1.xml</loc>
  </sitemap>
  <sitemap>
     <loc>/sitemaps/sitemap-products-eg-ar-2.xml</loc>
  </sitemap>
</sitemapindex>

The product sitemaps listed there contain entries like this:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>/product-8112041/i/</loc>
  <priority>0.8</priority>
  <lastmod>2017-06-17</lastmod>
  <changefreq>daily</changefreq>
 </url>
</urlset>

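This is not possible with the standard SitemapSpider class. You have to subclass it and modify its _parse_sitemap method, which is where urlset is processed. Since this method internally uses the iterloc function from the sitemap module, a dirtier solution is to redefine that function so that it takes the lastmod element into account. Something like this: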

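import datetime
import scrapy.spiders.sitemap

# Only URLs last modified on or after this date are scraped.
oldest = datetime.datetime.strptime('2017-09-14', '%Y-%m-%d')

def _iterloc(it, alt=False):
    for d in it:
        # Entries without <lastmod> (e.g. those in a sitemap index) are kept.
        # Only the leading YYYY-MM-DD part is compared, so values that also
        # carry a time component still parse.
        lastmod = d.get('lastmod')
        if lastmod and datetime.datetime.strptime(lastmod[:10], '%Y-%m-%d') < oldest:
            continue
        yield d['loc']

        # Also consider alternate URLs (xhtml:link rel="alternate")
        if alt and 'alternate' in d:
            for l in d['alternate']:
                yield l

# Monkey-patch the module-level function that SitemapSpider calls internally.
scrapy.spiders.sitemap.iterloc = _iterloc

# your spider code here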
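
To check the date comparison on its own, without running a crawl, here is a minimal standard-library sketch of the same filtering idea. It is not part of Scrapy: the helper names iter_entries and is_recent, the SAMPLE document, and the extra /product-1, /product-2 and /product-3 paths are made up for illustration. It assumes lastmod values start with YYYY-MM-DD and keeps entries whose date is missing or cannot be parsed.

```python
import datetime
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Cut-off date from the question: keep entries modified on or after it.
OLDEST = datetime.date(2017, 9, 14)

# Hypothetical sample; only the first <url> comes from the question.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url><loc>/product-8112041/i/</loc><lastmod>2017-06-17</lastmod></url>
 <url><loc>/product-1/i/</loc><lastmod>2017-09-14</lastmod></url>
 <url><loc>/product-2/i/</loc><lastmod>2017-10-01T08:30:00+00:00</lastmod></url>
 <url><loc>/product-3/i/</loc></url>
</urlset>"""


def iter_entries(xml_text):
    """Yield one dict per <url> element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        entry = {}
        for child in url:
            tag = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            entry[tag] = (child.text or "").strip()
        yield entry


def is_recent(entry, oldest=OLDEST):
    """Return False only when lastmod parses and is older than `oldest`."""
    lastmod = entry.get("lastmod")
    if not lastmod:
        return True  # no date given: keep the entry
    try:
        day = datetime.datetime.strptime(lastmod[:10], "%Y-%m-%d").date()
    except ValueError:
        return True  # unexpected format: keep rather than drop silently
    return day >= oldest


recent_locs = [e["loc"] for e in iter_entries(SAMPLE) if is_recent(e)]
print(recent_locs)
```

Whether undated entries should be kept or skipped is a judgment call. Keeping them is the safer default here, because the sitemap index entries shown in the question carry no lastmod at all.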