Python regular expression for a website

python regex

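I want a regular expression for a website (the BBC business news pages that the spider below crawls).

So far I am using the following:

.+\/news\/business[-.]\d{8}$

This is part of the following snippet, used with Scrapy: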

from scrapy.item import Item, Field
# scrapy.contrib.* has been deprecated since Scrapy 1.0; these are the current module paths
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class TryItem(Item):
    url = Field()

class BbchrcrawlerSpider(CrawlSpider):
    name = "bbchrcrawler"
    allowed_domains = ["www.bbc.com"]
    start_urls = ['http://www.bbc.com/news/business-']

    rules = (Rule(LinkExtractor(allow=['.+\/news\/business+\-d{8}$']), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # lowercase name so the local variable does not shadow the imported Item class
        item = TryItem()
        item['url'] = response.url
        yield item

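What is the correct way to match the URLs so that multiple pages with the same format are extracted?

The result should collect URLs in the following format: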

bbc.com/news/business-######

You can try this:

pattern = r"bbc\.com/news/business-\d+"
rules = (Rule(LinkExtractor(allow=[pattern]), callback='parse_item', follow=True),)
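A quick way to sanity-check this pattern outside Scrapy is to run it against a few sample URLs with Python's re module. This is only a sketch: the URLs are made-up examples in the format from the question, not real article addresses, and re.search is used on the assumption that the link extractor accepts a URL when the pattern matches anywhere in it.

```python
import re

# Pattern from the answer above, written as a raw string
pattern = r"bbc\.com/news/business-\d+"

# Hypothetical sample URLs, for illustration only
samples = [
    "http://www.bbc.com/news/business-12345678",  # wanted format
    "http://www.bbc.com/news/business",           # no numeric suffix
    "http://www.bbc.com/news/world-12345678",     # different section
]

# Keep only the URLs in which the pattern is found somewhere
matched = [url for url in samples if re.search(pattern, url)]
print(matched)
```

Only the first sample URL should survive the filter; the other two lack either the numeric suffix or the business section.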

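.+/news/business-\d{8}

should be enough. Your code uses a different pattern from the one you quoted. The

\-d{8}

in your code matches a literal - followed by eight d characters, not eight digits, because the backslash escapes the hyphen instead of the d.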
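The difference between the two patterns can be seen with a small check using the re module. This is a minimal sketch: both test URLs are hypothetical, and the second one exists only to show what the pattern from the spider actually accepts.

```python
import re

# Pattern as written in the spider from the question
broken = re.compile(r".+\/news\/business+\-d{8}$")
# Pattern suggested in this answer
fixed = re.compile(r".+/news/business-\d{8}")

# Hypothetical URLs, for illustration only
digits_url = "http://www.bbc.com/news/business-12345678"
letters_url = "http://www.bbc.com/news/business-dddddddd"

results = {
    "broken on digits": bool(broken.search(digits_url)),
    "broken on letters": bool(broken.search(letters_url)),
    "fixed on digits": bool(fixed.search(digits_url)),
    "fixed on letters": bool(fixed.search(letters_url)),
}

for name, ok in results.items():
    print(name, ok)
```

The pattern from the spider accepts only the URL ending in eight d characters, while the suggested pattern accepts only the one ending in eight digits.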
