How do I crawl an entire domain in Python instead of providing individual links?


Currently our spider works through a hard-coded list of URLs, and we'd like to change it so that it only needs the main domain.

How can we change the code below so that it only needs the domain, e.g.

https://www.example.com/shop/
A pointer to a good example would be great.

    def start_requests(self):
        urls = [
            # 'https://www.example.com/shop/outdoors-unknown-hart-creek-fleece-hoodie',
            'https://www.example.com/shop/adidas-unknown-essentials-cotton-fleece-3s-over-head-hoodie#repChildCatSku=111767466',
            'https://www.example.com/shop/unknown-metallic-long-sleeve-shirt#repChildCatSku=115673740',
            'https://www.example.com/shop/unknown-fleece-full-zip-hoodie#repChildCatSku=111121673',
            'https://www.example.com/shop/unknown-therma-fleece-training-hoodie#repChildCatSku=114784077',
            'https://www.example.com/shop/under-unknown-rival-fleece-crew-sweater#repChildCatSku=114636980',
            'https://www.example.com/shop/unknown-element-1-2-zip-top#repChildCatSku=114794996',
            'https://www.example.com/shop/unknown-element-1-2-zip-top#repChildCatSku=114794996',
            'https://www.example.com/shop/under-unknown-rival-fleece-full-zip-hoodie#repChildCatSku=115448841',
            'https://www.example.com/shop/under-unknown-rival-fleece-crew-sweater#repChildCatSku=114636980',
            'https://www.example.com/shop/adidas-unknown-essentials-3-stripe-fleece-sweatshirt#repChildCatSku=115001812',
            'https://www.example.com/shop/under-unknown-fleece-logo-hoodie#repChildCatSku=115305875',
            'https://www.example.com/shop/under-unknown-heatgear-long-sleeve-shirt#repChildCatSku=107534192',
            'https://www.example.com/shop/unknown-long-sleeve-legend-hoodie#repChildCatSku=112187421',
            'https://www.example.com/shop/unknown-element-1-2-zip-top#repChildCatSku=114794996',
            'https://www.example.com/shop/unknown-sportswear-funnel-neck-hoodie-111112208#repChildCatSku=111112208',
            'https://www.example.com/shop/unknown-therma-swoosh-fleece-training-hoodie#repChildCatSku=114784481',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Use the last path segment as a file name for the scraped data.
        page = response.url.split("/")[-1]
        filename = 'academy-%s.txt' % page
        res2 = response.xpath("//span[@itemprop='price']/text()|//span[@itemprop='sku']/text()").extract()
        res = '\n'.join(res2)
        with open(filename, 'w') as f:
            f.write(res)
        self.log('Saved file %s' % filename)


For a plain traversal, you can do:

class MySpider(scrapy.Spider):
    name = 'my'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/shop/']

    def parse(self, response):
        for link in response.css('a'):
            yield response.follow(link)

But as stated, this task seems pointless. Could you elaborate on what you're actually trying to do?

We were assigned to build a crawler for our internal site; mostly it was just an idea to see whether we could do this in-house. Either way, the code above works fine as it is: it parses the individual links/pages and returns the two pieces of data we want from each page. It was only meant as a simple demo to show that we can pull specific data out of a page.

The real question is whether we can make it easier by giving the spider just our domain and a root folder (the "shop" folder in this example) and letting it crawl everything underneath. I ran your code above and it works, but the one thing I noticed is that even though I provide a start URL, the spider jumps outside it and crawls the rest of the site. How do I make sure it only traverses under the start URL and never outside it?
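One way to keep the traversal under the start URL is to check each candidate link against the start URL's host and path prefix before following it. Here is a minimal sketch using only the standard library; `in_start_subtree` is a hypothetical helper name, and the commented-out loop shows how it could plug into the `parse` method from the answer above:

```python
from urllib.parse import urlparse

START = 'https://www.example.com/shop/'

def in_start_subtree(url, start=START):
    """True only for links on the same host whose path lies under the start path."""
    u, s = urlparse(url), urlparse(start)
    return u.netloc == s.netloc and u.path.startswith(s.path)

# Inside parse(), follow only links that pass the check:
#     for href in response.css('a::attr(href)').getall():
#         if in_start_subtree(response.urljoin(href)):
#             yield response.follow(href, callback=self.parse)
```

Alternatively, Scrapy's `CrawlSpider` with a `Rule(LinkExtractor(allow=r'/shop/'), follow=True)` achieves the same restriction declaratively; `allowed_domains` alone only constrains the host, not the path.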