Python: scraping and following links within href

python web-scraping scrapy

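I'm new to Scrapy. I need to follow hrefs from the home page URL down to multiple depths. The pages those hrefs point to contain multiple hrefs of their own, and I need to keep following them until I reach the page I want. Sample HTML from my pages:

Home page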

<div class="page-categories">
 <a class="menu"  href="/abc.html">
 <a class="menu"  href="/def.html">
</div>

Inside abc.html

<div class="cell category" >
 <div class="cell-text category">
 <p class="t">
  <a id="cat-24887" href="fgh.html"/>
</p>
</div>

I need to scrape the content of this fgh.html page.

Can anyone tell me where to start? I have read about LinkExtractor but could not find a proper reference. Thanks.

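From what I can see, URLs pointing to product categories always end with `.kat`, and URLs pointing to products contain `id_` followed by a set of digits.

Let's use this information to define the spider (the code is listed at the end of this page):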

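In other words, we ask the spider to follow every category link and to let us know when it crawls a link containing `id_`, which for us means we have found a product. In this case, for the sake of the example, I'm printing the page title to the console. This should give you a good starting point.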
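
To get a feel for how the two patterns `\.kat$` and `/id_\d+/` separate category links from product links, they can be tried outside Scrapy with the standard `re` module. This is only a rough sketch: it assumes the patterns are applied with search semantics against absolute URLs, the `classify` helper is hypothetical, and the sample URLs are invented for illustration rather than taken from the real site.

```python
import re

# The two patterns used in the spider's rules.
CATEGORY = re.compile(r"\.kat$")
PRODUCT = re.compile(r"/id_\d+/")


def classify(url):
    """Hypothetical helper: label a URL the way the two rules would split it."""
    # Checked in the same order as the rules: category first, then product.
    if CATEGORY.search(url):
        return "category"
    if PRODUCT.search(url):
        return "product"
    return "other"


# Invented sample URLs, for illustration only.
samples = [
    "http://www.codecheck.info/essen.kat",
    "http://www.codecheck.info/essen/id_12345/sample.pro",
    "http://www.codecheck.info/about.html",
]

for url in samples:
    print(classify(url), url)
```

Checking patterns this way before starting a crawl is a cheap way to catch a regex that is too broad or too narrow.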

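Can you share a link to the actual website you are crawling? Also, please share the code you have so far. And how do you know this is the link you need to follow: is it because it has an `id` attribute that starts with `cat-`?

I'm still learning and have been trying other, simpler tutorials. It would be really helpful if you could point out an approach rather than starting from actual code.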
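
As for the general approach, independent of Scrapy: fetch a page, collect its hrefs, resolve each one against the URL of the page it was found on, follow the ones not seen yet, and repeat up to some depth. Below is a minimal sketch of that loop using only the standard library. It makes several assumptions: the `PAGES` dictionary is a hypothetical in-memory stand-in for real HTTP responses, `example.com` is a placeholder host, and the markup is adapted from the samples in the question.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "http://example.com/": (
        '<div class="page-categories">'
        '<a class="menu" href="/abc.html"></a>'
        '<a class="menu" href="/def.html"></a>'
        "</div>"
    ),
    "http://example.com/abc.html": (
        '<div class="cell category"><div class="cell-text category">'
        '<p class="t"><a id="cat-24887" href="fgh.html"/></p>'
        "</div></div>"
    ),
    "http://example.com/def.html": "<p>no further links</p>",
    "http://example.com/fgh.html": "<title>Target page</title>",
}


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, max_depth=3):
    """Breadth-first crawl; returns {url: depth} for every page reached."""
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # do not look for links beyond the depth limit
        parser = HrefCollector()
        parser.feed(PAGES.get(url, ""))
        for href in parser.hrefs:
            # Resolve relative hrefs against the page they were found on.
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen[absolute] = depth + 1
                queue.append(absolute)
    return seen


print(crawl("http://example.com/"))
```

Scrapy takes care of the fetching, scheduling, and duplicate filtering in this loop, so a spider only has to say which links to follow and what to do with each response.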

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CodeCheckspider(CrawlSpider):
    name = "code_check"

    allowed_domains = ["www.codecheck.info"]
    start_urls = ['http://www.codecheck.info/']

    rules = [
        # Category pages end in ".kat": follow them, no callback needed.
        Rule(LinkExtractor(allow=r'\.kat$'), follow=True),
        # Product pages contain "/id_<digits>/": pass them to parse_product.
        Rule(LinkExtractor(allow=r'/id_\d+/'), callback='parse_product'),
    ]

    def parse_product(self, response):
        # Print the page title as a stand-in for real extraction logic.
        title = response.xpath('//title/text()').get()
        print(title)