Python: How to follow specific links and scrape content using Scrapy?

python html web-scraping scrapy

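Suppose I have a main page, index.html, and four subpages, 1.html…4.html. All subpages are linked from the main page in the same way.

How can I use Python's scrapy to follow these specific links and scrape their content according to a repeating pattern?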

Here is the setup (the index.html markup, the subpage markup, and my spider so far are listed further down).

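Note: this is a simplified example. In the original case, all URLs come from the web, and index.html contains many more links than just 1…4.html.

The question is how to follow exact links. They could be provided as a list, but ultimately they will come from an XPath selector: select the last column of a table, but only from every other row.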

Use CrawlSpider and specify a rule for the link extractor that describes which links to follow:

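Basically, the question is how to parse locally saved HTML files?

@alecxe No, I just simplified the example. The question is how to follow only certain links. I could create a list of them, e.g. ["url1.com/…", "url2.com/…"]. If that is unclear, I can expand the question…

Thanks, I'll try this; it looks promising to me…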
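If the wanted links really are available as a list, one option is to resolve every href found on the index page to an absolute URL and keep only those that appear in the list. The sketch below uses only the standard library; the function name `select_wanted_links` and all URLs are made up for illustration.

```python
from urllib.parse import urljoin


def select_wanted_links(base_url, hrefs, wanted_urls):
    """Resolve each href against base_url and keep only those in wanted_urls."""
    wanted = set(wanted_urls)
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Keep the first occurrence only, so each page is requested once
        if absolute in wanted and absolute not in kept:
            kept.append(absolute)
    return kept


# Hypothetical values, loosely modelled on the simplified example
WANTED = ["http://www.mydomain/1.html", "http://www.mydomain/3.html"]
hrefs_on_index = ["1.html", "2.html", "3.html", "4.html", "1.html"]

to_follow = select_wanted_links("http://www.mydomain/index.html", hrefs_on_index, WANTED)
print(to_follow)
```

Inside a Scrapy callback, the hrefs would typically come from something like `response.xpath("//a/@href").getall()`, and each kept URL would be yielded as a `scrapy.Request` with the desired callback.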

index.html:

<body>
<div class="one"><p>Text</p><a href="1.html">Link 1</a></div>
…
<div class="one"><p>Text</p><a href="4.html">Link 4</a></div>
</body>

1.html…4.html:

<body>
<div class="one"><p>Text to be scraped</p></div>
</body>

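from scrapy import Spider

# Spider from the question: it only loads index.html and does not follow any links yet
class IndexSpider(Spider):
    name = "index"
    allowed_domains = ["???"]  # placeholder left open in the question
    start_urls = [
        "index.html"  # simplified; Scrapy expects a full URL including the scheme
    ]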

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Older Scrapy releases exposed these as scrapy.contrib.spiders and
# scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor.
class MySpider(CrawlSpider):
    name = "mydomain"
    allowed_domains = ["www.mydomain"]
    start_urls = ["http://www.mydomain/index.html"]

    # Follow only links whose URL ends in <digits>.html, e.g. 1.html…4.html
    rules = (
        Rule(LinkExtractor(allow=(r"\d+\.html$",)), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # get the data, e.g. via response.xpath(...)
        pass
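When writing the `allow` pattern for the rule, it matters whether the dot is escaped and whether the pattern is anchored at a path boundary. The sketch below assumes the link extractor applies `allow` as a regular-expression search against each absolute URL (my understanding of Scrapy's behaviour, so treat it as an assumption); the URLs are invented for illustration.

```python
import re

unescaped = re.compile(r"\d+.html$")   # "." matches any single character
escaped = re.compile(r"\d+\.html$")    # "\." matches a literal dot only
strict = re.compile(r"/\d+\.html$")    # file name must consist of digits only

urls = [
    "http://www.mydomain/1.html",
    "http://www.mydomain/4.html",
    "http://www.mydomain/page12.html",
    "http://www.mydomain/about.html",
    "http://www.mydomain/1xhtml",
]

matches_unescaped = [u for u in urls if unescaped.search(u)]
matches_escaped = [u for u in urls if escaped.search(u)]
matches_strict = [u for u in urls if strict.search(u)]

for label, found in [
    ("unescaped", matches_unescaped),
    ("escaped", matches_escaped),
    ("strict", matches_strict),
]:
    print(label, found)
```

Under that assumption, the unescaped pattern would also accept a URL such as `…/1xhtml`, and both of the first two patterns accept `page12.html`; only the pattern with the leading slash restricts matches to purely numeric file names.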