Python: extracting information from links with Scrapy


I am trying to extract information from certain links, but I cannot reach those links: the information is being extracted from the start_url instead, and I do not know why.

Here is my code:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse')] 


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()

        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # Xpath selector for tag(s)

        print item['title']

        for cont, i in enumerate(item['link']):
            print "link: ", cont, i
I am not getting the links from "", but from "".


Why?

For the rules to work, you need to use a CrawlSpider, not the generic scrapy Spider.

Also, you need to rename your first parse function to something other than parse. Otherwise you will be overriding an important method of the CrawlSpider and it will not work. See the warning in the docs.

Your code is scraping links from "" because the regular Spider ignores the rules directive.

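As an aside on the allow=[r'Books'] pattern: each entry is treated as a regular expression, and (to the best of my understanding of the link extractor's behavior) a candidate URL is kept when the pattern matches anywhere in it, i.e. re.search semantics. A minimal pure-Python sketch of that filtering, with made-up example URLs:

```python
import re

def filter_links(urls, allow_patterns):
    """Keep only URLs where at least one allow pattern matches
    anywhere in the URL (re.search semantics)."""
    compiled = [re.compile(p) for p in allow_patterns]
    return [u for u in urls if any(rx.search(u) for rx in compiled)]

# Hypothetical candidate links found on a category page.
candidates = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

print(filter_links(candidates, [r'Books']))
# Only the .../Books/ URL survives the allow filter.
```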
This code should work:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from dmoz.items import DmozItem
from scrapy.selector import HtmlXPathSelector

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'Books']), callback='parse_item')] 


    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()

        # Extract links
        item['link'] = hxs.select("//li/a/text()").extract()  # Xpath selector for tag(s)

        print item['link']

        for cont, i in enumerate(item['link']):
            print "link: ", cont, i
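One more detail worth flagging: the XPath //li/a/text() selects the anchor text, not the URLs, so item['link'] will hold link titles rather than addresses; for the actual links you would select //li/a/@href instead. A small sketch of the difference using only the standard library's html.parser (no Scrapy), run on a made-up snippet of HTML:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect both the href attribute and the visible text of <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []    # what //li/a/@href would give you
        self.texts = []    # what //li/a/text() gives you
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self.hrefs.append(dict(attrs).get("href", ""))

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:
            self.texts.append(data)

html = '<ul><li><a href="/Books/">Python Books</a></li></ul>'
collector = LinkCollector()
collector.feed(html)
print(collector.hrefs)  # ['/Books/']
print(collector.texts)  # ['Python Books']
```

Note also that the answer's imports reflect an older Scrapy release: on recent versions, scrapy.contrib and SgmlLinkExtractor have been removed, and the modern equivalents are scrapy.spiders.CrawlSpider, scrapy.spiders.Rule, and scrapy.linkextractors.LinkExtractor, with response.xpath(...) replacing HtmlXPathSelector.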
