Python spider for nested pages won't work


Tags: python, python-2.7, web-scraping, web-crawler, scrapy

The spider below won't crawl the website, and I wonder whether I'm using the wrong code to crawl multiple pages within the same site. Here is the code for TestScrpy.py:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor

class CraigslistSampleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    date = scrapy.Field()
    description=scrapy.Field()

class SiteSpider(CrawlSpider):
  name = "newscrap"
  #download_delay = 2
  allowed_domains = ['example.com']
  start_urls = ['http://example.com/page/1']
  items = {}


def parse(self, response):
    sel = Selector(response)
    #requests =[]
    brands = sel.xpath("//div[@class='thumb']")
    for brand in brands:
        item = CraigslistSampleItem()
        url = brand.xpath("./a/@href")[0].extract()
        item['url'] = brand.xpath("./a/@href")[0].extract()
        item ["title"] = brand.xpath("./a/@title").extract()
        item ["date"] = brands.select("//span/text()").extract()[counter]
        counter=counter+1
        request = Request(url,callback=self.parse_model, meta={'item':item})
        yield request

def parse_model(self, response):
    sel = Selector(response)
    models = sel.xpath("//*[@id='blocks-left']/div[1]/div/div[5]/p")
    for model in models:
        item = CraigslistSampleItem(response.meta["item"])
        item ['description'] = model.xpath("//*[@id='blocks-left']/div[1]/div/div[5]/p")[0].extract()
        yield item
The purpose of the program above is to read the title, url, and date from one page; then, using the url that was read, it should scrape an item's description from that url.
Can someone show me the logic for crawling nested pages within the same website? It would be very helpful if you could share a working example of a nested spider.

Since you asked for a working spider/crawler example, I'm sharing one with you. To me the crawling logic is simple, so it should be easy to follow.
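The two-level "nested" crawl the question asks about can be sketched without any framework: a listing-page parser yields a partial item plus a detail URL, and a detail-page step completes the item. The URLs and HTML below are hypothetical stand-ins; in a real spider the page lookup would be an HTTP request.

```python
# Framework-free sketch of a nested (two-level) crawl. PAGES stands in
# for the network: a listing page with div.thumb entries, and one
# detail page per item.
from html.parser import HTMLParser

PAGES = {
    "http://example.com/page/1": (
        '<div class="thumb"><a href="http://example.com/item/1" '
        'title="First">x</a><span>2015-01-01</span></div>'
    ),
    "http://example.com/item/1": '<p id="desc">A short description.</p>',
}

class LinkParser(HTMLParser):
    """Collect (href, title) pairs from anchors inside div.thumb."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._in_thumb = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "thumb":
            self._in_thumb = True
        elif tag == "a" and self._in_thumb:
            self.links.append((attrs.get("href"), attrs.get("title")))

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_thumb = False

def crawl(start_url):
    """Level 1: parse the listing page. Level 2: fetch each detail URL."""
    parser = LinkParser()
    parser.feed(PAGES[start_url])
    for url, title in parser.links:
        # Carrying the partial item into the second level is what
        # Scrapy's Request(..., meta={"item": item}) does.
        item = {"url": url, "title": title}
        item["description"] = PAGES[url]  # stand-in for the detail parse
        yield item

items = list(crawl("http://example.com/page/1"))
```

The essential design point is the hand-off: the first-level parse never finishes an item itself; it passes the partial item along to the second-level step that fills in the description.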

Your spider code has (at least) two errors:

  • You use a CrawlSpider with a parse callback. Don't do that, because the documentation says it won't work. Use a regular BaseSpider instead.

  • You don't indent the parse block under the class. Indent it so that parse and parse_model are methods of the spider.

  • Hi, I need a working example of crawling nested pages with Scrapy. What you shared uses BeautifulSoup. @user3128771, how so? Where exactly is the error? You need to share the URL you are crawling so the functionality can be checked.