Python spider for nested web pages doesn't work
The spider below won't crawl the site, and I want to know whether I'm using the wrong code to crawl multiple pages within the same site. Here is the code for TestScrpy.py:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector, Selector
from scrapy.http import Request  # Request is used in parse() below
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor


class CraigslistSampleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    date = scrapy.Field()
    description = scrapy.Field()


class SiteSpider(CrawlSpider):
    name = "newscrap"
    #download_delay = 2
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/page/1']
    items = {}

    def parse(self, response):
        sel = Selector(response)
        #requests = []
        brands = sel.xpath("//div[@class='thumb']")
        dates = sel.xpath("//span/text()").extract()
        # pair each thumbnail with the matching date by position
        for counter, brand in enumerate(brands):
            item = CraigslistSampleItem()
            url = brand.xpath("./a/@href")[0].extract()
            item['url'] = url
            item['title'] = brand.xpath("./a/@title").extract()
            item['date'] = dates[counter]
            # follow the detail page, carrying the partial item in meta
            request = Request(url, callback=self.parse_model, meta={'item': item})
            yield request

    def parse_model(self, response):
        sel = Selector(response)
        models = sel.xpath("//*[@id='blocks-left']/div[1]/div/div[5]/p")
        for model in models:
            item = CraigslistSampleItem(response.meta['item'])
            # use the node already selected, not the absolute path again
            item['description'] = model.extract()
            yield item
The purpose of the above program is to read the title, url, and date from one page; then, using the url it read, it should scrape the description of an item from that url. A sketch of this two-stage pattern follows.
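For reference, the hand-off the question describes is usually done exactly like this: fill part of the item on the listing page, pass it to the detail-page callback through `meta`, and complete it there. A minimal sketch of just that pattern, assuming a Scrapy 1.x API and placeholder URLs and XPaths (nothing here comes from the real site):

import scrapy


class PageItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()


class NestedSpider(scrapy.Spider):
    name = "nested_sketch"
    start_urls = ['http://example.com/page/1']  # placeholder URL

    def parse(self, response):
        # Stage 1: scrape the listing page and start one item per thumbnail.
        for thumb in response.xpath("//div[@class='thumb']"):
            item = PageItem()
            item['title'] = thumb.xpath("./a/@title").extract_first()
            item['url'] = thumb.xpath("./a/@href").extract_first()
            # Hand the half-filled item to the detail-page callback.
            yield scrapy.Request(response.urljoin(item['url']),
                                 callback=self.parse_detail,
                                 meta={'item': item})

    def parse_detail(self, response):
        # Stage 2: finish the item with data from the detail page.
        item = response.meta['item']
        item['description'] = response.xpath("//p/text()").extract_first()
        yield item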
Can someone show me the logic for crawling nested pages within the same site? It would be very helpful if you could share a working example of a nested spider.

Since you asked for a working spider/crawler example, I'll share one with you. To me the crawling logic is simple, so it should be easy to follow. Your spider code has (at least) two errors:
1. the `parse` block: a CrawlSpider implements its own logic in `parse`, so that method should not be overridden
2. the indentation

(A corrected sketch follows below the comments.)

Hi, I need a working example of crawling nested pages with Scrapy; what you shared uses BeautifulSoup.

@user3128771, how so? That is exactly where the errors are. You need to share the url being crawled so the functionality can be checked.
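Putting both fixes together, here is a minimal sketch of the kind of nested CrawlSpider the answer is pointing at: links are followed through Rules, and the callback is deliberately not named `parse`. The domain, start URL, and XPaths are placeholders, and the import paths are the modern ones (`scrapy.spiders`, `scrapy.linkextractors`) rather than the deprecated `scrapy.contrib` paths used in the question:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SiteCrawlSpider(CrawlSpider):
    name = "sitecrawl_sketch"
    allowed_domains = ['example.com']            # placeholder domain
    start_urls = ['http://example.com/page/1']   # placeholder start page

    # CrawlSpider discovers links through rules; the callback is named
    # parse_item, NOT parse, which CrawlSpider reserves for itself.
    rules = (
        # Follow pagination links without extracting anything from them.
        Rule(LinkExtractor(restrict_xpaths="//a[@rel='next']")),
        # Visit each thumbnail link and scrape the detail page.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='thumb']"),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Yielding a plain dict works in Scrapy 1.x and later.
        yield {
            'url': response.url,
            'title': response.xpath("//title/text()").extract_first(),
            'description': response.xpath(
                "//*[@id='blocks-left']//p/text()").extract_first(),
        }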