Python 3.x: How to crawl an entire website with Scrapy in Python 3 and scrape data from every page
I am trying to crawl a website and scrape some data from every page of it using Scrapy in Python 3. I have already managed to scrape the data for a single page by supplying that page's URL, but now I want to do the same for every page on the site. I think I am missing something, because my code does not follow the links, so no data gets scraped. I tried the code below, but it did not work. I am stuck here, please help. I am using Anaconda 3 with PyCharm.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QuotesSpider(CrawlSpider):
    name = "quotes"
    allowed_domains = ['meishij.net']
    start_urls = [
        'https://www.meishij.net/'
        # 'https://www.meishij.net/zuofa/huaguluobodunpaigutang.html'
    ]
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse(self, response):
        title = response.xpath('//*[(@id = "tongji_title")]/text()').extract_first()
        print(title)
        tags = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "yj_tags", " " ))]//a/text()').extract()
        print("Tags: ")
        print(tags)
        recipeDetails = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "small", " " ))]//text()').extract()
        author = response.xpath('//*[(@id = "tongji_author")]//text()').extract()
        print("Recipe details and author name: ")
        print(recipeDetails, author)
        description = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "materials", " " ))]//p/text()').extract()
        print("Recipe description: ")
        print(description)
        steps = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "content", " " ))]//p[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//text()').extract()
        print("Recipe steps: ")
        print(steps)
        # tips = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "cpc_h2", " " ))]//p/text()').extract_first()
        tips = response.css('.cpc_h2+p::text').extract()
        print("Recipe tips")
        print(tips)
        comments = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "p1", " " ))]//text()').extract()
        print("Comments")
        print(comments)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()
I have solved it. The problem was that the callback attribute in my Rule was not pointing at the parse method I had actually defined. I changed these lines:
rules = (
    Rule(LinkExtractor(allow=('zuofa')), callback='parse_web', follow=True),
)
where parse_web is the name of the parse method I defined.

Comments: What exactly isn't working? @AmiHollander I'm not getting any data. @AmiHollander It only returns empty square brackets; I don't understand what is wrong with it. @AmiHollander I think it only called the parse method for the first URL it fetched.
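For context, here is a minimal sketch of how the corrected spider could look, assuming the fix described above. The key point is that a CrawlSpider must not override parse(), because CrawlSpider uses parse() internally to apply the rules; the Rule's callback therefore points at a separately named method such as parse_web. The 'zuofa' URL filter and the parse_web name come from the answer, and the XPath expressions are reused from the question; the spider name, the yielded field names, and the user-agent string are illustrative assumptions.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RecipeSpider(CrawlSpider):
    # Hypothetical spider name; any unique string works.
    name = "recipes"
    allowed_domains = ['meishij.net']
    start_urls = ['https://www.meishij.net/']

    # Follow every link on the site, but only run the callback on recipe
    # pages (URLs containing 'zuofa'). The callback is NOT named 'parse',
    # since overriding parse() would disable CrawlSpider's rule handling.
    rules = (
        Rule(LinkExtractor(allow=('zuofa',)), callback='parse_web', follow=True),
    )

    def parse_web(self, response):
        # Same extraction logic as in the question, yielded as an item
        # instead of printed so Scrapy's feed exporters can collect it.
        yield {
            'url': response.url,
            'title': response.xpath('//*[@id="tongji_title"]/text()').extract_first(),
            'tags': response.xpath('//*[contains(@class, "yj_tags")]//a/text()').extract(),
        }


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (compatible; recipe-crawler)'  # illustrative value
    })
    process.crawl(RecipeSpider)
    process.start()

Narrowing the allow pattern matters here: with follow=True on a rule that matches every link, the crawl fans out across the whole site, so restricting the callback to 'zuofa' pages keeps the scraped output limited to actual recipe pages.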