
Python 3.x: How to crawl an entire website with Scrapy in Python 3 and scrape data from every page

Tags: python-3.x, web-scraping, scrapy, anaconda

I am trying to crawl a website and scrape some data from every page using Scrapy in Python 3. I can already scrape the data from a single page by giving its URL, but now I want to do this for every page on the site. I think I am missing something, because my code does not crawl beyond that, so no data gets scraped. I tried the code below, but it didn't work.
I'm stuck here, please help me.
I am using anaconda3 and PyCharm.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QuotesSpider(CrawlSpider):
    name = "quotes"
    allowed_domains = ['meishij.net']
    start_urls = [
        'https://www.meishij.net/'
        # 'https://www.meishij.net/zuofa/huaguluobodunpaigutang.html'
    ]
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse(self, response):
        title = response.xpath('//*[(@id = "tongji_title")]/text()').extract_first()
        print(title)
        tags = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "yj_tags", " " ))]//a/text()').extract()
        print("Tags: ")
        print(tags)

        recipeDetails = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "small", " " ))]//text()').extract()
        author = response.xpath('//*[(@id = "tongji_author")]//text()').extract()
        print("Recipe details and author name: ")
        print(recipeDetails, author)
        description = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "materials", " " ))]//p/text()').extract()
        print("Recipe description: ")
        print(description)
        steps = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "content", " " ))]//p[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]//text()').extract()
        print("Recipe steps: ")
        print(steps)

        # tips = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "cpc_h2", " " ))]//p/text()').extract_first()
        tips = response.css('.cpc_h2+p::text').extract()
        print("Recipe tips: ")
        print(tips)
        comments = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "p1", " " ))]//text()').extract()
        print("Comments: ")
        print(comments)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()

I got it working. The problem was that I never pointed the callback attribute at my parse method. I changed this line of code:

rules = (
    Rule(LinkExtractor(allow=('zuofa')), callback='parse_web', follow=True),
)

parse_web is the name of my parse method definition.
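
Putting the whole fix together, here is a minimal sketch of the corrected spider, trimmed to just the title and author selectors from the question (yielding a dict is only one way to hand the scraped values back to Scrapy). The allow='zuofa' pattern restricts the crawl to recipe pages, and the callback name parse_web no longer clashes with the parse() method that CrawlSpider uses internally:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QuotesSpider(CrawlSpider):
    name = "quotes"
    allowed_domains = ['meishij.net']
    start_urls = ['https://www.meishij.net/']

    # Follow only links whose URL contains 'zuofa' (the recipe pages) and
    # hand each downloaded page to parse_web().
    rules = (
        Rule(LinkExtractor(allow=('zuofa',)), callback='parse_web', follow=True),
    )

    # Named parse_web, not parse: CrawlSpider reserves parse() to drive the rules.
    def parse_web(self, response):
        title = response.xpath('//*[@id="tongji_title"]/text()').extract_first()
        author = response.xpath('//*[@id="tongji_author"]//text()').extract()
        yield {'title': title, 'author': author}


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()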

What exactly isn't working? – AmiHollander
@AmiHollander I don't get any data.
@AmiHollander It only returns empty square brackets. I can't see what's wrong with it.
@AmiHollander I think it only calls the parse method for the first URL.
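
For anyone who hits the same symptoms described in these comments: on a CrawlSpider, parse() is not a free name. CrawlSpider implements parse() itself as the entry point that applies the rules, so a spider that overrides it never follows any links; only the start_urls responses are processed, and selectors written for the recipe pages come back as empty lists on the homepage. A minimal sketch of the broken and the safe shapes (the spider names and example.com are placeholders):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BrokenSpider(CrawlSpider):
    # Defining parse() shadows CrawlSpider's own parse(), which is what
    # applies the rules, so no links are ever followed: only the start_urls
    # pages reach this method, and parse_page() never fires.
    name = 'broken'
    start_urls = ['https://example.com/']  # placeholder
    rules = (Rule(LinkExtractor(), callback='parse_page', follow=True),)

    def parse(self, response):
        yield {'url': response.url}

    def parse_page(self, response):
        yield {'url': response.url}


class FixedSpider(CrawlSpider):
    # Any callback name other than parse leaves the rule machinery intact,
    # so every extracted link is followed and handed to parse_page().
    name = 'fixed'
    start_urls = ['https://example.com/']  # placeholder
    rules = (Rule(LinkExtractor(), callback='parse_page', follow=True),)

    def parse_page(self, response):
        yield {'url': response.url}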