Python Scrapy spider function not executing

New to Python and crawling. For some reason the code below never executes the function when it is called — it doesn't even output the "test" print statement.

The main parse runs fine; it's only the call to the function that fails. I've tried calling it in many different ways, to no avail.

import scrapy
from myproject.items import MyHierarchyItem

class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = ['example.com']

    def parse(self, response):

        print("Starting parse_hierarchy")
        HierarchyItem = MyHierarchyItem() 
        StartLvl3URLS = []
        sitemap = response.css("div.sitemap-content > div.row")

        for lvl1 in sitemap:            
            HierarchyItem["hierarchy_lvl1_name"] = lvl1.css("h2::text").extract()
            #print(lvl1.css("h2::text").extract())
            currentlvl2 = lvl1.css("li.span-6")

            for lvl2 in currentlvl2:
                HierarchyItem["hierarchy_lvl2_name"] = lvl2.css("h4::text").extract()
                currentlvl3 = lvl2.css("ul.child > li")
                #print(lvl2.css("h4::text").extract())

                for lvl3 in currentlvl3:
                    #print(lvl3.css("a::text").extract())
                    #print(lvl3.css("a::attr(href)").extract())
                    HierarchyItem["hierarchy_lvl3_name"] = lvl3.css("a::text").extract()
                    HierarchyItem["hierarchy_url"] = lvl3.css("a::attr(href)").extract()
                    StartLvl3URLS.append(HierarchyItem["hierarchy_url"])
                    yield HierarchyItem

        full_link = StartLvl3URLS[0]
        #for lvl3 in StartLvl3URLS
        yield scrapy.Request(str(full_link), self.parse_category)

    def parse_category(self, response):
        print("test")
        print(len(response.body))
        print(response.body)
Log extract

2017-04-08 23:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.example.com/sitemap>
{'hierarchy_lvl1_name': ['cat1'],
 'hierarchy_lvl2_name': ['cat2'],
 'hierarchy_lvl3_name': ['cat3'],
 'hierarchy_url': ['http://www.example.com/cat1/cat2/cat3']}
2017-04-08 23:58:03 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-08 23:58:03 [scrapy.extensions.feedexport] INFO: Stored csv feed (445 items) in: hierarchy.csv
2017-04-08 23:58:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 205,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 24223,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 8, 13, 58, 3, 154254),
 'httpcache/hit': 1,
 'item_scraped_count': 445,
 'log_count/DEBUG': 447,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 8, 13, 58, 2, 614750)}
2017-04-08 23:58:03 [scrapy.core.engine] INFO: Spider closed (finished)

As far as I know, Scrapy does not write output from the print() method to its log.

You can do:

import logging
logging.info("message here")
logging.error("message here")
logging.warning("message here")

Also, disable JavaScript in your browser and open the website you are scraping, then check whether the div.sitemap-content > div.row selector returns any elements.

Found the problem: it was because I used extract(), whose output is a list, so I ended up with a list inside a list (with only one element) and the request never called the URL. Changing it to extract_first() made it work:

HierarchyItem["hierarchy_url"] = lvl3.css("a::attr(href)").extract_first()
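As a minimal standalone sketch of why the original request failed (the URL is an illustrative value matching the log extract above, not taken from the real site): extract() returns a list, and calling str() on that list yields the list's repr — brackets and quotes included — rather than a usable URL, whereas extract_first() returns the first element as a plain string.

```python
# What extract() returned: a list containing the href (illustrative value).
url_list = ["http://www.example.com/cat1/cat2/cat3"]

# The original code passed str(full_link) to scrapy.Request; on a list this
# produces the list's repr, which is not a valid URL.
broken = str(url_list)
print(broken)  # ['http://www.example.com/cat1/cat2/cat3']

# extract_first() instead returns the first element as a plain string,
# which is the form scrapy.Request expects.
fixed = url_list[0]
print(fixed)  # http://www.example.com/cat1/cat2/cat3
```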