
Python: getting the functionality of the scrapy crawl command when running a spider from a script


I have written a CrawlSpider in a Scrapy project that correctly scrapes data from a URL and pipelines the responses into a PostgreSQL table, but only when the scrapy crawl command is used. When the spider is run from a script in the project root, only the spider class's parse method seems to be called, because no table is created when the script is run with just the python command. I believe the problem is that the crawl command follows a specific protocol for locating and loading certain modules in the directories above the spider package (for example the models, pipelines and settings modules), and these are not loaded when the spider is run from a script.

I followed the instructions included there, but they do not seem to process the pipelined data after it has been scraped. This raises the question of whether I should even be trying to run the spider from a script, or whether I should somehow use the scrapy crawl command instead. The problem is that I plan to run the Scrapy spider from within a Django project when a user submits text in a form, which is what leads to my issue, and the answers given do not seem to address it. I also need to pass the text from the form into the spider URL (previously I just built the URL with raw_input). How should I run the spider properly?
I have the code for the script and the spider below if needed. Any help/code would be appreciated, thanks.
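(Editorial note on the form-text part of the question: one common option is to pass the value in as a spider argument instead of reading it with raw_input at import time. The sketch below is only an illustration; the __init__ override and the run_spider helper are assumptions, not part of the original code.)

# Sketch only: pass the band name in as a spider argument.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.contrib.spiders import CrawlSpider


class MySpider3(CrawlSpider):
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]

    def __init__(self, bandname=None, *args, **kwargs):
        super(MySpider3, self).__init__(*args, **kwargs)
        # Build the start URL from the argument supplied by the caller
        self.start_urls = [
            "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"
        ]

    # ... parse / parse_price / parse_json as in the spider file below ...


def run_spider(bandname):
    # Keyword arguments given to crawl() are forwarded to the spider's __init__
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider3, bandname=bandname)
    process.start()  # blocks until the crawl is finished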

Script file

from ticket_city_scraper import *
from ticket_city_scraper.spiders import tc_spider 

tc_spider.spiderCrawl()
Spider file

import scrapy
import re
import json
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from scrapy.contrib.spiders import CrawlSpider , Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from ticket_city_scraper.items import ComparatorItem
from urlparse import urljoin

bandname = raw_input("Enter bandname\n")
tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]

    start_urls = [tc_url]
    tickets_list_xpath = './/div[@class = "vevent"]'
    def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  
        self.start_urls = [tc_url]
        #return tc_url      

    tickets_list_xpath = './/div[@class = "vevent"]'

    def parse_json(self, response):
        loader = response.meta['loader']
        jsonresponse = json.loads(response.body_as_unicode())
        ticket_info = jsonresponse.get('B')
        price_list = [i.get('P') for i in ticket_info]
        if len(price_list) > 0:
            str_Price = str(price_list[0])
            ticketPrice = unicode(str_Price, "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        else:
            ticketPrice = unicode("sold out", "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        return loader.load_item()

    def parse_price(self, response):
        print "parse price function entered \n"
        loader = response.meta['loader']
        event_City = response.xpath('.//span[@itemprop="addressLocality"]/text()').extract() 
        eventCity = ''.join(event_City) 
        loader.add_value('eventCity' , eventCity)
        event_State = response.xpath('.//span[@itemprop="addressRegion"]/text()').extract() 
        eventState = ''.join(event_State) 
        loader.add_value('eventState' , eventState) 
        event_Date = response.xpath('.//span[@class="event_datetime"]/text()').extract() 
        eventDate = ''.join(event_Date)  
        loader.add_value('eventDate' , eventDate)    
        ticketsLink = loader.get_output_value("ticketsLink")
        json_id_list= re.findall(r"(\d+)[^-]*$", ticketsLink)
        json_id=  "".join(json_id_list)
        json_url = "https://www.ticketcity.com/Catalog/public/v1/events/" + json_id + "/ticketblocks?P=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
        yield scrapy.Request(json_url, meta={'loader': loader}, callback = self.parse_json, dont_filter = True) 

    def parse(self, response):
        """
        # """
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName' , './/span[@class="summary listingEventName"]/text()')
            loader.add_xpath('eventLocation' , './/div[@class="divVenue location"]/text()')
            loader.add_xpath('ticketsLink' , './/a[@class="divEventDetails url"]/@href')
            #loader.add_xpath('eventDateTime' , '//div[@id="divEventDate"]/@title') #datetime type
            #loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            print "Here is ticket link \n" + loader.get_output_value("ticketsLink")
            #sel.xpath("//span[@id='PractitionerDetails1_Label4']/text()").extract()
            ticketsURL = "https://www.ticketcity.com/" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback = self.parse_price, dont_filter = True)

def spiderCrawl():
   process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
   })
   process.crawl(MySpider3)
   process.start()

To answer your questions

  • Scrapy does not differentiate between the crawl command and crawling from a script.
  • The only part you are missing (and the difference) is:

  • The crawl command is always executed from inside the project directory, where the scrapy.cfg file is located. If you look at that file, it records where the settings file lives, and the settings file is where all project-specific settings are defined: cache policy, pipelines, header settings, proxy settings and so on. So when you use scrapy crawl, all of these settings are loaded internally.
  • When running Scrapy from a script, you only provide the location of the spider, so it executes without any of the custom settings from settings.py.
  • To make those settings take effect, create the crawler process object with the project settings:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()
    

    Thanks, that fixed the problem! Just out of curiosity, where did you find this code? I did not see it in the common practices section of the documentation, and there are some other things I would like the spider to do.

    It is all in the Scrapy documentation, just not in one place. Since the project was in beta and is in the public domain, do extensive searching, check the git issue list and other answers, and you will find useful things there. The documentation does have some information on this. If the other things you want to do are related to spider settings, there are more parameters than the ones available for global settings in the settings.py file, and you can set and use them through the function I mentioned above.

    I am now trying to run multiple spiders from two different projects from a script, so I moved the script to just above the project root directory, but now the responses are no longer being pipelined into the database table. Do changes have to be made to the provided code if the script is run from outside the project root?
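(Editorial note on that last comment: get_project_settings() relies on finding scrapy.cfg by walking up from the current directory, or on the SCRAPY_SETTINGS_MODULE environment variable, so a script outside the project root has to point Scrapy at the right project explicitly. A minimal sketch follows; the "tc_project" directory name and the ticket_city_scraper.settings module path are assumptions about the layout, adjust them to yours.)

import os
import sys

# Assumption: the Scrapy project root (the directory containing scrapy.cfg and
# the ticket_city_scraper package) sits next to this script under "tc_project".
project_root = os.path.join(os.path.dirname(os.path.abspath(__file__)), "tc_project")
sys.path.insert(0, project_root)

# Tell Scrapy which settings module to load, since there is no scrapy.cfg
# in the script's own directory.
os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "ticket_city_scraper.settings")

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from ticket_city_scraper.spiders.tc_spider import MySpider3

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider3)
process.start()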