Python: Scrapy spider's LinkExtractor not working when the spider is called from a script with os.system instead of from the command line


I'm getting some strange behaviour from my CrawlSpider that I can't explain, and any suggestions are appreciated! It is configured to run from a script based on alecxe's answer to this question:

Below is the script for my spider (sdcrawler.py). If I call it from the command line (e.g. "python sdcrawler.py 'myEGurl.com' 'http://www.myEGurl.com/testdomain' './outputfolder/' 'testdomain/'"), then the LinkExtractor happily follows the links on the page and enters the parse_item callback to process whatever links it finds. However, if I try to call exactly the same command from a Python script with os.system(), then for some pages (not all of them) the spider doesn't follow any links and never enters the parse_item callback. I can't get any output or error messages to understand why parse_item isn't called for those pages in this case. The print statements I added confirm that __init__ is definitely called, but then the spider just shuts down. I don't understand why, when I paste the exact "python sdcrawler.py ..." command that I use with os.system() into a terminal and run it, parse_item is called for exactly the same arguments.
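A quick sanity check (sketched below; these lines are not part of the spider script as posted): dump exactly what the child process receives and where it is running, since both the argument list and the working directory can differ between a manual terminal session and an os.system() call made from another script.

# Debugging sketch, not part of the original sdcrawler.py: put this at the very
# top of the spider script to see exactly what the child process is given.
import os
import sys

print "argv: %r" % (sys.argv,)      # stray whitespace or newlines show up in the repr
print "cwd:  %s" % os.getcwd()      # relative paths like './outputfolder/' depend on this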

CrawlSpider code:

import os
import re
import sys
import datetime

# Imports assumed for this snippet (pre-1.0 Scrapy API, to match the
# Crawler(settings)/crawler.configure() calls further down).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import signals
from twisted.internet import reactor


class SDSpider(CrawlSpider):
    name = "sdcrawler"

    # requires 'domain', 'start_page', 'folderpath' and 'sub_domain' to be passed as string arguments IN THIS PARTICULAR ORDER!!!
    def __init__(self):
        self.allowed_domains = [sys.argv[1]]
        self.start_urls = [sys.argv[2]]
        self.folder = sys.argv[3]
        try:
            os.stat(self.folder)
        except:
            os.makedirs(self.folder)
        sub_domain = sys.argv[4]
        self.rules = [Rule(LinkExtractor(allow=sub_domain), callback='parse_item', follow=True)]
        print settings['CLOSESPIDER_PAGECOUNT']
        super(SDSpider, self).__init__()


    def parse_item(self, response):
        # check for correctly formatted HTML page, ignores crap pages and PDFs
        print "entered parse_item\n"
        if re.search("<!\s*doctype\s*(.*?)>", response.body, re.IGNORECASE) or 'HTML' in response.body[0:10]:
            s = 1
        else:
            s = 0
        if response.url[-4:] == '.pdf':
            s = 0

        if s:
            filename = response.url.replace(":","_c_").replace(".","_o_").replace("/","_l_") + '.htm'
            if len(filename) > 255:
                filename = filename[0:220] + '_filename_too_long_' + str(datetime.datetime.now().microsecond) + '.htm'
            wfilename = self.folder + filename
            with open(wfilename, 'wb') as f:
                f.write(response.url)
                f.write('\n')
                f.write(response.body)
                print "i'm writing a html!\n"
                print response.url+"\n"
        else:
            print "s is zero, not scraping\n"

# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()
    print "spider closing\n"


# instantiate settings and provide a custom configuration
settings = Settings()

settings.set('DEPTH_LIMIT', 5)
settings.set('CLOSESPIDER_PAGECOUNT', 100)
settings.set('DOWNLOAD_DELAY', 3)
settings.set('USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko)')

# breadth-first crawl (depth-first is default, comment the below 3 lines out to run depth-first)
settings.set('DEPTH_PRIORITY', 1)
settings.set('SCHEDULER_DISK_QUEUE', 'scrapy.squeue.PickleFifoDiskQueue')
settings.set('SCHEDULER_MEMORY_QUEUE', 'scrapy.squeue.FifoMemoryQueue')

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = SDSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start the reactor (blocks execution)
reactor.run()

The script that builds the command and launches the spider with os.system():

def execute_spider(SDfile, homepageurl):

    folderpath = SDfile.rsplit('/',1)[0] + '/' 
    outputfolder = folderpath + 'htmls/'
    try:
        os.stat(outputfolder)
    except:
        os.makedirs(outputfolder)

    SDsvisited = folderpath + 'SDsvisited.txt'
    singlepagesvisited = folderpath + 'singlepagesvisited.txt'

    # convert all_subdomains.txt to a list of strings
    with open(SDfile) as f:
        sdlist1 = f.readlines()

    # remove duplicates from all_subdomains list
    sdlist = list(set(sdlist1))

    # set overall domain for this website, don't crawl outside their site (some of subdomains.txt will be external links)
    domain = homepageurl
    clean_domain = domain.split('.',1)[1]

    # process sdlist: only keep over-arching subdomains and strip out single pages to be processed in a different way 
    #seenSDs = []
    sdlistclean = []
    singlepagelist = []
    sdlist = sorted(sdlist)

    for item in sdlist:
        if item != '' and not item.isspace():
            if '.' in item.split('/')[-1]:
                if clean_domain in item:
                    singlepagelist.append(item)
            else:
                if item in sdlistclean:
                    pass
                else:
                    if clean_domain in item:
                        sdlistclean.append(item)

    # crawl cleaned subdomains and save html pages to outputfolder
    for item in sdlistclean:

        # check that you don't have a country multisite as your subdomain
        SDchk = item.split('/')[-2]
        if SDchk.isalpha() and len(SDchk) == 2 and SDchk != 'pr' and SDchk != 'PR' and SDchk != 'hr' and SDchk != 'HR':
            subdomain =  item.split('/')[-3]

        elif re.match(r'[A-Za-z]{2}-[A-Za-z]{2}', SDchk): #SDchk == 'en-US' or SDchk == 'en-UK':
            subdomain =  item.split('/')[-3]
        else:
            subdomain = item.split('/')[-2]

        cmd = 'python sdcrawler.py ' + '\'' + clean_domain +  '\' ' + '\'' + item  + '\' ' + '\'' + outputfolder + '\' '+ '\'' + subdomain + '/\''

        print cmd
        os.system(cmd)
I print cmd just before os.system(cmd), and if I simply copy that print output and run it in a separate terminal, the CrawlSpider executes as I would expect, visiting links and parsing them with the parse_item callback.
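One way to take shell quoting out of the picture entirely, as a sketch rather than the code actually used above, is to replace the os.system(cmd) call inside the loop with subprocess and an explicit argument list (reusing the same clean_domain, item, outputfolder and subdomain variables), so the spider receives the arguments verbatim with no shell tokenisation in between:

import subprocess

# Sketch of an alternative to os.system(cmd): pass the arguments as a list so
# no shell is involved and the child gets them exactly as given.
ret = subprocess.call(['python', 'sdcrawler.py',
                       clean_domain, item, outputfolder, subdomain + '/'])
print "sdcrawler.py exited with code %d" % ret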

The output of printing sys.argv is:

['sdcrawler.py', 'example.com', 'http://example.com/testdomain/', './outputfolder/', 'testdomain/']
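Since the item values interpolated into cmd come straight from f.readlines() (which keeps trailing newlines), a small check, again just a debugging sketch, is to tokenise the command string the same way a POSIX shell would and print the repr of each token, so any stray whitespace becomes visible and can be compared against the sys.argv output above:

import shlex

# Debugging sketch: show how the shell would split the command string built
# for os.system(), with repr() making hidden newlines or spaces visible.
print [repr(tok) for tok in shlex.split(cmd)]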

Can you show how exactly you run the script with os.system()? Also, what do you get if you print out sys.argv? Thanks.

Is it always the same domain / os.system() call that has the problem, or does it happen randomly?