Python: Scrapy CrawlSpider's LinkExtractor does not work when the spider is launched from a script via os.system() instead of from the command line
I'm getting some strange behaviour from my CrawlSpider that I can't explain; any suggestions are appreciated! It is configured to run from a script, following alecxe's answer to this question.

Below is my spider script (sdcrawler.py). If I call it from the command line (e.g. python sdcrawler.py 'myegurl.com' 'http://www.myegurl.com/testdomain' './outputfolder/' 'testdomain/'), the LinkExtractor follows the links on the page fine and enters the parse_item callback to process each link it finds. However, if I try to run exactly the same command from a Python script with os.system(), then for some pages (not all of them) the spider does not follow any links and never enters the parse_item callback. I can't get any output or error messages to understand why parse_item is not being called for these pages. The print statements I added confirm that __init__ is definitely called, but then the spider just shuts down. What I don't understand is that if I paste the very same 'python sdcrawler.py ...' command that I run with os.system() into a terminal and run it there, parse_item is called for exactly the same arguments.
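One thing worth checking is whether the shell tokenises the os.system() command string the same way an interactive terminal does. The standard-library shlex module can show the argv a POSIX shell would produce from the command string (a diagnostic sketch, not part of the original script; the URL and folder values are placeholders):

```python
import shlex

# The kind of command string that gets passed to os.system();
# the values here are placeholders.
cmd = ("python sdcrawler.py 'example.com' "
       "'http://example.com/testdomain/' './outputfolder/' 'testdomain/'")

# shlex.split mimics POSIX shell tokenisation, so from tokens[1] onward
# this list is what sys.argv should look like inside sdcrawler.py.
tokens = shlex.split(cmd)
print(tokens)
```

If the tokens printed here differ from what the spider reports for sys.argv, the problem is in quoting rather than in Scrapy.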
CrawlSpider code:
class SDSpider(CrawlSpider):
    name = "sdcrawler"

    # requires 'domain', 'start_page', 'folderpath' and 'sub_domain' to be
    # passed as string arguments IN THIS PARTICULAR ORDER!!!
    def __init__(self):
        self.allowed_domains = [sys.argv[1]]
        self.start_urls = [sys.argv[2]]
        self.folder = sys.argv[3]
        try:
            os.stat(self.folder)
        except:
            os.makedirs(self.folder)
        sub_domain = sys.argv[4]
        self.rules = [Rule(LinkExtractor(allow=sub_domain), callback='parse_item', follow=True)]
        print settings['CLOSESPIDER_PAGECOUNT']
        super(SDSpider, self).__init__()
    def parse_item(self, response):
        # check for a correctly formatted HTML page, ignores crap pages and PDFs
        print "entered parse_item\n"
        if re.search("<!\s*doctype\s*(.*?)>", response.body, re.IGNORECASE) or 'HTML' in response.body[0:10]:
            s = 1
        else:
            s = 0
        if response.url[-4:] == '.pdf':
            s = 0
        if s:
            filename = response.url.replace(":", "_c_").replace(".", "_o_").replace("/", "_l_") + '.htm'
            if len(filename) > 255:
                filename = filename[0:220] + '_filename_too_long_' + str(datetime.datetime.now().microsecond) + '.htm'
            wfilename = self.folder + filename
            with open(wfilename, 'wb') as f:
                f.write(response.url)
                f.write('\n')
                f.write(response.body)
            print "i'm writing a html!\n"
            print response.url + "\n"
        else:
            print "s is zero, not scraping\n"
# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?
    # stop the reactor
    reactor.stop()
    print "spider closing\n"
# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('DEPTH_LIMIT', 5)
settings.set('CLOSESPIDER_PAGECOUNT', 100)
settings.set('DOWNLOAD_DELAY', 3)
settings.set('USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko)')
# breadth-first crawl (depth-first is default, comment the below 3 lines out to run depth-first)
settings.set('DEPTH_PRIORITY', 1)
settings.set('SCHEDULER_DISK_QUEUE', 'scrapy.squeue.PickleFifoDiskQueue')
settings.set('SCHEDULER_MEMORY_QUEUE', 'scrapy.squeue.FifoMemoryQueue')
# instantiate a crawler passing in settings
crawler = Crawler(settings)
# instantiate a spider
spider = SDSpider()
# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)
# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()
# start the reactor (blocks execution)
reactor.run()
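As an aside, the URL-to-filename mangling in parse_item can be pulled out into a small helper, which makes the truncation rule easier to test on its own (a sketch; the replacement characters and the 220-character cutoff are taken from the code above):

```python
import datetime

def url_to_filename(url, max_len=255):
    """Turn a URL into a filesystem-safe .htm filename, mirroring the
    replace() chain used in parse_item."""
    filename = url.replace(":", "_c_").replace(".", "_o_").replace("/", "_l_") + '.htm'
    if len(filename) > max_len:
        # keep the first 220 characters and add a microsecond suffix so
        # two truncated names are unlikely to collide
        filename = (filename[0:220] + '_filename_too_long_'
                    + str(datetime.datetime.now().microsecond) + '.htm')
    return filename

print(url_to_filename('http://example.com/page'))
```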
def execute_spider(SDfile, homepageurl):
    folderpath = SDfile.rsplit('/', 1)[0] + '/'
    outputfolder = folderpath + 'htmls/'
    try:
        os.stat(outputfolder)
    except:
        os.makedirs(outputfolder)
    SDsvisited = folderpath + 'SDsvisited.txt'
    singlepagesvisited = folderpath + 'singlepagesvisited.txt'
    # convert all_subdomains.txt to a list of strings
    with open(SDfile) as f:
        sdlist1 = f.readlines()
    # remove duplicates from the all_subdomains list
    sdlist = list(set(sdlist1))
    # set the overall domain for this website, don't crawl outside their site
    # (some of subdomains.txt will be external links)
    domain = homepageurl
    clean_domain = domain.split('.', 1)[1]
    # process sdlist: only keep over-arching subdomains and strip out single
    # pages to be processed in a different way
    #seenSDs = []
    sdlistclean = []
    singlepagelist = []
    sdlist = sorted(sdlist)
    for item in sdlist:
        if item != '' and not item.isspace():
            if '.' in item.split('/')[-1]:
                if clean_domain in item:
                    singlepagelist.append(item)
            else:
                if item in sdlistclean:
                    pass
                else:
                    if clean_domain in item:
                        sdlistclean.append(item)
    # crawl cleaned subdomains and save html pages to outputfolder
    for item in sdlistclean:
        # check that you don't have a country multisite as your subdomain
        SDchk = item.split('/')[-2]
        if SDchk.isalpha() and len(SDchk) == 2 and SDchk != 'pr' and SDchk != 'PR' and SDchk != 'hr' and SDchk != 'HR':
            subdomain = item.split('/')[-3]
        elif re.match(r'[A-Za-z]{2}-[A-Za-z]{2}', SDchk):  # e.g. SDchk == 'en-US' or SDchk == 'en-UK'
            subdomain = item.split('/')[-3]
        else:
            subdomain = item.split('/')[-2]
        cmd = 'python sdcrawler.py ' + '\'' + clean_domain + '\' ' + '\'' + item + '\' ' + '\'' + outputfolder + '\' ' + '\'' + subdomain + '/\''
        print cmd
        os.system(cmd)
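One difference between the two situations that may be worth ruling out: readlines() keeps the trailing '\n' on each line, and item is interpolated into cmd before the shell sees it, so the newline ends up inside the quoted URL argument. A command you copy into a terminal by hand never contains that hidden newline. A quick sketch of the effect, and of stripping the lines on read (the file contents here are hypothetical):

```python
# Hypothetical contents of an all_subdomains.txt-style file, as
# readlines() would return them, trailing newlines included.
lines = ["http://example.com/testdomain/\n", "http://example.com/other/\n"]

# A command built from the raw line embeds '\n' inside the quoted URL:
item = lines[0]
cmd = "python sdcrawler.py 'example.com' '" + item + "' './outputfolder/'"
print(repr(cmd))  # note the \n inside the quoted argument

# Stripping on read avoids the problem:
clean = [line.strip() for line in lines]
print(clean[0])
```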
I print cmd just before os.system(cmd), and if I simply copy that printed output and run it in a separate terminal, the CrawlSpider executes as I would expect, visiting the links and parsing them with the parse_item callback.

The output of printing sys.argv is:

['sdcrawler.py', 'example.com', 'http://example.com/testdomain/', './outputfolder/', 'testdomain/']
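If the goal is to rule out shell quoting altogether, the os.system() call in execute_spider could be replaced with subprocess and an explicit argument list, so the child's sys.argv is exactly the list that is built, with no shell in between (a sketch; build_crawler_argv is a hypothetical helper and the values are placeholders):

```python
import subprocess

def build_crawler_argv(clean_domain, item, outputfolder, subdomain):
    # each element becomes one sys.argv entry in sdcrawler.py,
    # so no shell quoting rules apply
    return ['python', 'sdcrawler.py', clean_domain, item, outputfolder, subdomain + '/']

argv = build_crawler_argv('example.com', 'http://example.com/testdomain/',
                          './outputfolder/', 'testdomain')
print(argv)
# subprocess.call(argv) would then replace os.system(cmd)
```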
Can you show how you run the script with os.system()? Also, what do you get if you print out sys.argv? Thanks.

Is it always the same domain/os.system() call that has the problem, or does it happen randomly?