Python Scrapy: my code only parses 1 URL
I am trying to parse every URL on my site that contains "133199". Unfortunately, my code only parses a single URL out of the whole site; there should be more than 20k of them. The code below crawls the entire site correctly, and somehow parses the first URL containing 133199, but none of the rest:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website

class mydomainSpider(CrawlSpider):
    name = "activewear"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), deny=('/[1-9]$', '(bti=)[1-9]+(?:\.[1-9]*)?', '(sort_by=)[a-zA-Z]', '(sort_by=)[1-9]+(?:\.[1-9]*)?', '(ic=32_)[1-9]+(?:\.[1-9]*)?', '(ic=60_)[0-9]+(?:\.[0-9]*)?', '(search_sort=)[1-9]+(?:\.[1-9]*)?', 'browse-ng.do\?', '/page/', '/ip/', 'out\+value', 'fn=', 'customer_rating', 'special_offers', 'search_sort=&'))),
        Rule(SgmlLinkExtractor(allow=('133199',)), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []
        for site in sites:
            item = Website()
            item['referer'] = response.request.headers.get('Referer')
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract()
            item['response'] = response.status
            items.append(item)
        return items
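The spider's behavior follows from how CrawlSpider dispatches links: for each extracted link, the first rule whose extractor accepts it wins, and later rules never see that link. A minimal stdlib sketch (the URLs are made up for illustration, and the deny list is abbreviated) shows why a broad first rule with no callback swallows most of the 133199 links:

```python
import re

# Simplified model of CrawlSpider rule dispatch: for each link, the
# FIRST rule whose extractor matches wins; later rules never see it.
# An empty "allow" list means "allow everything", mirroring the spider above.
rules = [
    {"allow": [], "deny": [r"/ip/", r"/page/"], "callback": None},  # broad rule, no callback
    {"allow": [r"133199"], "deny": [], "callback": "parse_items"},  # rarely reached
]

def matches(rule, url):
    """Does this rule's extractor accept the URL?"""
    if any(re.search(d, url) for d in rule["deny"]):
        return False
    if not rule["allow"]:  # empty allow list accepts everything
        return True
    return any(re.search(a, url) for a in rule["allow"])

def dispatch(url):
    """Return the callback of the first matching rule (None = followed silently)."""
    for rule in rules:
        if matches(rule, url):
            return rule["callback"]
    return None

# A URL containing 133199 is captured by the first rule, so it is
# followed but parse_items is never called for it:
print(dispatch("http://www.mydomain.com/cp/133199"))       # None
# Only a URL rejected by the first rule's deny list falls through
# to the second rule:
print(dispatch("http://www.mydomain.com/ip/item-133199"))  # parse_items
```

This is only a model of the dispatch order, not of Scrapy's full link extraction, but it reproduces the symptom: almost every 133199 URL is consumed by the deny-only rule first.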
This is the only URL in my console log that gets parsed. The site has several million pages, so I cannot post the entire log:
Scraped from <200 http://www.mydomain.com/browse/apparel/5438?_refineresult=true&facet=special_offers%3AClearance&ic=32_0&path=0%3A5438&povid=cat133199-env200983-moduleC052312-lLinkSubnav1Clearance>
{'canonical': [u'http://www.mydomain.com/browse/apparel/5438/'],
'description': [u"Shop for Apparel - mydomain.com. Buy products such as Disney Girls' Minnie Mouse 2 Piece Pajama Coat Set at mydomain and save."],
'referer': 'http://www.mydomain.com/cp/133199',
'response': 200,
'title': [u'\nApparel - mydomain.com\n'],
'url': 'http://www.mydomain.com/browse/apparel/5438?_refineresult=true&facet=special_offers%3AClearance&ic=32_0&path=0%3A5438&povid=cat133199-env200983-moduleC052312-lLinkSubnav1Clearance'}
2013-12-20 09:45:54-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/Cats/202073?povid=P1171-C1110.2784+1455.2776+1115.2956-L440> (referer: http://www.mydomain.com/)
2013-12-20 09:45:54-0800 [activewear] DEBUG: Redirecting (301) to <GET http://www.mydomain.com/browse/pets/birds/5440_228734/?amp;ic=48_0&ref=243033.244996&catNavId=5440&povid=P1171-C1110.2784+1455.2776+1115.2956-L439> from <GET http://www.mydomain.com/browse/Birds/_/N-591gZaq90Zaqce/Ne-57ix?amp%3Bic=48_0&%3Bref=243033.244996&%3Btab_All=&catNavId=5440&povid=P1171-C1110.2784+1455.2776+1115.2956-L439>
2013-12-20 09:45:54-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/team-sports/soccer/4125_4161_432196?povid=P1171-C1110.2784+1455.2776+1115.2956-L277> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/sports-outdoors/golf/4125_4152?povid=P1171-C1110.2784+1455.2776+1115.2956-L276> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/team-sports/football/4125_4161_434036?povid=P1171-C1110.2784+1455.2776+1115.2956-L275> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/1164750?povid=P1171-C1110.2784+1455.2776+1115.2956-L362> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/gifts-registry/specialty-gift-cards/1094765_96894_972339?povid=P1171-C1110.2784+1455.2776+1115.2956-L361> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/pet-supplies/5440?povid=P1171-C1110.2784+1455.2776+1115.2956-L438> (referer: http://www.mydomain.com/)
Comments:

Can you share your console log? Also, the order of the rules matters: "If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute." So you could try swapping the order of the two rules and see if anything changes.

@paul I spent some time changing the rule order. When I put the callback rule first, i.e. `Rule(SgmlLinkExtractor(allow=('133199',)), callback='parse_items', follow=True)` followed by the deny rule, it doesn't parse anything at all.
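Rather than juggling rule order, one conventional fix (a sketch, not tested against this site) is to merge the two rules into a single one, so the deny list and the allow pattern are applied by the same extractor and every surviving 133199 link reaches the callback: `Rule(SgmlLinkExtractor(allow=('133199',), deny=(...)), callback='parse_items', follow=True)`. The filtering such a merged rule performs can be checked with plain `re` (deny list abbreviated from the question):

```python
import re

# Deny patterns copied (abbreviated) from the spider in the question;
# allow pattern as in the question. A link is parsed only if it survives
# the deny list AND matches the allow pattern -- the semantics of one
# merged Rule with both allow= and deny= set.
DENY = [r'/page/', r'/ip/', r'(sort_by=)[a-zA-Z]', r'browse-ng\.do\?']
ALLOW = [r'133199']

def link_is_parsed(url):
    """Would a merged allow+deny extractor hand this URL to the callback?"""
    if any(re.search(d, url) for d in DENY):
        return False
    return any(re.search(a, url) for a in ALLOW)

# A category URL like the one actually scraped in the log passes:
print(link_is_parsed("http://www.mydomain.com/cp/133199"))  # True
# A denied path is dropped even though it contains 133199:
print(link_is_parsed("http://www.mydomain.com/ip/133199"))  # False
```

If some 133199 URLs also match a deny pattern (e.g. product pages under `/ip/`), they will be dropped by the merged rule too, which could explain why swapping the rule order parsed nothing: the pattern set itself may be excluding the pages you want.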