Scrapy SgmlLinkExtractor doesn't work
I am trying to scrape this page, and I have the following rule:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r'storeId='),
        callback="parse_item"
    ),
)
There are 16 matching links on the page, but this rule only finds 13 of them. If I save the page locally and run against that copy, it finds all 16.
This is driving me crazy. What is wrong with this web page?

Answer: you can use another link extractor, RegexLinkExtractor, instead of SgmlLinkExtractor:
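A plausible explanation (an assumption; the real page's HTML is not shown here) is that SgmlLinkExtractor's SGML parser chokes on malformed markup and silently drops some anchors, while a regex pass over the raw text still sees them. A minimal stdlib sketch of that failure mode, using a contrived broken snippet:

```python
import re
from html.parser import HTMLParser

# Contrived, hypothetical markup: the middle anchor sits inside a malformed
# <option> tag, so a tag-based parser swallows the stray '<a' as an attribute.
# The exact breakage on the real page is unknown; this only illustrates how
# strict tag parsing can silently drop links that a regex pass still finds.
BROKEN_HTML = """
<a href="store_details.jsp?storeId=1626">VIEW STORE DETAILS</a>
<select>
<option value="x"<a href="store_details.jsp?storeId=3183">VIEW STORE DETAILS</a>
</select>
<a href="store_details.jsp?storeId=1632">VIEW STORE DETAILS</a>
"""

class HrefCollector(HTMLParser):
    """Collect href attributes from well-formed <a> start tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href")

collector = HrefCollector()
collector.feed(BROKEN_HTML)

# A regex pass over the raw text still finds every storeId link.
regex_hrefs = re.findall(r'href="([^"]*storeId=[^"]*)"', BROKEN_HTML)

print(len(collector.hrefs), len(regex_hrefs))  # -> 2 3
```

Here the parser reports the `<a>` for storeId=3183 as part of the broken `<option>` tag and never fires `handle_starttag("a", ...)` for it, which mirrors how a tag-based extractor can undercount links on a page a regex-based one handles fine.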
paul@machine:~$ scrapy shell "http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results"
...
2014-06-11 15:49:43+0000 [default] INFO: Spider opened
2014-06-11 15:49:43+0000 [default] DEBUG: Crawled (200) <GET http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x2961210>
[s] item {}
[s] request <GET http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results>
[s] response <200 http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x2fab5d0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: from scrapy.contrib.linkextractors.regex import RegexLinkExtractor
In [2]: lx = RegexLinkExtractor(allow=r'storeId=')
In [3]: lx.extract_links(response)
Out[3]:
[Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1626', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=3183', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1632', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1627', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1628', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1642', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1641', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1623', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1634', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1625', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1630', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=2176', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1619', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1622', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1599', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1636', text=u'VIEW STORE DETAILS', fragment='', nofollow=False)]
In [4]: len(lx.extract_links(response))
Out[4]: 16
In [5]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [6]: lx = SgmlLinkExtractor(allow=r'storeId=')
In [7]: len(lx.extract_links(response))
Out[7]: 13
In [8]:
Comments: I don't see any links containing storeId; I get a "page not found". Are you sure about the link? — @paultrmbrth Yes, the link does work. You may want to use a web proxy, since some sites are unavailable depending on your location. — Indeed, I checked through a US proxy: SgmlLinkExtractor finds fewer links than a simple check with XPath on the response.
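The "simple check with XPath" mentioned in the comment can be approximated outside Scrapy with a plain pattern count over the raw HTML, to compare against what an extractor returns. A stdlib sketch (the sample markup below is made up for illustration):

```python
import re

def count_store_links(html_text):
    """Count anchors whose href contains 'storeId=' -- a stdlib stand-in
    for the XPath check //a[contains(@href, "storeId=")]."""
    return len(re.findall(r'<a\b[^>]*\bhref="[^"]*storeId=[^"]*"', html_text))

# Hypothetical sample: two store-detail links and one unrelated link.
sample = (
    '<a href="/custserv/store_details.jsp?storeId=1626">VIEW STORE DETAILS</a>'
    '<a href="/custserv/store_details.jsp?storeId=3183">VIEW STORE DETAILS</a>'
    '<a href="/help.jsp">HELP</a>'
)
print(count_store_links(sample))  # -> 2
```

If this raw count exceeds `len(lx.extract_links(response))`, the extractor's HTML parser is likely dropping links rather than the page genuinely lacking them.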