Scrapy link extractor doesn't work

I'm trying to scrape this page.

I have one rule:

rules = (
    Rule(
        SgmlLinkExtractor(allow=r'storeId='),
        callback="parse_item"
    ),
)
The page has 16 links, but this rule finds only 13 of them. If I save the page locally and run the extractor against the saved copy, it finds all 16.


This is driving me crazy. What's wrong with this page?

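You can use another link extractor, such as RegexLinkExtractor, instead of SgmlLinkExtractor. In the scrapy shell session below, RegexLinkExtractor returns all 16 links for this page, while SgmlLinkExtractor returns only 13: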

paul@machine:~$ scrapy shell "http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results"
...
2014-06-11 15:49:43+0000 [default] INFO: Spider opened
2014-06-11 15:49:43+0000 [default] DEBUG: Crawled (200) <GET http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x2961210>
[s]   item       {}
[s]   request    <GET http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA#results>
[s]   response   <200 http://www.stevemadden.com/custserv/locate_store.cmd?useCurrentLocation=yes&findUSStore=no&findAllStore=false&radius=0&countryCode=CA>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x2fab5d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: from scrapy.contrib.linkextractors.regex import RegexLinkExtractor

In [2]: lx = RegexLinkExtractor(allow=r'storeId=')

In [3]: lx.extract_links(response)
Out[3]: 
[Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1626', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=3183', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1632', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1627', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1628', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1642', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1641', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1623', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1634', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1625', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1630', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=2176', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1619', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1622', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1599', text=u'VIEW STORE DETAILS', fragment='', nofollow=False),
 Link(url='http://www.stevemadden.com/custserv/store_details.jsp?storeId=1636', text=u'VIEW STORE DETAILS', fragment='', nofollow=False)]

In [4]: len(lx.extract_links(response))
Out[4]: 16

In [5]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [6]: lx = SgmlLinkExtractor(allow=r'storeId=')

In [7]: len(lx.extract_links(response))
Out[7]: 13

In [8]: 
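
As a rough, independent cross-check of the counts above, the matching links can also be counted directly in the raw HTML with nothing but the standard library. This is only a sketch of the regex-based idea, not how RegexLinkExtractor is implemented: the helper name find_links and the sample markup are made up for illustration, and the sample is not the real page.

```python
import re
from urllib.parse import urljoin

# Loose pattern for href="..." or href='...' inside an <a ...> tag.
# It works on the raw text, so it does not depend on how a parser
# recovers from sloppy markup.
HREF_RE = re.compile(r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def find_links(html, base_url, allow):
    """Return de-duplicated absolute URLs whose address matches `allow`."""
    found = []
    for href in HREF_RE.findall(html):
        url = urljoin(base_url, href)  # resolve relative hrefs
        if re.search(allow, url) and url not in found:
            found.append(url)
    return found


# Hypothetical sample markup (NOT the real page). The second anchor sits
# inside unclosed tags and uses single quotes and an unquoted attribute.
SAMPLE_HTML = """
<div><a href="/custserv/store_details.jsp?storeId=1626">VIEW STORE DETAILS</a></div>
<div><p><a class=btn href='/custserv/store_details.jsp?storeId=3183'>VIEW STORE DETAILS</a>
<a href="/custserv/contact.jsp">CONTACT</a>
"""

links = find_links(
    SAMPLE_HTML,
    "http://www.stevemadden.com/custserv/locate_store.cmd",
    r"storeId=",
)
print(len(links))
for link in links:
    print(link)
```

If a count obtained this way on the downloaded page disagrees with what a link extractor returns, that suggests the extractor's parser is losing some anchors rather than the allow pattern being wrong.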
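I don't see any links containing storeId. I get a "page not found". Are you sure about the link?

@paultrmbrth Yes, the link does work. You may want to use a web proxy, because depending on your location some sites may not be available.

Indeed, I checked through a US proxy: SgmlLinkExtractor finds fewer links than a simple XPath check on the response.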