Python Scrapy: parsing JavaScript


I have a JavaScript snippet on the page that looks like this:

new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",
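I want to get "185310341". I have been searching on Google for hours without finding anything, and I hope you can help. How can I scrape the JavaScript and get the id?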

I tried this code:

id = sel.search('"id":(.*?),',text).group(1)
print id
But I got:

exceptions.AttributeError: 'Selector' object has no attribute 'search'
For regular expressions, Scrapy selectors provide the .re() method:

sel.xpath('<xpath_to_find_the_element_text>').re(r'"id":(\d+)')
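
To make the behaviour concrete, here is a minimal standard-library sketch of the idea behind that call: apply one pattern to every selected script body and collect the captured groups. It is not Scrapy's implementation. The `find_ids` helper and the `scripts` list are hypothetical stand-ins for what `response.xpath('//script/text()').extract()` would return.

```python
import re

# Allow optional whitespace after the colon; capture digits only.
ID_PATTERN = re.compile(r'"id":\s*(\d+)')


def find_ids(script_texts):
    """Collect every captured id from a list of script bodies, in order."""
    ids = []
    for text in script_texts:
        # With a single capture group, findall returns the group strings.
        ids.extend(ID_PATTERN.findall(text))
    return ids


# Hypothetical script bodies, modelled on the snippet in the question.
scripts = [
    'var unrelated = 1;',
    'new Shopify.OptionSelectors("product-select", '
    '{ product: {"id":185310341,"title":"10. Design"} });',
]

ids = find_ids(scripts)
# Guard against pages where nothing matches instead of indexing blindly.
product_id = int(ids[0]) if ids else None
print(product_id)
```

On the real page the same pattern would also match the variant ids, so taking the first match relies on the product id appearing before them.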

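An alternative to the regex approach is to use a JavaScript parser, convert the parser's output to an XML document, and query that document with XPath.

This is what js2xml implements; it is built on a JavaScript parser and lxml (disclaimer: I wrote js2xml; warning: it is unstable).

In your case, use js2xml.jsonlike.getall(); see this sample scrapy shell session: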

Using the re module directly on the script text also works:

>>> import re
>>> s = 'new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",'
>>> re.search(r'"id":(\d+)', s).group(1)
'185310341'

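
The regex returns the id as a string and nothing else. When the object literal after `product:` happens to be valid JSON, as it is in the snippet from the question, the whole object can be decoded with the standard library. The sketch below is not taken from the answers above: `extract_product` and the shortened `script_text` are hypothetical, and the approach breaks if the site emits a JavaScript literal that is not strict JSON.

```python
import json
import re

# Hypothetical, shortened script text modelled on the question's snippet;
# the real page embeds a much larger product object.
script_text = (
    'new Shopify.OptionSelectors("product-select", { product: '
    '{"id":185310341,"title":"10. Design | Siyah \\u0026 beyaz kalpli"}, '
    'onVariantSelected: selectCallback });'
)


def extract_product(js_source):
    """Return the object literal that follows `product:` as a dict, or None."""
    # Consume trailing whitespace so decoding starts exactly at the "{".
    match = re.search(r'product:\s*', js_source)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # stops there, so the JavaScript after the object is ignored.
    product, _end = json.JSONDecoder().raw_decode(js_source, match.end())
    return product


product = extract_product(script_text)
print(product["id"])
print(product["title"])
```

This gives an integer id and makes other fields such as the title available without writing further patterns.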
paul:~$ scrapy shell http://2loom.com/products/2loom-design-siyah-beyaz-kalpli
2014-05-19 16:12:00+0200 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
2014-05-19 16:12:00+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-05-19 16:12:00+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled item pipelines: 
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-19 16:12:00+0200 [default] INFO: Spider opened
2014-05-19 16:12:01+0200 [default] DEBUG: Crawled (200) <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f8552946610>
[s]   item       {}
[s]   request    <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s]   response   <200 http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x7f8552384b90>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
/usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.
  warn("The top-level `frontend` package has been deprecated. "

In [1]: scripts = response.selector.xpath('//script/text()').extract()

In [2]: import js2xml, js2xml.jsonlike

In [3]: js = js2xml.parse(scripts[-1])

In [4]: js2xml.jsonlike.getall(js)
Out[4]: 
[{'onVariantSelected': 'selectCallback',
  'product': {'available': True,
   'compare_at_price': None,
   'compare_at_price_max': 0,
   'compare_at_price_min': 0,
   'compare_at_price_varies': False,
   'content': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
   'created_at': '2013-11-29T13:37:11+02:00',
   'description': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
   'featured_image': '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
   'handle': '2loom-design-siyah-beyaz-kalpli',
   'id': 185310341,
   'images': ['//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
    '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwarte_hartjes_ak_girls.jpg?v=1389259259',
    '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_boys.jpg?v=1389259264',
    '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwartje_hartjes_ak_boys.jpg?v=1389259264'],
   'options': ['Size'],
   'price': 15900,
   'price_max': 15900,
   'price_min': 15900,
   'price_varies': False,
   'published_at': '2013-11-29T13:34:20+02:00',
   'tags': [u'2\xb7Loom',
    'Beyaz',
    'Design',
    'Ekrek',
    u'Kad\u0131n',
    'Kalpli',
    'Lacivert'],
   'title': '10. Design | Siyah & beyaz kalpli',
   'type': '2 Loom Limiteds',
   'variants': [{'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584985,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 3,
     'option1': 'XS (34-36: 1.60m-1.70m)',
     'option2': None,
     'option3': None,
     'options': ['XS (34-36: 1.60m-1.70m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-XS',
     'taxable': True,
     'title': 'XS (34-36: 1.60m-1.70m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584989,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 3,
     'option1': 'S (36-38: 1.65m-1.75m)',
     'option2': None,
     'option3': None,
     'options': ['S (36-38: 1.65m-1.75m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-S',
     'taxable': True,
     'title': 'S (36-38: 1.65m-1.75m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584997,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 7,
     'option1': 'M (38-40: 1.70m-1.80m)',
     'option2': None,
     'option3': None,
     'options': ['M (38-40: 1.70m-1.80m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-M',
     'taxable': True,
     'title': 'M (38-40: 1.70m-1.80m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424585001,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 7,
     'option1': 'L (40-42: 1.75m-1.85m)',
     'option2': None,
     'option3': None,
     'options': ['L (40-42: 1.75m-1.85m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-L',
     'taxable': True,
     'title': 'L (40-42: 1.75m-1.85m)',
     'weight': 0}],
   'vendor': u'2\xb7Loom'}}]

In [5]: