Scrapy shell XPath returns an empty list on itunes.apple.com


scrapy shell 'https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4'

I want to get the album title "no tears left to cry - Single" from this page.

The XPath of the album name is:

//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1

I tried

response.xpath('//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1')

but the result is

[]

How can I get the album information from this site?

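This happens because Scrapy does not wait for JavaScript to run, so the element you see in the browser is not in the HTML that Scrapy downloads. To get the rendered page you need scrapy-splash.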
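As a rough sketch of what enabling scrapy-splash involves, the project's settings.py would contain something like the fragment below. The values follow the scrapy-splash README as I recall it, and SPLASH_URL assumes a Splash instance listening on localhost:8050 (the address that appears in the log in this answer); adjust both to your setup.

```python
# settings.py fragment (assumption: Splash is running locally on port 8050)
SPLASH_URL = 'http://localhost:8050'

# Middlewares that route requests through Splash and handle its cookies
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoids storing duplicate Splash arguments in the request queue
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

With these in place, a spider would typically yield scrapy_splash.SplashRequest objects instead of plain scrapy.Request objects.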

With scrapy-splash I get this result:

2018-06-30 20:50:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4%27 via http://localhost:8050/render.html> (referer: None)
2018-06-30 20:50:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4%27>
{'title': 'no tears left to cry - Single'}

You can also go through Splash from scrapy shell:

scrapy shell 'http://localhost:8050/render.html?url=https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4'

In [2]: response.xpath('//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1//text()').extract_first()
Out[2]: 'no tears left to cry - Single'
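One caveat with pasting the target URL straight into the query string: its own `&v0=...` and `&l=...` parameters may be read by Splash as arguments to render.html rather than as part of the page URL. A minimal sketch of building the address with percent-encoding, using only the standard library, is below. The helper name `splash_url` and the `wait` value are illustrative choices, not something from the answer.

```python
from urllib.parse import urlencode

# Splash endpoint as it appears in the log above
SPLASH_RENDER = 'http://localhost:8050/render.html'


def splash_url(target, wait=0.5):
    """Return a render.html URL with the target page percent-encoded."""
    # urlencode escapes '&', '?', '=' and '%' inside the target URL,
    # so its query string stays attached to the 'url' argument.
    return SPLASH_RENDER + '?' + urlencode({'url': target, 'wait': wait})


album = ('https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537'
         '?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4')

print(splash_url(album))
```

The printed address can then be passed, in quotes, to scrapy shell in place of the hand-written one.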

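You'd better avoid JS rendering: it is really slow, heavy, and buggy. Spend 5 minutes in Chrome's Network tab looking for the data source. The data is usually either embedded in the page source or delivered via XHR requests.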

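In this case all the data you need is available on the page itself, but you should inspect its source, not the rendered version. In Chrome, press Ctrl+U to open the page source, then use Ctrl+F to find all the parts you need.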

import json

# Inside the spider's parse() callback: read the JSON embedded in the page source
track_data = response.xpath('//script[@name="schema:music-album"]/text()').extract_first()
track_json = json.loads(track_data)
track_title = track_json['name']
yield {'title': track_title}
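To try the same idea without a Scrapy project, here is a self-contained sketch that swaps the XPath selector for the standard library's HTMLParser. The SAMPLE markup is made up to resemble the page source and is not copied from iTunes; the helper also returns None when the script tag is missing, a case the snippet above would fail on because json.loads would receive None.

```python
import json
from html.parser import HTMLParser


class ScriptJSONParser(HTMLParser):
    """Collect the text of <script> tags whose name attribute matches."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('name') == self.name:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def album_title(html):
    """Return the album name from the embedded JSON, or None if absent."""
    parser = ScriptJSONParser('schema:music-album')
    parser.feed(html)
    raw = ''.join(parser.chunks).strip()
    if not raw:
        return None
    return json.loads(raw).get('name')


# Hypothetical page source, shaped like what Ctrl+U might show
SAMPLE = '''<html><head>
<script name="schema:music-album" type="application/ld+json">
{"@type": "MusicAlbum", "name": "no tears left to cry - Single"}
</script></head><body><div id="ember653"></div></body></html>'''

print(album_title(SAMPLE))
print(album_title('<html><body></body></html>'))
```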

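In this case it will work 5-7 times faster than with Splash.

Thank you! It really works! Thanks a lot for your help. May I ask one more thing? It works fine on iTunes, but when I tried to parse a different site to scrape players' names, there was no script tag containing "name". How can I solve this?

I suggest you build a small, simple JavaScript-based website to better understand how data gets delivered: AJAX, server-side rendering, and so on, as a bit of self-education. It will be very interesting and useful (: trust me.

In that case, as I said before, the algorithm is: go to Chrome -> open the page source -> find nothing -> open the inspect tool -> Network -> XHR -> refresh the page -> got three XHRs, which are:

Wow... that's great. I finally understand how the page loads its data over the network. I slowly followed your guidance and in the end got exactly what I wanted! I learned about XHR and the Network tab of the inspect tool from your kind answer. Thank you very much, Michael!!!

You're welcome! Good luck! (: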
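Once the Network tab shows which XHR carries the data, the scraping step usually comes down to parsing a JSON body. The sketch below is an illustration only: the keys `players` and `name` and the sample payload are invented, since the thread does not show the actual responses.

```python
import json


def extract_names(body, list_key='players', name_key='name'):
    """Pull name fields out of a JSON XHR response body (keys are hypothetical)."""
    data = json.loads(body)
    # Skip entries that lack the name field instead of raising KeyError
    return [item[name_key] for item in data.get(list_key, []) if name_key in item]


# Made-up payload standing in for whatever the real XHR returns
sample_body = '{"players": [{"name": "Player A"}, {"name": "Player B"}, {"id": 3}]}'

print(extract_names(sample_body))
```

In a spider, the same function could be applied to response.text in the callback of a scrapy.Request sent to the XHR address found in the Network tab.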
