Python: Scraping an element inside a tag attribute - Scrapy

python tags web-scraping scrapy

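I am scraping a video website with Scrapy, and I am having some difficulty extracting one value.

My current select statement gives the following result: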

id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf

I need an hxs.select statement that extracts only the image URL (the url_bigthumb value) from the embed code above.

I tried:

item['thumb'] = hxs.select("//embed[@id='flash-player-embed']/@flashvars/@url_bigthumb").extract()[0]

But it does not work.

Any help from the Scrapy or Python community would be much appreciated, as it will save me precious megabytes.

Thanks in advance.

My suggestion is to use the split function to get the exact result.

For example:

hxs.select("//embed[@id='flash-player-embed']/@flashvars").extract()[0].split('url_bigthumb=')[1].split('key')[0].replace('&amp;','').replace('&','').strip()

This is the simplest approach you can use for now, but you may want to wait for a better answer.

Thanks
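As a rough sketch of the split approach above, the same idea can be tried on a plain string without a running spider. The helper name thumb_from_flashvars is hypothetical, and the sample value is the flashvars string quoted in this thread. Splitting on '&' rather than on 'key' is a small variation, assumed here so the result does not depend on which parameter follows url_bigthumb.

```python
# Sample flashvars value, taken from the embed tag quoted in this thread.
flashvars = (
    "id_video=7845976&theskin=default"
    "&url_bigthumb=http://sample.com/image.jpg"
    "&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1"
)


def thumb_from_flashvars(value):
    """Return the text between 'url_bigthumb=' and the next '&'.

    Hypothetical helper, not part of Scrapy. Raises IndexError if the
    parameter is missing, like the one-liner it is based on.
    """
    after = value.split("url_bigthumb=", 1)[1]
    # '&amp;' also starts with '&', so an escaped separator is cut off too.
    return after.split("&", 1)[0].strip()


thumb = thumb_from_flashvars(flashvars)
print(thumb)
```

In a spider, the value passed in would be the string returned by the @flashvars XPath rather than a literal.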

A quick solution using a regex:

re.findall(r'https?://[^\s<>&"]+|www\.[^\s<>&"]+', item['thumb'])[0]
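For anyone who wants to try the regex outside a spider, here is a minimal standalone sketch. It assumes the string being searched is the flashvars value quoted in this thread, and the names URL_PATTERN and first_url are only illustrative. It uses https? so that both http and https URLs match.

```python
import re

# Assumed input: the flashvars value from the embed tag in this thread.
flashvars = (
    "id_video=7845976&theskin=default"
    "&url_bigthumb=http://sample.com/image.jpg"
    "&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1"
)

# Match http(s) URLs or bare www. hosts, stopping at whitespace,
# angle brackets, '&' or a double quote.
URL_PATTERN = re.compile(r'https?://[^\s<>&"]+|www\.[^\s<>&"]+')

matches = URL_PATTERN.findall(flashvars)
# Guard against an empty result instead of indexing [0] blindly.
first_url = matches[0] if matches else None
print(first_url)
```

Note that this picks the first URL in the string, whichever parameter it belongs to, so it only returns the thumbnail when url_bigthumb is the first URL-valued parameter.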

Use the .re() method to apply a regular expression after the XPath selection:

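>>> from scrapy.selector import Selector
>>> sel = Selector(text="""<embed width="588" height="476" flashvars="id_video=7845976&amp;theskin=default&amp;url_bigthumb=http://sample.com/image.jpg&amp;key=4219e347d8fdc0be3103eb3cbb458258-1416371743&amp;categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf" wmode="transparent" id="flash-player-embed" type="application/x-shockwave-flash">""")
>>> sel.xpath("//embed/@flashvars").re('url_bigthumb=([^&]+)')
[u'http://sample.com/image.jpg']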
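If Scrapy is not installed, the same two-step idea (read the flashvars attribute, then apply a regex to it) can be sketched with the standard library only. This is a substitute for the Selector call above, not Scrapy's API: it uses html.parser in place of XPath, and the class name FlashvarsCollector is hypothetical.

```python
import re
from html.parser import HTMLParser

# The embed tag quoted in this thread.
EMBED_HTML = (
    '<embed width="588" height="476" '
    'flashvars="id_video=7845976&amp;theskin=default'
    '&amp;url_bigthumb=http://sample.com/image.jpg'
    '&amp;key=4219e347d8fdc0be3103eb3cbb458258-1416371743'
    '&amp;categories=cat1" '
    'allowscriptaccess="always" allowfullscreen="true" quality="high" '
    'src="http://static.sample.com/swf/xv-player.swf" wmode="transparent" '
    'id="flash-player-embed" type="application/x-shockwave-flash">'
)


class FlashvarsCollector(HTMLParser):
    """Collect the flashvars attribute of every <embed> tag."""

    def __init__(self):
        super().__init__()
        self.flashvars = []

    def handle_starttag(self, tag, attrs):
        if tag != "embed":
            return
        for name, value in attrs:
            # Attribute values arrive unescaped, so '&amp;' is already '&'.
            if name == "flashvars" and value is not None:
                self.flashvars.append(value)


collector = FlashvarsCollector()
collector.feed(EMBED_HTML)

# Same pattern as in the .re() call: capture everything up to the next '&'.
thumbs = [
    match
    for value in collector.flashvars
    for match in re.findall(r"url_bigthumb=([^&]+)", value)
]
print(thumbs)
```

Inside a spider the Selector version is shorter and is probably the better fit; this variant is only meant for checking the pattern in isolation.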

The urlparse module also offers a nice way to get the element:

>>> from urlparse import parse_qs, urlparse  # Python 2 module; Python 3 moved these to urllib.parse
>>> url = '?' + 'id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf'

>>> print parse_qs(urlparse(url).query)['url_bigthumb']
['http://sample.com/image.jpg']
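The snippet above is Python 2. A minimal Python 3 sketch of the same idea follows, assuming the input is just the flashvars value (without the trailing embed attributes). In that case the string is already a query string, so parse_qs can be applied directly and urlparse is not needed. The variable names are illustrative.

```python
from urllib.parse import parse_qs

# Assumed input: the flashvars value from the embed tag in this thread.
flashvars = (
    "id_video=7845976&theskin=default"
    "&url_bigthumb=http://sample.com/image.jpg"
    "&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1"
)

# parse_qs returns a dict that maps each parameter name to a list of values.
params = parse_qs(flashvars)

# Take the first (and here only) value of url_bigthumb.
thumb_url = params["url_bigthumb"][0]
print(thumb_url)
```

Unlike the split and regex variants, this reads the parameter by name, so it should keep working if the parameters appear in a different order.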

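Thank you very much. Your solution worked, though I had to modify it slightly. Please let me know whether this gives me the fastest crawl speed, or whether I should try one of the other answers here.

Actually, you can use this approach for now. To improve your script, try the answer by Nima Soroush that uses urlparse. I tried it as well, and it works well.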
