How to extract (href, alt) pairs using Python Scrapy

I have an HTML page (seed) in the following format:

<div class="sth1">
    <table cellspacing="6" width="600">
        <tr>
            <td>
                <a href="link1"><img alt="alt1" border="0" height="22" src="img1" width="92"></a>
            </td>
            <td>
                <a href="link1">name1</a>
            </td>
            <td>
                <a href="link2"><img alt="alt2" border="0" height="22" src="img2" width="92"></a>
            </td>
            <td>
                <a href="link2">name2</a>
            </td>
        </tr>
    </table>
</div>

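The desired output is the list of (href, alt) pairs:

link1, alt1
link2, alt2

Here is an example from the Scrapy shell, where index.html contains the sample HTML provided in the question:

$ scrapy shell ./index.html
In [1]: for cell in response.xpath("//div[@class='sth1']/table/tr/td"):
   ...:     href = cell.xpath("a/@href").extract()
   ...:     alt = cell.xpath("a/img/@alt").extract()
   ...:     print(href, alt)

['link1'] ['alt1']
['link1'] []
['link2'] ['alt2']
['link2'] []

Cells whose link contains only text have no img element, so their alt list is empty.

Alternatively, you can try combining Scrapy's built-in SelectorList with Python's zip():

from scrapy.selector import SelectorList

# Select only the cells whose link wraps an image.
xpq = '//div[@class="sth1"]/table/tr/td[./a/img]'
cells = SelectorList(response.xpath(xpq))

# Extract the strings before pairing; list() is needed on Python 3, where zip() is lazy.
list(zip(cells.xpath('a/@href').extract(), cells.xpath('a/img/@alt').extract()))
=> [('link1', 'alt1'), ('link2', 'alt2')]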