
Python: limit web-scraping extraction to once per XPath item; too many duplicates returned

Tags: python, xpath, web-crawler, scrapy

I am using the following web crawling script to extract certain elements of the page, but it returns the same information over and over, which complicates the post-processing I have to do. Is there a good way to limit these extractions to once per XPath item?

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
#from hz_sample.items import HzSampleItem

class DmozSpider(BaseSpider):
    name = "hzIII"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")

        for titles in titles:
            tester = titles.xpath('//*[@id="div_a1"]/div[3][1]').extract()
            #jester = titles.xpath('//*[@id="div_a1"]/div[2]').extract()
            print tester
This is what my current output looks like (a link to a Dropbox file).

The output should look like this:

[u'<div class="content16">\r\n<span class="zi18b">\u25ce \u57fa\u672c\u89e3\u91ca</span><br>\r\n\u6bd6 <br>b\xec <br>\u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af\uff09\u3002 <br>\u64cd\u52b3\uff1a\u201c\u65e0\u6bd6\u4e8e\u6064\u201d\u3002 <br>\u53e4\u540c\u201c\u6ccc\u201d\uff0c\u6cc9\u6c34\u5192\u51fa\u6d41\u6dcc\u7684\u6837\u5b50\u3002 <br> <br>\u7b14\u753b\u6570\uff1a9\uff1b <br>\u90e8\u9996\uff1a\u6bd4\uff1b <br>\u7b14\u987a\u7f16\u53f7\uff1a153545434 <br><br><br>\r\n</div>'] [u'<div class="text16"><span class="zi18b">\u25ce \u5b57\u5f62\u7ed3\u6784</span><br>[ <span class="b">\u9996\u5c3e\u5206\u89e3\u67e5\u5b57</span> ]\uff1a\u6bd4\u5fc5(bibi)\n\u3000[ <span class="b">\u6c49\u5b57\u90e8\u4ef6\u6784\u9020</span> ]\uff1a\u6bd4\u5fc5\n<br>[ <span class="b">\u7b14\u987a\u7f16\u53f7</span> ]\uff1a153545434<br>\n[ <span class="b">\u7b14\u987a\u8bfb\u5199</span> ]\uff1a\u6a2a\u6298\u6487\u6298\u637a\u6298\u637a\u6487\u637a<br>\n<br><hr class="hr"></div>']
But with the current setup, the desired output is returned far too many times, like this:

[u'<div class="content16">\r\n<span class="zi18b">\u25ce \u57fa\u672c\u89e3\u91ca</span><br>\r\n\u6bd6 <br>b\xec <br>\u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af\uff09\u3002 <br>\u64cd\u52b3\uff1a\u201c\u65e0\u6bd6\u4e8e\u6064\u201d\u3002 <br>\u53e4\u540c\u201c\u6ccc\u201d\uff0c\u6cc9\u6c34\u5192\u51fa\u6d41\u6dcc\u7684\u6837\u5b50\u3002 <br> <br>\u7b14\u753b\u6570\uff1a9\uff1b <br>\u90e8\u9996\uff1a\u6bd4\uff1b <br>\u7b14\u987a\u7f16\u53f7\uff1a153545434 <br><br><br>\r\n</div>'] [u'<div class="text16"><span class="zi18b">\u25ce \u5b57\u5f62\u7ed3\u6784</span><br>[ <span class="b">\u9996\u5c3e\u5206\u89e3\u67e5\u5b57</span> ]\uff1a\u6bd4\u5fc5(bibi)\n\u3000[ <span class="b">\u6c49\u5b57\u90e8\u4ef6\u6784\u9020</span> ]\uff1a\u6bd4\u5fc5\n<br>[ <span class="b">\u7b14\u987a\u7f16\u53f7</span> ]\uff1a153545434<br>\n[ <span class="b">\u7b14\u987a\u8bfb\u5199</span> ]\uff1a\u6a2a\u6298\u6487\u6298\u637a\u6298\u637a\u6487\u637a<br>\n<br><hr class="hr"></div>']
[u'<div class="content16">\r\n<span class="zi18b">\u25ce \u57fa\u672c\u89e3\u91ca</span><br>\r\n\u6bd6 <br>b\xec <br>\u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af\uff09\u3002 <br>\u64cd\u52b3\uff1a\u201c\u65e0\u6bd6\u4e8e\u6064\u201d\u3002 <br>\u53e4\u540c\u201c\u6ccc\u201d\uff0c\u6cc9\u6c34\u5192\u51fa\u6d41\u6dcc\u7684\u6837\u5b50\u3002 <br> <br>\u7b14\u753b\u6570\uff1a9\uff1b <br>\u90e8\u9996\uff1a\u6bd4\uff1b <br>\u7b14\u987a\u7f16\u53f7\uff1a153545434 <br><br><br>\r\n</div>'] [u'<div class="text16"><span class="zi18b">\u25ce \u5b57\u5f62\u7ed3\u6784</span><br>[ <span class="b">\u9996\u5c3e\u5206\u89e3\u67e5\u5b57</span> ]\uff1a\u6bd4\u5fc5(bibi)\n\u3000[ <span class="b">\u6c49\u5b57\u90e8\u4ef6\u6784\u9020</span> ]\uff1a\u6bd4\u5fc5\n<br>[ <span class="b">\u7b14\u987a\u7f16\u53f7</span> ]\uff1a153545434<br>\n[ <span class="b">\u7b14\u987a\u8bfb\u5199</span> ]\uff1a\u6a2a\u6298\u6487\u6298\u637a\u6298\u637a\u6487\u637a<br>\n<br><hr class="hr"></div>']
[u'<div class="content16">\r\n<span class="zi18b">\u25ce \u57fa\u672c\u89e3\u91ca</span><br>\r\n\u6bd6 <br>b\xec <br>\u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af\uff09\u3002 <br>\u64cd\u52b3\uff1a\u201c\u65e0\u6bd6\u4e8e\u6064\u201d\u3002 <br>\u53e4\u540c\u201c\u6ccc\u201d\uff0c\u6cc9\u6c34\u5192\u51fa\u6d41\u6dcc\u7684\u6837\u5b50\u3002 <br> <br>\u7b14\u753b\u6570\uff1a9\uff1b <br>\u90e8\u9996\uff1a\u6bd4\uff1b <br>\u7b14\u987a\u7f16\u53f7\uff1a153545434 <br><br><br>\r\n</div>'] [u'<div class="text16"><span class="zi18b">\u25ce \u5b57\u5f62\u7ed3\u6784</span><br>[ <span class="b">\u9996\u5c3e\u5206\u89e3\u67e5\u5b57</span> ]\uff1a\u6bd4\u5fc5(bibi)\n\u3000[ <span class="b">\u6c49\u5b57\u90e8\u4ef6\u6784\u9020</span> ]\uff1a\u6bd4\u5fc5\n<br>[ <span class="b">\u7b14\u987a\u7f16\u53f7</span> ]\uff1a153545434<br>\n[ <span class="b">\u7b14\u987a\u8bfb\u5199</span> ]\uff1a\u6a2a\u6298\u6487\u6298\u637a\u6298\u637a\u6487\u637a<br>\n<br><hr class="hr"></div>']
[u'<div class="content16">\r\n<span class="zi18b">\u25ce \u57fa\u672c\u89e3\u91ca</span><br>\r\n\u6bd6 <br>b\xec <br>\u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af\uff09\u3002 <br>\u64cd\u52b3\uff1a\u201c\u65e0\u6bd6\u4e8e\u6064\u201d\u3002 <br>\u53e4\u540c\u201c\u6ccc\u201d\uff0c\u6cc9\u6c34\u5192\u51fa\u6d41\u6dcc\u7684\u6837\u5b50\u3002 <br> <br>\u7b14\u753b\u6570\uff1a9\uff1b <br>\u90e8\u9996\uff1a\u6bd4\uff1b <br>\u7b14\u987a\u7f16\u53f7\uff1a153545434 <br><br><br>\r\n</div>'] [u'<div class="text16"><span class="zi18b">\u25ce \u5b57\u5f62\u7ed3\u6784</span><br>[ <span class="b">\u9996\u5c3e\u5206\u89e3\u67e5\u5b57</span> ]\uff1a\u6bd4\u5fc5(bibi)\n\u3000[ <span class="b">\u6c49\u5b57\u90e8\u4ef6\u6784\u9020</span> ]\uff1a\u6bd4\u5fc5\n<br>[ <span class="b">\u7b14\u987a\u7f16\u53f7</span> ]\uff1a153545434<br>\n[ <span class="b">\u7b14\u987a\u8bfb\u5199</span> ]\uff1a\u6a2a\u6298\u6487\u6298\u637a\u6298\u637a\u6487\u637a<br>\n<br><hr class="hr"></div>']
[u'<div class="content16">\r\n<span class="zi18b">\u25ce \u57fa\u672c\u89e3\u91ca</span><br>\r\n\u6bd6 <br>b\xec <br>\u8c28\u614e\uff1a\u60e9\u524d\u6bd6\u540e\uff08\u63a5\u53d7\u8fc7\u53bb\u5931\u8d25\u7684\u6559\u8bad\uff0c\u4ee5\u540e\u5c0f\u5fc3\u4e0d\u91cd\u72af\uff09\u3002 <br>\u64cd\u52b3\uff1a\u201c\u65e0\u6bd6\u4e8e\u6064\u201d\u3002 <br>\u53e4\u540c\u201c\u6ccc\u201d\uff0c\u6cc9\u6c34\u5192\u51fa\u6d41\u6dcc\u7684\u6837\u5b50\u3002 <br> <br>\u7b14\u753b\u6570\uff1a9\uff1b <br>\u90e8\u9996\uff1a\u6bd4\uff1b <br>\u7b14\u987a\u7f16\u53f7\uff1a153545434 <br><br><br>\r\n</div>'] [u'<div class="text16"><span class="zi18b">\u25ce \u5b57\u5f62\u7ed3\u6784</span><br>[ <span class="b">\u9996\u5c3e\u5206\u89e3\u67e5\u5b57</span> ]\uff1a\u6bd4\u5fc5(bibi)\n\u3000[ <span class="b">\u6c49\u5b57\u90e8\u4ef6\u6784\u9020</span> ]\uff1a\u6bd4\u5fc5\n<br>[ <span class="b">\u7b14\u987a\u7f16\u53f7</span> ]\uff1a153545434<br>\n[ <span class="b">\u7b14\u987a\u8bfb\u5199</span> ]\uff1a\u6a2a\u6298\u6487\u6298\u637a\u6298\u637a\u6487\u637a<br>\n<br><hr class="hr"></div>']
I think what you want is

 tester = titles.xpath('(//*[@id="div_a1"]/div[3])[1]').extract()

if by "limiting the extraction" you mean retrieving only the first node of the result set, instead of doing

print tester[0]

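    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select the document root once instead of looping over every <p>.
        root = hxs.select("/")

        # Each query now runs a single time, so each result is printed once.
        retester = root.xpath('//*[@id="div_a1"]/div[2]').extract()
        tester = root.xpath('//*[@id="div_a1"]/div[3]').extract()
        print tester, retester
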
from scrapy.spider import BaseSpider

class Spider(BaseSpider):
    name = "hzIII"
    allowed_domains = ["tool.httpcn.com"]
    start_urls = ["http://tool.httpcn.com/Html/Zi/28/PWMETBAZTBTBBDTB.shtml"]

    def parse(self, response):
        # Query the response directly; no intermediate selector or loop is needed.
        print response.xpath('//*[@id="div_a1"]/div[2]').extract()
        print response.xpath('//*[@id="div_a1"]/div[3]').extract()
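
The duplicates in the question most likely come from running a document-wide query inside a loop over every <p> element: the query does not depend on the loop variable, so the same node comes back once per <p>. Here is a minimal sketch of that effect. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the tiny page below is made up for illustration. ElementTree supports only a subset of XPath and rejects absolute paths on elements, so the document-wide query is written against the root explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature page: three <p> elements plus the target container.
PAGE = """
<html><body>
  <p>one</p><p>two</p><p>three</p>
  <div id="div_a1">
    <div>first</div>
    <div>second</div>
    <div class="content16">third</div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)
TARGET = ".//*[@id='div_a1']/div[3]"

# Pattern from the question: iterate over every <p>, but run a query that
# ignores the loop variable, so the same node is collected once per <p>.
looped = []
for _p in root.findall(".//p"):
    looped.extend(el.text for el in root.findall(TARGET))

# Pattern from the answers: run the document-wide query exactly once.
once = [el.text for el in root.findall(TARGET)]

print(looped)
print(once)
```

With three <p> elements the looped version collects the target three times, while the single query collects it once. In Scrapy the same applies: an XPath starting with // searches the whole document even when called on a sub-selector, so either drop the loop or use a relative path starting with .// inside it.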