Scrapy does not crawl this page

I want to crawl the page

http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B

with Scrapy, but there seems to be a problem: I do not get any data when crawling it.

Here is my spider code:

import scrapy
from scrapy.selector import Selector
from scrapy_Data.items import CharProt


class CPSpider(scrapy.Spider):

    name = "CharProt"
    allowed_domains = ["jcvi.org"]
    start_urls = ["http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//*[@id="middle_content_template"]/table/tbody/tr')

        for site in sites:
            item = CharProt()
            item['protein_name'] = site.xpath('td[1]/a/text()').extract()
            item['pn_link'] = site.xpath('td[1]/a/@href').extract()
            item['organism'] = site.xpath('td[2]/a/text()').extract()
            item['organism_link'] = site.xpath('td[2]/a/@href').extract()
            item['status'] = site.xpath('td[3]/a/text()').extract()
            item['status_link'] = site.xpath('td[3]/a/@href').extract()
            item['references'] = site.xpath('td[4]/a').extract()
            item['source'] = "CharProt"
            # collection.update({"protein_name": item['protein_name']}, dict(item), upsert=True)
            yield item

Here is the log:

2016-05-28 17:25:06 [scrapy] INFO: Spider opened
2016-05-28 17:25:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 17:25:06 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 17:25:07 [scrapy] DEBUG: Crawled (200) <GET http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B> (referer: None)
<200 http://www.jcvi.org/charprotdb/index.cgi/l_search?terms.1.field=all&terms.1.search_text=cancer&submit=+++Search+++&sort.key=organism&sort.order=%2B>
2016-05-28 17:25:08 [scrapy] INFO: Closing spider (finished)
2016-05-28 17:25:08 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 337,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 26198,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 5, 28, 9, 25, 8, 103577),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 5, 28, 9, 25, 6, 55848)}

My other spiders all run fine. Can anyone tell me what is wrong with my code? Or is there something wrong with this web page?

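You are crawling the page, but your XPath is wrong.

When you inspect an element with a browser, a <tbody> tag appears, but it is not anywhere in the page source, so the XPath matches nothing and no items are scraped.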

sites = sel.xpath('//*[@id="middle_content_template"]/table/tr')

This should work.
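The effect can be reproduced offline. The sketch below is only an illustration: the HTML fragment is made up (it is not the real page), and it uses xml.etree.ElementTree from the standard library as a stand-in for Scrapy's selector, which supports only a subset of XPath. It compares the path copied from the browser inspector with the path that follows the raw source.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: like the raw page source, it contains no <tbody> element.
RAW_SOURCE = """
<html><body>
  <div id="middle_content_template">
    <table>
      <tr><td><a href="/p/1">protein A</a></td></tr>
      <tr><td><a href="/p/2">protein B</a></td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(RAW_SOURCE)

# Path as shown by the browser inspector: expects a <tbody> that is not there.
with_tbody = root.findall(".//*[@id='middle_content_template']/table/tbody/tr")

# Path that follows the raw source.
without_tbody = root.findall(".//*[@id='middle_content_template']/table/tr")

print(len(with_tbody), len(without_tbody))
```

Viewing the raw page source, rather than the element inspector, shows the markup as the spider actually receives it.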

EDIT

As a side note, extract() returns a list, not the element you want, so you need to use the extract_first() method or extract()[0].

Your XPath is wrong.

You do not need tbody to access the table rows; just use table/tr.

The correct XPath would be:

sites = sel.xpath('//*[@id="middle_content_template"]//table//tr')

A better XPath is:

sites = response.xpath('//table[@class="search_results"]/tr')

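As you can see in the example above, you do not need to create a Selector object via Selector(response) in order to select with XPath.

In newer Scrapy versions, a selector attribute has been added to the response class, and it can be used as follows:

response.selector.xpath(...)

or, as a shortcut:

response.xpath(...)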

Thank you for the detailed answer.