How do I crawl multiple pages with Scrapy? (two levels)


On my website I created two simple pages. Here is the first one's HTML:

test1.html:

<head>
<title>test1</title>
</head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, 'C', '1', 'Product', 'N');" indepth="true">
<span>cool</span></a>
</body></html>
How do I get from test1 to test2? How can I successfully scrape the text of the second page's title tag? Thanks in advance.

To use multiple functions in your code, sending multiple requests and parsing them, you need two things: (1) yield instead of return, and (2) a callback.

例如:

def parse(self, response):
    for site in response.xpath('//head'):
        item = Website()
        item['title'] = site.xpath('//title/text()').extract()
        yield item
    # schedule a second request; Scrapy will call other_function
    # with its response (requires `import scrapy` at the top)
    yield scrapy.Request(url="http://www.domain.com", callback=self.other_function)

def other_function(self, response):
    for other_thing in response.xpath('//this_xpath'):
        item = Website()
        item['title'] = other_thing.xpath('//this/and/that').extract()
        yield item
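To see why yield matters here, a plain-Python sketch (no Scrapy; the parse_with_yield name is a hypothetical stand-in for a spider's parse method) shows that a generator can emit several items and then a follow-up "request" from the same function body, which a single return cannot:

```python
# Hypothetical stand-in for a Scrapy parse() callback: a generator can
# stream out items AND a follow-up request from one function.
def parse_with_yield(titles):
    for t in titles:
        yield {"title": t}                      # one item per match
    # after the items, hand the engine another URL to fetch
    yield {"request": "http://www.domain.com"}

results = list(parse_with_yield(["test1"]))
print(results)  # [{'title': 'test1'}, {'request': 'http://www.domain.com'}]
```

Scrapy's engine consumes the generator the same way list() does here, dispatching items to pipelines and requests to the scheduler.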
You can't execute JavaScript with Scrapy, but you can work out what the JavaScript does and do the same thing yourself:

from scrapy.spiders import Spider
from scrapy.selector import Selector

from testscrapy1.items import Website

class DmozSpider(Spider):
    name = "bill"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["exemple.com"]
    start_urls = [
        "http://www.exemple.com/test1.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//head')
        items = []

        for site in sites:
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            items.append(item)

        return items