Python: What's an efficient way to scrape multiple pages on one site without spawning/creating a request/method for each page in Scrapy?

Tags: python, web-scraping, scrapy, twisted, scrapy-spider

As an example, I'm using Yelp. Yelp doesn't list emails, so if you want emails for Yelp listings you have to scrape a listing, then request that listing's own website and scrape it for the email. Currently I crawl to a listing's website and check its home page; if the email, phone number, etc. aren't listed there, I load the contact page and check that instead. The problem I'm running into is that the information I'm looking for isn't always on those two pages. Ideally I would load every link on the site that contains certain keywords, then have one method that looks through all of those pages for the email, phone number, etc. and returns once it's found. What's a good way to do that? Here's how I currently crawl through a site's pages:

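        # Note: these rules and callbacks live inside a CrawlSpider subclass;
        # the class definition and imports are omitted here.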
        rules = (
            Rule(LinkExtractor(allow=r'biz', restrict_xpaths='//*[contains(@class, "natural-search-result")]//a[@class="biz-name"]'), callback='parse_item', follow=True),
            Rule(LinkExtractor(allow=r'start', restrict_xpaths='//a[contains(@class, "prev-next")]'), follow=True)
        )

        def parse_item(self, response):
            i = YelpscraperItem()
            i['phone'] = self.beautify(response.xpath('//*[@class="biz-phone"]/text()').extract())
            i['state'] = self.beautify(response.xpath('//span[@itemprop="addressRegion"]/text()').extract())
            i['company'] = self.beautify(response.xpath('//h1[contains(@class, "biz-page-title")]/text()').extract())

            website = i['website'] = self.beautify(response.xpath('//div[@class="biz-website"]/a/text()').extract())
            if type(website) is list and website:
                website = self.checkScheme(website[0])
                request = Request(website, callback=self.parse_home_page, dont_filter=True)
                request.meta['item'] = i
                yield request
            else:
                yield i

        def parse_home_page(self, response):
            try:
                i = response.meta['item']
                sel = Selector(response)
                rawEmail = sel.xpath("substring-after(//a[starts-with(@href, 'mailto:')]/@href, 'mailto:')").extract()
                if (type(rawEmail) is list) and ('@' in rawEmail[0]):
                    i = self.format_email(rawEmail, i, "Home Page (Link)")
                    yield i
                else:
                    rawContactPage = response.xpath("//a[contains(@href, 'contact')]/@href").extract()
                    if type(rawContactPage) is list and rawContactPage:
                        contactPage = rawContactPage[0]
                        contactPage = urlparse.urljoin(response.url, contactPage.strip())
                        request = Request(contactPage, callback=self.parse_contact_page, dont_filter=True)
                        request.meta['item'] = i
                        request.meta['home-page-response'] = response
                        yield request
                    else:
                        yield i
            except TypeError as er:
                print er

        def parse_contact_page(self, response):
            try:
                i = response.meta['item']
                homePageResponse = response.meta['home-page-response']
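                # Fallback order: mailto link on the contact page -> raw text on the
                # contact page -> raw text on the home page -> the domain's whois record.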
                rawEmail = response.xpath("substring-after(//a[starts-with(@href, 'mailto:')]/@href, 'mailto:')").extract()
                if (type(rawEmail) is list) and ('@' in rawEmail[0]):
                    i = self.format_email(rawEmail, i, "Contact Page (Link)")
                elif (type(rawEmail) is list) and (rawEmail[0] == ''):
                    rawEmail = response.xpath('//body').re(r'[a-zA-Z0-9\.\-+_]+@[a-zA-Z0-9\.\-+_]+\.[A-Za-z]{2,3}')
                    if (type(rawEmail) is list) and rawEmail:
                        i = self.format_email(rawEmail, i, "Contact Page (Text)")
                    else:
                        rawEmail = homePageResponse.xpath('//body').re(r'[a-zA-Z0-9\.\-+_]+@[a-zA-Z0-9\.\-+_]+\.[A-Za-z]{2,3}')
                        if (type(rawEmail) is list) and rawEmail:
                            i = self.format_email(rawEmail, i, "Home Page (Text)")
                        else:
                            rawEmail = [self.get_whois_email(i)]
                            i = self.format_email(rawEmail, i, "Whois Page")
                yield i
            except TypeError as er:
                print er

        def get_whois_email(self, i):
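            # Last resort: look up the domain's whois record and keep an address
            # whose domain matches the site itself or a whitelisted provider.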
            email = ""
            try:
                if 'website' in i.keys():
                    website = i['website']
                    if type(website) is list:
                        website = i['website'][0].lower()
                    w = whois.whois(website)
                for whoisEmail in w.emails:
                    whoisEmail = whoisEmail.lower()
                    if website in whoisEmail:
                        email = whoisEmail
                    else:
                        for domain in self.whiteListed:
                            if domain in whoisEmail:
                                email = whoisEmail
            except IndexError as er:
                log.msg("Whois Email IndexError: %s" % er)
            return email

That's just how Scrapy works, because it's built on Twisted, an asynchronous framework: every crawled page is handled by a callback.


As far as I can tell, the only way to pass information from one callback to another is through the request's meta attribute.
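
To make that concrete, here is a minimal sketch (my own code, not from the question) of the pattern: one callback gathers every on-site link whose URL contains a keyword, and a single generic check is reused for each page, threading the partially-filled item and the list of still-unchecked URLs through request.meta. The spider name, the keywords list, the email field on the item, and the check_page/parse_candidate_page helpers are all hypothetical; it also assumes an earlier request stored the item in meta, as in the question's code.

    import scrapy


    class EmailSpiderSketch(scrapy.Spider):
        # Sketch only: hunt for an email with one generic check instead of a
        # dedicated request/method per page.
        name = "email_sketch"
        keywords = ["contact", "about", "support"]  # hypothetical keyword list
        email_pattern = r"[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.+_-]+\.[A-Za-z]{2,}"

        def parse_home_page(self, response):
            item = response.meta["item"]  # assumes an earlier request set this
            # Gather every link whose URL contains one of the keywords.
            candidates = [
                response.urljoin(href)
                for href in response.xpath("//a/@href").extract()
                if any(kw in href.lower() for kw in self.keywords)
            ]
            for result in self.check_page(item, candidates, response):
                yield result

        def check_page(self, item, candidates, response):
            # The single generic check: stop at the first page with an email.
            emails = response.xpath("//body").re(self.email_pattern)
            if emails:
                item["email"] = emails[0]  # assumes the item has an email field
                yield item
            elif candidates:
                # No email here: try the next candidate page, threading the
                # item and the remaining URLs through request.meta.
                request = scrapy.Request(candidates[0],
                                         callback=self.parse_candidate_page,
                                         dont_filter=True)
                request.meta["item"] = item
                request.meta["candidates"] = candidates[1:]
                yield request
            else:
                yield item  # every candidate checked; no email found

        def parse_candidate_page(self, response):
            for result in self.check_page(response.meta["item"],
                                          response.meta["candidates"],
                                          response):
                yield result

With this shape you define one extra callback no matter how many candidate pages a site has; meta carries the same item through the whole chain of requests.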
