Scraping external websites from within the main project using the Python Scrapy framework

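I have been looking for a better way to scrape external websites that are linked from a main source website. To explain it better, let me use yelp.com as an example of what I want to do (although my target is not Yelp).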

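  • I would scrape the title and address.
  • I would visit the link the title leads to in order to get the company's website.
  • I want to extract the email addresses from the source of the company's main website. (I know this is hard, but I am not crawling every page; I assume most sites have "contact" in the URL, e.g. site.com/contact.php.)
  • The point is that while I scrape data from Yelp and store it in a field, I also want to get external data from the company's main website.

Below is my code. I don't know how to do this with Scrapy.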

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from comb.items import CombItem, SiteItem
    
    class ComberSpider(CrawlSpider):
        name = "comber"
        allowed_domains = ["example.com"]
        query = 'shoe'
        page = 'http://www.example.com/corp/' + query + '/1.html'
        start_urls = (
            page,
        )
        rules = (Rule(LinkExtractor(allow=(r'corp/.+/\d+\.html'), restrict_xpaths=("//a[@class='next']")),
                      callback="parse_items", follow=True),
                 )
    
    
        def parse_items(self, response):
    
            for sel in response.xpath("//div[@class='item-main']"):
                item = CombItem()
                item['company_name'] = sel.xpath("h2[@class='title']/a/text()").extract()
                item['contact_url'] = sel.xpath("div[@class='company']/a/@href").extract()[0]
                item['gold_supplier'] = sel.xpath("div[@class='item-title']/a/@title").extract()[0]
                company_details = sel.xpath("div[@class='attrs']/div[@class='attr']/span['name']/text()").extract()
    
                item = self.parse_meta(sel, item, company_details)
                request = scrapy.Request(item['contact_url'], callback=self.parse_site)
                request.meta['item'] = item
    
                yield request
    
        def parse_meta(self, sel, item, company_details):
    
            if (company_details):
                if "Products:" in company_details:
                    item['main_products'] = sel.xpath("./div[@class='value']//text()").extract()
                if "Country/Region:" in company_details:
    
                    item['country'] = sel.xpath("./div[@class='right']"
                                            + "/span[@data-coun]/text()").extract()
                if "Revenue:" in company_details:
                    item['revenue'] = sel.xpath("./div[@class='right']/"
                                            + "span[@data-reve]/text()").extract()
                if "Markets:" in company_details:
                    item['markets'] = sel.xpath("./div[@class='value']/span[@data-mark]/text()").extract()
            return item
    
        def parse_site(self, response):
            item = response.meta['item']
            # item['websites'] will hold values such as http://target-company.com, http://any-other-website.com
            # My aim is to jump to http://company.com, scrape data from its contact page, and
            # store it in the item, e.g. item['emails'] = [info@company.com, sales@company.com]
    
            # How can this be done within this same project?
            # The only thing I can think of is to store item['websites'] and the other item values and create another project,
            # but even that would not work because of allowed_domains and start_urls.
    
            item['websites'] = response.xpath("//div[@class='company-contact-information']/table/tr/td/a/@href").extract()
            print(item)
            print('*'* 50)
            yield item
    
    
    
    # items.py
    from scrapy.item import Item, Field
    
    
    class CombItem(Item):
        company_name = Field()
        main_products = Field()
        contact_url = Field()
        revenue = Field()
        gold_supplier = Field()
        country = Field()
        markets = Field()
        Product_Home = Field()
        websites = Field()
        # emails = Field()  # not implemented: emails must be extracted from websites that differ from start_urls

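When you make the request, pass dont_filter=True. This turns off the OffsiteMiddleware for that request, so its URL will not be filtered by allowed_domains.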

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
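Once the off-site request is allowed, the remaining step from the question is pulling addresses out of a contact page. The following is a rough sketch using only the standard library; the contact paths are guesses based on the question's site.com/contact.php example, the regular expression is a deliberately simple approximation of an email address, and the helper names are hypothetical.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

# Simple approximation of an email address; it will miss unusual forms.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Guessed contact-page paths; adjust to the sites actually being crawled.
CONTACT_PATHS = ("/contact", "/contact.php", "/contact-us")


def candidate_contact_urls(site_url):
    """Build likely contact-page URLs for a company site."""
    return [urljoin(site_url, path) for path in CONTACT_PATHS]


def extract_emails(html):
    """Return unique, lowercased addresses in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(html):
        email = match.lower()
        if email not in seen:
            seen.append(email)
    return seen
```

In a callback that handles the company site, something like item['emails'] = extract_emails(response.text) would fill the field, and each URL from candidate_contact_urls could be requested with dont_filter=True in the same way as the company site itself.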