Python: crawl a complete domain and load all h1s into one item

I'm fairly new to Python and Scrapy. What I want to achieve is to crawl a number of websites, mostly company sites: crawl the whole domain, extract every h1, h2 and h3, and create one record that contains the domain name plus a single string with all the h1/h2/h3 headings from that domain. Basically, one item per domain with one big string holding all the headings.

The output I would like is: domain, string(h1, h2, h3) - collected from every URL on that domain.

The problem I'm running into is that every URL ends up as its own item. I know I haven't gotten very far yet, but a nudge in the right direction would be much appreciated. Essentially: how do I build an outer loop so that the yield statement keeps accumulating until the next domain starts?

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest
from scrapy.http import Request
from Autotask_Prospecting.items import AutotaskProspectingItem
from Autotask_Prospecting.items import WebsiteItem
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from nltk import clean_html


class MySpider(CrawlSpider):
    name = 'Webcrawler'
    allowed_domains = [ l.strip() for l in open('Domains.txt').readlines() ]
    start_urls = [ l.strip() for l in open('start_urls.txt').readlines() ]


    rules = (
        # Follow every link found and parse each page with parse_item.
        # follow=True is needed here: when a callback is given, CrawlSpider
        # does not follow links from the matched pages by default.
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
        )

    def parse_item(self, response):
        # collect every heading on the page into one WebsiteItem
        loader = XPathItemLoader(item=WebsiteItem(), response=response)
        loader.add_xpath('h1', "//h1/text()")
        loader.add_xpath('h2', "//h2/text()")
        loader.add_xpath('h3', "//h3/text()")
        yield loader.load_item()
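
For context, the WebsiteItem imported from Autotask_Prospecting.items is not shown in the post; it is assumed here to be a plain Scrapy Item with one field per heading level, roughly like this hypothetical sketch:

# hypothetical Autotask_Prospecting/items.py, assuming one field per heading
# level filled by the XPathItemLoader above
from scrapy.item import Item, Field

class WebsiteItem(Item):
    h1 = Field()
    h2 = Field()
    h3 = Field()
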
This can't be done as such: pages are crawled in parallel, and there is no way to make Scrapy crawl the domains one after another in sequence.

What you can do is accumulate the headings and yield the whole structure when the spider closes (close_spider), something like:

# this assumes your item looks like the following
from scrapy.item import Item, Field

class MyItem(Item):
    domain = Field()
    hs = Field()


import collections

class DomainPipeline(object):

    # one set of headings per domain, shared across all pages of the crawl
    accumulator = collections.defaultdict(set)

    def process_item(self, item, spider):
        self.accumulator[item['domain']].update(item['hs'])
        return item

    def close_spider(self, spider):
        for domain, hs in self.accumulator.items():
            yield MyItem(domain=domain, hs=hs)
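
For the pipeline to actually run it also has to be enabled in the project's settings.py. A minimal sketch, assuming the pipeline lives in a pipelines.py module of the Autotask_Prospecting project (the project name is taken from the imports above); in recent Scrapy versions ITEM_PIPELINES is a dict mapping the pipeline path to an order value, while the old 0.x releases accepted a plain list of paths:

# settings.py (hypothetical) -- lower numbers run earlier in the pipeline chain
ITEM_PIPELINES = {
    'Autotask_Prospecting.pipelines.DomainPipeline': 300,
}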
Usage:

>>> from scrapy.item import Item, Field
>>> class MyItem(Item):
...     domain = Field()
...     hs = Field()
... 
>>> from collections import defaultdict
>>> accumulator = defaultdict(set)
>>> items = []
>>> for i in range(10):
...     items.append(MyItem(domain='google.com', hs=[str(i)]))
... 
>>> items
[{'domain': 'google.com', 'hs': ['0']}, {'domain': 'google.com', 'hs': ['1']}, {'domain': 'google.com', 'hs': ['2']}, {'domain': 'google.com', 'hs': ['3']}, {'domain': 'google.com', 'hs': ['4']}, {'domain': 'google.com', 'hs': ['5']}, {'domain': 'google.com', 'hs': ['6']}, {'domain': 'google.com', 'hs': ['7']}, {'domain': 'google.com', 'hs': ['8']}, {'domain': 'google.com', 'hs': ['9']}]
>>> for item in items:
...     accumulator[item['domain']].update(item['hs'])
... 
>>> accumulator
defaultdict(<type 'set'>, {'google.com': set(['1', '0', '3', '2', '5', '4', '7', '6', '9', '8'])})
>>> for domain, hs in accumulator.items():
...     print MyItem(domain=domain, hs=hs)
... 
{'domain': 'google.com',
 'hs': set(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])}
>>> 

How does this accumulator work? I tried to implement your code, but I only get an undefined set. I created a new question about it.
I tried your new code and got an error on item['domain']; I used that syntax in the new question. The main problem is that the item values can't be fed into the defaultdict, because they behave like list objects. I need a way to convert those objects so the defaultdict can read them.
In the spider class I also had to add the accumulator as a global variable and close_spider(spider, set.) to get it to run at all, but it doesn't scrape anything, nothing at all. This is harder than I first thought...
Sorry, but you should not define global parameters, nor change the close_spider signature... keep practicing until you get it right.
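
One likely reason the comments above report that nothing gets scraped: Scrapy ignores anything returned or yielded from a pipeline's close_spider, so the aggregated items never reach the feed exporters. A workaround (my own sketch, not part of the original answer) is to have the pipeline persist the accumulated records itself when the spider closes:

# hypothetical pipelines.py variant that writes the aggregate out itself,
# since items yielded from close_spider are not exported by Scrapy
import collections
import json

class DomainDumpPipeline(object):

    def open_spider(self, spider):
        self.accumulator = collections.defaultdict(set)

    def process_item(self, item, spider):
        self.accumulator[item['domain']].update(item['hs'])
        return item

    def close_spider(self, spider):
        # one record per domain: {"example.com": ["Heading A", ...], ...}
        with open('domains.json', 'w') as f:
            json.dump(dict((d, sorted(hs)) for d, hs in self.accumulator.items()), f)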