
Python Scrapy: how to avoid grouping all the scraped information into one item


I'm having a problem with the data Scrapy collects. When I run this code in the terminal, all of the scraped information appears to be appended to a single item, like this:

{"fax": ["Fax: 617-638-4905", "Fax: 925-969-1795", "Fax: 913-327-1491", "Fax: 507-281-0291", "Fax: 509-547-1265", "Fax: 310-437-0585"], 
"title": ["Challenges in Musculoskeletal Rehabilitation", "17th Annual Spring Conference on Pediatric Emergencies", "19th Annual Association of Professors of Human & Medical Genetics (APHMG) Workshop & Special Interest Groups Meetings", "2013 AMSSM 22nd Annual Meeting", "61st Annual Meeting of Pacific Coast Reproductive Society (PCRS)", "Contraceptive Technology Conference 25th Anniversary", "Mid-America Orthopaedic Association 2013 Meeting", "Pain Management", "Peripheral Vascular Access Ultrasound", "SAGES 2013 / ISLCRS 8th International Congress"],  ... ...
…and so on.

The problem is that all the scraped information for every field ends up inside one item. I need the information emitted as separate items: each title should be associated with its own fax number (if one exists), its own location, and so on.

I don't want everything lumped together, because each piece of collected information is related to specific other pieces. The way I would eventually like it to go into a database looks like this:

"MedEconItem" 1: [title: "insert title 1 here", fax: "insert fax #1 here", location: "location 1" …]

"MedEconItem" 2: [title: "title 2", fax: "fax 2", location: "location 2" …]

"MedEconItem" 3: […and so on]
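In plain Python terms, the desired regrouping can be sketched with zip. This is only a sketch of the target data shape, not the eventual spider fix: it assumes the parallel per-field lists are equal in length and index-aligned, which holds only when no field is ever missing on the page (the titles and fax numbers are the sample values from above):

```python
# Sketch: turning parallel per-field lists into one dict per conference.
# Assumes equal-length, index-aligned lists -- fragile if any field is missing.
titles = ["Challenges in Musculoskeletal Rehabilitation",
          "17th Annual Spring Conference on Pediatric Emergencies"]
faxes = ["Fax: 617-638-4905", "Fax: 925-969-1795"]

items = [{"title": t, "fax": f} for t, f in zip(titles, faxes)]
print(items[0])
# {'title': 'Challenges in Musculoskeletal Rehabilitation', 'fax': 'Fax: 617-638-4905'}
```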

Any ideas on how to approach this? Does anyone know an easy way to separate this information? This is my first time working with Scrapy, so any advice is welcome. I have been searching everywhere and can't seem to find an answer.

Here is my current code:

import scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class MedEconItem(Item):
    title = Field()
    date = Field()
    location = Field()
    specialty = Field()
    contact = Field()
    phone = Field()
    fax = Field()
    email = Field()
    url = Field()

class autoupdate(BaseSpider):
    name = "medecon"
    allowed_domains = ["www.doctorsreview.com"]
    start_urls = [
        "http://www.doctorsreview.com/meetings/search/?region=united-states&destination=all&specialty=all&start=YYYY-MM-DD&end=YYYY-MM-DD",
    ]

    def serialize_field(self, field, name, value):
        if field == '':
            return super(MedEconItem, self).serialize_field(field, name, value)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html/body/div[@id="c"]/div[@id="meeting_results"]')
        items = []
        for site in sites:
            item = MedEconItem()
            item['title'] = site.select('//h3/a/text()').extract()
            item['date'] = site.select('//p[@class = "dls"]/span[@class = "date"]/text()').extract()
            item['location'] = site.select('//p[@class = "dls"]/span[@class = "location"]/a/text()').extract()
            item['specialty'] = site.select('//p[@class = "dls"]/span[@class = "specialties"]/text()').extract()
            item['contact'] = site.select('//p[@class = "contact"]/text()').extract()
            item['phone'] = site.select('//p[@class = "phone"]/text()').extract()
            item['fax'] = site.select('//p[@class = "fax"]/text()').extract()
            item['email'] = site.select('//p[@class = "email"]/text()').extract()
            item['url'] = site.select('//p[@class = "website"]/a/@href').extract()
            items.append(item)
        return item
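The grouping happens because an XPath that starts with `//` is always evaluated from the document root, even when called on the `site` context node, so every pass of the loop re-collects every title on the page. The difference can be illustrated outside Scrapy with lxml, which Scrapy's selectors are built on; the two-meeting HTML snippet below is invented for the demonstration:

```python
from lxml import html

# Invented two-meeting snippet mirroring the page layout described above.
doc = html.fromstring("""
<div id="meeting_results">
  <div class="result"><h3><a>Meeting A</a></h3><p class="fax">Fax: 111</p></div>
  <div class="result"><h3><a>Meeting B</a></h3><p class="fax">Fax: 222</p></div>
</div>
""")

results = doc.xpath('//div[@class="result"]')
for result in results:
    # Starts with '//': evaluated from the document root, ignoring `result`.
    document_wide = result.xpath('//h3/a/text()')   # ['Meeting A', 'Meeting B'] on every pass
    # Starts with './/': stays inside the current result node.
    node_local = result.xpath('.//h3/a/text()')     # exactly one title per pass
    print(document_wide, node_local)
```

The same rule applies to the spider: selecting one container node per meeting and then using `.//`-anchored paths inside the loop yields one item per meeting.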

OK, the code below seems to work, but unfortunately, since my XPath is terrible, it involves some blatant hacks. Someone more fluent in XPath may come up with a better solution later:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html/body/div[@id="c"]/div[@id="meeting_results"]//a[contains(@href,"meetings")]')
        items = []
        for site in sites[1:-1]:
            item = MedEconItem()
            item['title'] = site.select('./text()').extract()
            item['date'] = site.select('./following::p[@class = "dls"]/span[@class="date"]/text()').extract()[0]
            item['location'] = site.select('./following::p[@class = "dls"]/span[@class = "location"]/a/text()').extract()[0]
            item['specialty'] = site.select('./following::p[@class = "dls"]/span[@class = "specialties"]/text()').extract()[0]
            item['contact'] = site.select('./following::p[@class = "contact"]/text()').extract()[0]
            item['phone'] = site.select('./following::p[@class = "phone"]/text()').extract()[0]
            item['fax'] = site.select('./following::p[@class = "fax"]/text()').extract()[0]
            item['email'] = site.select('./following::p[@class = "email"]/text()').extract()[0]
            item['url'] = site.select('./following::p[@class = "website"]/a/@href').extract()[0]
            items.append(item)
        return items
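One remaining fragility in the version above: `extract()[0]` raises an IndexError for any meeting that lacks a given field, such as a fax number. A small default-returning helper avoids the crash (the name `first` is my own invention; newer Scrapy releases ship the equivalent as `extract_first()` / `get()` on selector lists):

```python
def first(values, default=''):
    """Return the first extracted string, or `default` when the list is empty."""
    return values[0] if values else default

# Inside the loop, the indexing lines would then read, for example:
#   item['fax'] = first(site.select('./following::p[@class = "fax"]/text()').extract())
print(first(['Fax: 617-638-4905']))  # -> Fax: 617-638-4905
print(first([], 'n/a')) 	     # -> n/a
```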

Comments:

I tried this code, but it raises a NotImplementedError. It says it crawled the site, but then reports an error while fetching: ERROR: Spider error processing …

That's strange. What version of Scrapy are you using? I suspect the NotImplementedError is thrown by serialize_field, since that method is implemented not by the Spider but by the item exporters. Comment out that function and see whether that solves the problem.

… (skipping some lines) File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 488, in startRunCallbacks self._runCallbacks() … File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 575 … in parse raise NotImplementedError exceptions.NotImplementedError: ''