Python Scrapy pipeline HTML parsing
I have a spider with 3 items: url, title and category. They load fine as raw HTML, but now I want to use html2text in the pipeline to convert the title and category to plain text. Here is my incorrect pipeline code. Can someone help me debug it? Thanks, everyone.
import html2text
import csv
from tutorial import settings

def write_to_csv(item):
    writer = csv.writer(open(settings.csv_file_path, 'a'), lineterminator='\n')
    writer.writerow([item[key] for key in item.keys()])

class TutorialPipeline(object):
    def process_item(self, item, spider):
        h = html2text.HTML2Text()
        h.ignore_images = True
        h.handle(item['title']).strip()
        h.handle(item['category']).strip()
        write_to_csv(item)
        return item
Spider code:
import scrapy
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from tutorial.items import TutorialItem

class tuto(CrawlSpider):
    name = "tuto"
    allowed_domains = ['emedicine.medscape.com']
    start_urls = ["http://emedicine.medscape.com"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@id="browsespecialties"]'), callback='follow_pages', follow=True),
    )

    def follow_pages(self, response):
        for sel in response.xpath('//div[@class="maincolbox"]//a/@href').extract():
            yield Request("http://emedicine.medscape.com/" + sel, callback=self.parse_item)

    def parse_item(self, response):
        item = TutorialItem()
        item['url'] = response.url
        item['background'] = response.xpath('//div[@class="refsection_content"]').extract()
        item['title'] = response.xpath('//h1').extract()
        yield item
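The problem is that the pipeline code never assigns the result of the HTML-to-text conversion back to the item. To update the item, change the conversion lines to: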
...
item['title'] = h.handle(item['title']).strip()
item['category'] = h.handle(item['category']).strip()
...
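For context, here is a minimal sketch of how the whole pipeline might look with that fix applied. It rests on a few assumptions that are not in the original code: because `.extract()` returns a list of strings, the hypothetical helper `html_to_text` joins the fragments before converting them; a small standard-library fallback (`_TagStripper`, also hypothetical) is used when `html2text` is not installed; and the CSV writing step is left out for brevity.

```python
from html.parser import HTMLParser

try:
    import html2text  # third-party library used in the question
except ImportError:  # keep the sketch runnable without it
    html2text = None


class _TagStripper(HTMLParser):
    """Stdlib fallback (an assumption, not part of the original code)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collect only the text between tags.
        self.chunks.append(data)


def html_to_text(fragments):
    """Convert an HTML string, or a list of HTML fragments, to plain text."""
    if isinstance(fragments, (list, tuple)):
        html = " ".join(fragments)
    else:
        html = fragments
    if html2text is not None:
        h = html2text.HTML2Text()
        h.ignore_images = True
        return h.handle(html).strip()
    stripper = _TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.chunks).strip()


class TutorialPipeline(object):
    def process_item(self, item, spider):
        # Store the converted text back on the item; calling handle()
        # without keeping its return value leaves the item unchanged.
        for field in ("title", "category"):
            if field in item:
                item[field] = html_to_text(item[field])
        return item
```

The `if field in item` check skips fields the spider did not populate, which matters here because the spider shown in the question fills `background` rather than `category`.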
You don't need to parse the HTML in the pipeline at all. Extract the text() of the elements in the spider instead.
Replace:
item['background'] = response.xpath('//div[@class="refsection_content"]').extract()
item['title'] = response.xpath('//h1').extract()
with:
item['background'] = response.xpath('//div[@class="refsection_content"]/text()').extract()
item['title'] = response.xpath('//h1/text()').extract()
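Note that `.extract()` still returns a list of strings, one per matching text node, and that `/text()` selects only the direct child text nodes of an element (use `//text()` if the text sits in nested tags). The fields may therefore need joining and whitespace cleanup before they are written out. A minimal sketch of one way to do that, using a hypothetical helper named `clean_text` and a made-up sample value:

```python
import re


def clean_text(fragments):
    """Join extracted text nodes and collapse runs of whitespace."""
    joined = " ".join(fragments)
    return re.sub(r"\s+", " ", joined).strip()


# Made-up example of the kind of list .extract() returns for //h1/text()
title = clean_text(["\n  Example ", "Title\n"])
print(title)
```

In the spider this could be applied as `item['title'] = clean_text(response.xpath('//h1/text()').extract())`.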