Python Scrapy pipeline HTML parsing

python scrapy

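I have a spider with three item fields: url, title, and category.

They load fine as raw HTML, but now I want to use html2text in the pipeline to convert the title and category to plain text.

Here is my pipeline code, which does not work. Could someone help me debug it?

Thanks, everyone.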

import html2text
import csv
from tutorial import settings

def write_to_csv(item):
    writer = csv.writer(open(settings.csv_file_path, 'a'), lineterminator='\n')
    writer.writerow([item[key] for key in item.keys()])


class TutorialPipeline(object):
    def process_item(self, item, spider):
        h = html2text.HTML2Text()
        h.ignore_images = True
        h.handle(item['title']).strip()
        h.handle(item['category']).strip()
        write_to_csv(item)
        return item

Spider code:

import scrapy
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor
from tutorial.items import TutorialItem

class tuto(CrawlSpider):
    name = "tuto"
    allowed_domains = ['emedicine.medscape.com']
    start_urls=["http://emedicine.medscape.com"]
    rules=(
        Rule( LinkExtractor(restrict_xpaths ='//div[@id="browsespecialties"]'),callback='follow_pages', follow=True),
    )
    def follow_pages(self, response):
        for sel in response.xpath('//div[@class="maincolbox"]//a/@href').extract():
            yield Request("http://emedicine.medscape.com/" + sel, callback = self.parse_item)

    def parse_item(self, response):
        item = TutorialItem()
        item['url'] = response.url
        item['background'] = response.xpath('//div[@class="refsection_content"]').extract()
        item['title'] = response.xpath('//h1').extract()
        yield item

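The problem is that the pipeline code never assigns the result of the HTML-to-text conversion. h.handle(...).strip() returns a new string and leaves the item untouched, so the converted text is thrown away. To update the item, change the conversion part to: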

...
item['title'] = h.handle(item['title']).strip()
item['category'] = h.handle(item['category']).strip()
...
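
To see why the assignment matters, here is a minimal sketch that runs without Scrapy or html2text. The to_text helper is a hypothetical stand-in for html2text.HTML2Text().handle, built on the standard library's html.parser, and the item is a plain dict rather than a TutorialItem. It also assumes the field may arrive as a list, since the spider fills it with .extract(), so clean_field joins the list before converting.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_text(html):
    # Hypothetical stand-in for html2text.HTML2Text().handle(html).
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def clean_field(value):
    # .extract() returns a list of strings, so join it before converting.
    if isinstance(value, (list, tuple)):
        value = " ".join(value)
    return to_text(value).strip()


# A plain dict standing in for the Scrapy item (assumption for illustration).
item = {"url": "http://example.com", "title": ["<h1> Asthma </h1>"]}

# Calling the converter without assignment leaves the item untouched.
clean_field(item["title"])
unchanged = item["title"]

# Assigning the result back is what actually updates the item.
item["title"] = clean_field(item["title"])
```

In the real pipeline the same pattern would apply inside process_item, with h.handle in place of to_text.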

You don't need to parse the HTML in the pipeline.

Extract the text() of the elements in the spider instead.

Replace:

item['background'] = response.xpath('//div[@class="refsection_content"]').extract()
item['title'] = response.xpath('//h1').extract()

with:

item['background'] = response.xpath('//div[@class="refsection_content"]/text()').extract()
item['title'] = response.xpath('//h1/text()').extract()
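
One caveat: in XPath, /text() selects only the text nodes that are direct children of the element, so text inside nested tags such as <b> or <a> is skipped, while //text() selects every descendant text node. The sketch below illustrates that distinction with the standard library's xml.etree.ElementTree. This is a stand-in chosen so the example runs without Scrapy, not the Scrapy selector API, and the helper names and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def direct_text_nodes(elem):
    # Text that is an immediate child of elem, comparable to XPath "/text()".
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


def all_text_nodes(elem):
    # Text at any depth below elem, comparable to XPath "//text()".
    return list(elem.itertext())


# Hypothetical heading with a nested tag.
h1 = ET.fromstring("<h1>Asthma <b>in Children</b> Overview</h1>")

direct = direct_text_nodes(h1)      # misses the text inside <b>
everything = all_text_nodes(h1)     # includes the text inside <b>

# Joining the pieces gives a single string suitable for a CSV cell.
title = "".join(everything).strip()
```

If the pages being scraped do nest markup inside the heading or the content div, //h1//text() combined with a join is probably closer to what is wanted than //h1/text().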