Python Scrapy encoding problem

python encoding scrapy

I am trying to use Scrapy to scrape this site:

Here is the function that returns the final items I export from my spider:

def parse_post(self, response):
    theitems = []
    pubs = response.xpath("//div[@id='pubs']/ul/li/a")
    for i in pubs:
        item = FspeopleItem()
        name = str(response.xpath("//div[@id='maincol']/h1/text() | //nobr/text()").extract()).strip()
        pub = str(i.xpath("@title").extract()).strip() 
        item['link'] = response.url
        item['name'] = name
        item['pub'] = pub
        theitems.append(item)
    return theitems

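For some reason, the returned "theitems" always shows accented characters (such as the í in Díaz) as blanks. I can't figure out why. When I open a Scrapy shell and print the result of each XPath separately, it prints to the console just fine, but when it is printed from the returned "theitems", it turns into a blank. I have tested this in both Python 2.7 and 3.5.

I am new to Scrapy, to encoding, and to Python. Apart from this encoding problem, though, everything works. Does anyone know why this happens?

Thanks, everyone.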

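///////EDIT////////

Thanks for the suggestions. When I use the code below with .encode("utf-8"), the accented characters still look funky in the file I write. So I checked the encoding of the site I am scraping and found that it uses ISO-8859-1. I then tried .encode("ISO-8859-1").

When I open the .csv (everything is nicely formatted), it shows the accented characters correctly. However, when I do this, about 25% of the pages are not scraped: the csv has about 1400 entries instead of 2100. I can't figure out why it scrapes some pages and not others.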

import scrapy

from fspeople.items import FspeopleItem

class FSSpider(scrapy.Spider):
    name = "hola"
    allowed_domains = ["fs.fed.us"]
    start_urls = [
        "http://www.fs.fed.us/research/people/people_search_results.php?employeename=&keywords=&station_id=SRS&state_id=ALL"]

    def __init__(self):
        self.i = 0

    def parse(self, response):
        for sel in response.xpath("//a[@title='Click to view their profile ...']/@href"):
            url = response.urljoin(sel.extract())
            yield scrapy.Request(url, callback=self.parse_post)
        self.i += 1

    def parse_post(self, response):
        theitems = []
        pubs = response.xpath("//div[@id='pubs']/ul/li")
        for i in pubs:
            item = FspeopleItem()
            name = response.xpath("//div[@id='maincol']/h1/text() | //nobr/text()").extract_first().strip().encode("ISO-8859-1")
            pubname = i.xpath("a/text()").extract_first().strip().encode("ISO-8859-1")
            pubauth = i.xpath("text()").extract_first().strip().encode("ISO-8859-1")

            item['link'] = response.url
            item['name'] = name
            item['pubname'] = pubname
            item['pubauth'] = pubauth
            theitems.append(item)
        return theitems

Use extract_first() and encode().
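A minimal sketch of why taking the first match may help, using a plain list to stand in for what .extract() returns. The sample value and the helper name first_text are hypothetical and are not part of the original spider. Calling str() on the whole list stringifies the list itself, brackets and quotes included, so .strip() finds no outer whitespace to remove. Taking the first element keeps the text as an ordinary string.

```python
# Stand-in for response.xpath("...").extract(), which returns a list of strings.
extracted = ["  D\u00edaz  "]  # "  Díaz  "

# What the first version of the spider did: stringify the whole list.
as_list_repr = str(extracted).strip()


def first_text(values, default=""):
    """Hypothetical helper: first element, stripped, or a default if the list is empty."""
    return values[0].strip() if values else default


name = first_text(extracted)

print(as_list_repr)  # the list's repr, not the name itself
print(name)          # the name with surrounding whitespace removed
```

In a spider, extract_first() plays the role of first_text here, except that it returns None rather than an empty string when nothing matches.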

This is an encoding/decoding problem.

As Steve said, it may just be the software used to view the extracted data.

If that is not the case, try removing the str() calls and see what happens, or change them to unicode(). I usually don't use either; I just let the field be filled with whatever comes from response.xpath("...").extract().

Also, make sure everything in your project is UTF-8: the files your code is written in, the settings, and the strings. For example, never write this:

item['name'] = 'First name: ' + name

Write this (unicode!):

item['name'] = u'First name: ' + name

When you say "always shows accented characters", what software are you using to display the output? And what output format are you using?

alecxe, I am still having issues, although your comment partially solved the problem. I just made an edit to the bottom of my original post. If the site is "ISO-8859-1", I should use .encode("ISO-8859-1"), right? However, that does not scrape all of the entries. .encode("utf-8") scrapes all of the entries, but the special characters render strangely in the csv.
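On the entries that go missing with .encode("ISO-8859-1"): one possible cause, not confirmed anywhere in the thread, is that the callback raises an exception for some pages, in which case Scrapy logs the error and no items come out of that response. Two candidates are visible in the spider. First, .encode("ISO-8859-1") raises UnicodeEncodeError for any character outside Latin-1, such as a curly quote. Second, extract_first() returns None when the XPath matches nothing, so the chained .strip() fails. The sketch below reproduces both with plain strings; the sample values and the helper clean are hypothetical.

```python
def clean(value):
    """Hypothetical helper: tolerate a missing match and keep the text as unicode."""
    return value.strip() if value is not None else ""


# A right single quotation mark (U+2019) is not part of ISO-8859-1.
title = "Forest Service\u2019s report"

try:
    title.encode("ISO-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent this character, so encoding succeeds.
utf8_bytes = title.encode("utf-8")

# extract_first() returns None when the XPath matches nothing.
missing = None
try:
    missing.strip()
    strip_ok = True
except AttributeError:
    strip_ok = False

print(latin1_ok, strip_ok, repr(clean(missing)), repr(clean("  D\u00edaz ")))
```

If the accents only look wrong when the CSV is opened in a spreadsheet program, the file itself may be fine and the viewer may simply be guessing the wrong encoding, which matches the suggestion in the answer above.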