Scrapy: replacing unwanted non-ASCII characters with a pipeline


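In the Scrapy results, the titles contain an unwanted non-ASCII code \u2013 (also known as character(150) or en-dash), for example u'Director/Senior Director\u2013 Physical'. I'm trying to use a pipeline to replace the \u2013 with a regular comma (,). But the code below doesn't work, and no error message is reported either.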

from datetime import datetime
from hashlib import md5
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
import re
import string

class ReplaceASC2InTitlePipeline(object):
    """replace unwanted ASCII characters in titles"""

    ascii_to_filter = ["\u2013",]

    def process_item(self, item, spider):
        for word in self.ascii_to_filter:
            desc = item.get('title')

            if (desc) and word in desc:
                spider.log("\u2013 in '%s' was replace" % (item['title']) )

                item['title']=item['title'].replace("\u2013", ",")
                return item
        else:
            return item
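"\u2013" should be a unicode string. In Python 2 a plain "..." literal is a byte string, in which \u is not an escape sequence, so "\u2013" is six separate characters and never matches the en-dash in the title. So just replace:

ascii_to_filter = ["\u2013",]

with:

ascii_to_filter = [u"\u2013",]

The "\u2013" literal passed to replace() further down needs the same u prefix (or pass word instead), otherwise the replacement itself still finds nothing.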
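To see why the original membership test never succeeds, here is a small sketch. It assumes Python 3, where every string literal is already Unicode, so the Python 2 byte-string behaviour is imitated with a raw string. The names `EN_DASH` and `PY2_STYLE_LITERAL` are illustrative and not part of the original code.

```python
# Python 3 sketch: a raw string stands in for the Python 2 literal "\u2013",
# where \u is not interpreted as an escape sequence.
EN_DASH = "\u2013"             # one character: U+2013 EN DASH
PY2_STYLE_LITERAL = r"\u2013"  # six characters: backslash, u, 2, 0, 1, 3

title = "Director/Senior Director\u2013 Physical"

print(len(EN_DASH), len(PY2_STYLE_LITERAL))  # 1 6
print(PY2_STYLE_LITERAL in title)            # False: what the original test amounted to
print(EN_DASH in title)                      # True

# Replacing the real en-dash gives the title the question asks for.
fixed_title = title.replace(EN_DASH, ",")
print(fixed_title)                           # Director/Senior Director, Physical
```

Because the six-character literal is simply absent from the title, the `if` branch is skipped silently, which matches the "no error message" symptom described in the question.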


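After reading a Stack Overflow post, I came up with the code below, which filters out all non-ASCII characters in the title. In my case non-ASCII characters are not needed, so it works well for me.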

from datetime import datetime
from hashlib import md5
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
import re
import string

class ReplaceASC2InTitlePipeline(object):
    """replace unwanted non-ASCII characters in titles"""

    def process_item(self, item, spider):

        def remove_non_ascii(text):
            # keep only code points below 128, i.e. plain ASCII
            return ''.join(i for i in text if ord(i) < 128)

        orig_titl = item.get('title')
        if orig_titl:  # skip items with a missing or empty title
            item['title'] = remove_non_ascii(orig_titl)

            if item['title'] != orig_titl:
                spider.log("Non-ASCII character(s) removed from '%s'" % item['title'])

        return item
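Note that the filter above deletes the en-dash rather than turning it into a comma, so the example title loses its separator. A possible variation, sketched here for Python 3 without importing Scrapy, first substitutes known characters and then drops whatever non-ASCII remains. `REPLACEMENTS`, `clean_title` and `CleanTitlePipeline` are hypothetical names introduced only for this illustration.

```python
# Hypothetical mapping of characters to ASCII stand-ins; extend as needed.
REPLACEMENTS = {"\u2013": ","}


def clean_title(text):
    """Swap known characters for ASCII stand-ins, then drop other non-ASCII."""
    for char, substitute in REPLACEMENTS.items():
        text = text.replace(char, substitute)
    # encode(..., "ignore") silently discards anything outside ASCII
    return text.encode("ascii", "ignore").decode("ascii")


class CleanTitlePipeline:
    """Illustrative pipeline using the process_item(item, spider) signature shown above."""

    def process_item(self, item, spider):
        title = item.get("title")
        if title is not None:  # leave items without a title untouched
            item["title"] = clean_title(title)
        return item


item = CleanTitlePipeline().process_item(
    {"title": "Director/Senior Director\u2013 Physical"}, spider=None
)
print(item["title"])  # Director/Senior Director, Physical
```

A plain dict stands in for the Scrapy item here so the sketch runs on its own; whether silently dropping the remaining characters is acceptable depends on the data.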
I'm confused by the else part. If it is a for…else clause, the for block would normally contain a break. Or is it an indentation error?

The code was adapted from some code I found on GitHub that was used to drop unwanted items. But I don't have much experience with Python.
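On the for…else point: the else block of a for loop runs only when the loop finishes without hitting break. A minimal sketch, using a hypothetical helper `first_match` that is not part of the original pipeline:

```python
def first_match(words, text):
    """Return the first word found in text, or None if none is present."""
    for word in words:
        if word in text:
            found = word
            break          # leaving via break skips the else block
    else:
        found = None       # reached only when the loop ended without break
    return found


print(first_match(["x", "b"], "abc"))  # b
print(first_match(["x"], "abc"))       # None
```

In the question's pipeline the loop is left with `return item` rather than break, so the else branch is reached only when nothing matched. Placing a single `return item` after the loop should behave the same way and is probably easier to read.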