Python: need to convert all text to plain text/ASCII (I think?)

Tags: python, encoding, mechanize

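I'm trying to scrape a story from a website I work for when you enter its URL, and then post it to our various news partners. The problem is that some special characters seem to trip things up. I've been trying to fix them with .replace() on the strings, but that doesn't work very well.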

Is there a way to force the output to be entirely regular text that can be posted anywhere? That is, with no special characters at all?

My current code is:

from __future__ import division
#from __future__ import unicode_literals
from __future__ import print_function
import spynner
from mechanize import Browser
import SendKeys
from BeautifulSoup import BeautifulSoup

br = Browser()
url = "http://www.benzinga.com/trading-ideas/long-ideas/11/07/1815251/bargain-hunting-for-mid-caps-five-stocks-worth-taking-a-look-"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)

artcontent = soup.find('div', {'class': 'article-content'})

title = artcontent.find('h1', {'id': 'title'})

title = title.string

try:
    title = title.replace("&#039;", "'")
except:
    pass

authorname = artcontent.find('div', {'class': 'node full'})
authorname = authorname.find('div', {'class': 'article-submitted'})
authorname = authorname.find('div', {'class': 'info'})
authorname = authorname.find('a')
authorname = authorname.string

story = artcontent.find('div', {'class': 'node full'})
story = story.find('div', {'class': 'content clear-block'})
story = story.findAll('p', {'class': None})

#story = [str(x).replace("<p>","\n\n").replace("</p>","") for x in story]

story = [str(x) for x in story]

storyunified = ''.join(story)

#try:
#    storyunified = storyunified.strip("\n")
#except:
#    pass
#try:
#    storyunified = storyunified.strip("\n")
#except:
#    pass

#print(storyunified)

try:
    storyunified = storyunified.replace("Â", "")
except:
    pass

try:
    storyunified = storyunified.replace("â€", "\'")
except:
    pass

try:
    storyunified = storyunified.replace('“', '\"')
except:
    pass

try:
    storyunified = storyunified.replace('"', '\"')
except:
    pass

try:
    storyunified = storyunified.replace('”', '\"')
except:
    pass

try:
    storyunified = storyunified.replace("âﾀ", "")
except:
    pass

try:
    storyunified = storyunified.replace("â€", "")
except:
    pass

As you can see, I'm trying to scrub the characters out by hand, but it doesn't always work.

I then try to post the result with Spynner, but I don't think that part of the code matters here. I'm posting to a Forbes blog.

I was struggling with character encoding in Python just the other day.

Try this:

import unicodedata

storyunified = unicodedata.normalize('NFKD', storyunified).encode('ascii','ignore').decode("ascii")

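One caveat: this removes the offending characters rather than replacing them. To change that behavior, you can change 'ignore' to 'replace', but I haven't tested that.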
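
To make the difference between 'ignore' and 'replace' concrete, here is a small sketch written for Python 3. The helper name to_ascii, the sample string, and the PUNCTUATION table are made up for illustration and are not part of the answer above. Note that in Python 2 unicodedata.normalize needs a unicode object, so a byte string would have to be decoded first.

```python
import unicodedata

def to_ascii(text, errors="ignore"):
    """Decompose text with NFKD, then encode to ASCII with the given error handler."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors).decode("ascii")

# Sample text: "café" followed by a word in curly double quotes.
sample = u"caf\u00e9 \u201cquoted\u201d"

# 'ignore' silently drops anything that has no ASCII form,
# so the accent and both curly quotes disappear.
ignored = to_ascii(sample, "ignore")

# 'replace' substitutes '?' for each unencodable code point,
# including the combining accent that NFKD split off from the 'e'.
replaced = to_ascii(sample, "replace")

# Hypothetical lookup table: map common "smart" punctuation to ASCII
# before encoding, so quotes survive instead of being dropped.
PUNCTUATION = {
    0x2018: u"'", 0x2019: u"'",   # single curly quotes
    0x201C: u'"', 0x201D: u'"',   # double curly quotes
}
mapped = to_ascii(sample.translate(PUNCTUATION))

print(ignored)    # cafe quoted
print(replaced)   # cafe? ?quoted?
print(mapped)     # cafe "quoted"
```

With 'replace', each accented letter picks up a trailing '?', which may look worse than dropping the accent. Translating the punctuation you care about first and then using 'ignore' is one possible compromise.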
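Please read this article and see whether you are already familiar with the principles it discusses:
My hunch is that your news partners can accept text outside the ASCII range. You just need to make sure your application handles strings and byte strings correctly, and everything should work.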
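In Python 2.x, 'this text' is a byte string and u'this text' is a (Unicode) string. In Python 3.x, 'this text' is a string and b'this text' is a byte string. Byte strings have a decode(encoding) method, and strings have an encode(encoding) method.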
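
As a rough illustration of the decode/encode distinction, the sketch below (Python 3) assumes the page is served as UTF-8, which has not been verified for the site in the question. It shows how decoding with the wrong codec yields the kind of "â€" debris the question is trying to strip out, and how decoding correctly avoids it. The variable names are made up for the example.

```python
# Bytes as they might come back from page.read(), assuming a UTF-8 page.
# The text is "It's" with a typographic apostrophe (U+2019).
raw = u"It\u2019s".encode("utf-8")

# Correct: turn bytes into text once, using the codec the page actually uses.
text = raw.decode("utf-8")

# Wrong codec: each byte of the three-byte apostrophe becomes its own character.
garbled = raw.decode("cp1252")

# When the mistake was a single wrong decode, it can be reversed
# by re-encoding with the wrong codec and decoding with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")

# Encode back to bytes only at the boundary, when sending the story on.
outgoing = text.encode("utf-8")

print(repr(garbled))        # 'Itâ€™s'
print(repaired == text)     # True
print(outgoing == raw)      # True
```

If the text is decoded once on the way in and encoded once on the way out, there is usually nothing left to clean up with .replace().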
Good luck!