Writing an ASCII file from a Unicode web scrape in Python


python unicode


I'm new to Python programming. I'm using the following code in my Python file:

import gethtml
import articletext
url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
result = articletext.getArticle(url)
text_file = open("Output.txt", "w")
text_file.write(result)
text_file.close()

The file articletext.py contains the following code:

from bs4 import BeautifulSoup
import gethtml
def getArticleText(webtext):
    articletext = ""
    soup = BeautifulSoup(webtext)
    for tag in soup.findAll('p'):
        articletext += tag.contents[0]
    return articletext

def getArticle(url):
    htmltext = gethtml.getHtmlText(url)
    return getArticleText(htmltext)

But I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 473: ordinal not in range(128)

What code should I write to get the result into the output file?

The output `result` is text in the form of a paragraph.

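This should work, give it a try.

Why? Because everything is saved as bytes in UTF-8, which sidesteps these kinds of encoding errors :D

Edit: Make sure the file exists in the same folder. Otherwise, put this code after the imports and it will create the file itself: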

text_filefixed = open("Output.txt", "a")
text_filefixed.close()

It creates the file, saves nothing, and closes it... but the file is created automatically, with no manual interaction needed.
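To check what that snippet does, here is a minimal sketch. It assumes a throwaway temporary folder instead of the script's own folder; the names `folder`, `path`, `existed_before`, `exists_after` and `size_after` are only for illustration and are not part of the original code. Opening a missing file in mode "a" (the same holds for "w" and "wb") creates it.

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch real files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "Output.txt")

existed_before = os.path.exists(path)  # nothing has been created yet

# Mode "a" creates the file when it is missing, just like "w" and "wb".
text_filefixed = open(path, "a")
text_filefixed.close()

exists_after = os.path.exists(path)
size_after = os.path.getsize(path)  # nothing was written, so the file is empty

print(existed_before, exists_after, size_after)
```

Since "w" and "wb" also create the file, the later open(..., "wb") call would likely create Output.txt on its own as well.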

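Edit 2: Note that this only works on 3.3.2, but I know you can achieve the same thing in 2.7 with this module. One minor difference (I think) is that `request` isn't needed in 2.7, but you should check.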

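from urllib import request

url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
# read() returns bytes on Python 3; str() on bytes would store the literal "b'...'" form,
# so decode instead (assuming the page is UTF-8; undecodable bytes are replaced).
result = request.urlopen(url).read().decode('UTF-8', 'replace')
text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8'))
text_filefixed.close()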

As I suspected, you will run into this error on 2.7.

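To fix the Unicode error, we need to encode the text as UTF-8 rather than ASCII. To make sure nothing is raised if an encoding error does occur, we ignore any characters that have no mapping. (You can also use 'replace' or any of the other options that str.encode offers.)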
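As a rough illustration of those error-handler options, here is a small sketch. The sample string is made up for the example; it contains the same curly quote (u'\u201c') that the traceback complains about. ASCII is used as the target for three of the calls so that the handlers have something to do; UTF-8 can represent these characters directly.

```python
# Made-up sample text with the curly quotes from the error message.
text = u'\u201cresult-oriented\u201d steps'

# 'ignore' silently drops characters the target encoding cannot represent.
ignored = text.encode('ascii', 'ignore')

# 'replace' substitutes a question mark for each such character.
replaced = text.encode('ascii', 'replace')

# 'xmlcharrefreplace' writes a numeric character reference instead.
escaped = text.encode('ascii', 'xmlcharrefreplace')

# UTF-8 keeps the characters, so no handler is needed here.
utf8 = text.encode('UTF-8')

print(ignored)
print(replaced)
print(escaped)
print(utf8)
```

If the output file really has to be pure ASCII, one of the first three forms is the likely choice; if the curly quotes should survive, UTF-8 is the safer one.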

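The best practice for opening files is to use a Python context manager, which closes the file even if an error occurs. I use forward slashes rather than backslashes in the path so that it works on both Windows and Unix/Linux.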

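text = text.encode('UTF-8', 'ignore')
# encode() returns bytes, so open the file in binary mode (works on Python 2 and 3)
with open('/temp/Out.txt', 'wb') as file:
    file.write(text)

This is equivalent to

text = text.encode('UTF-8', 'ignore')
# open before the try block, so close() only ever runs on a file that was opened
file = open('/temp/Out.txt', 'wb')
try:
    file.write(text)
finally:
    file.close()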

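However, the context manager is less verbose and much less likely to leave the file locked if an error occurs partway through.

I get the following error when I try it:

Traceback (most recent call last):
  File "C:/Python 27 /Currn/Man.Py", line 7, in <module>
    text_filefixed.write(bytes(result, 'UTF-8'))
TypeError: str() takes at most 1 argument (2 given)

Oh, you're using Python 2.7. My code works in 3.3.2. It probably needs to be adapted and... honestly, I don't know how. If you print it, do you get a working string? Maybe try writing str(result).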
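For the Python 2.7 case raised in the comments, one possible approach is `io.open`, which accepts an `encoding` argument on both Python 2.7 and Python 3. The sketch below is a hedged illustration, not the original poster's code: the sample `result` string and the temporary `path` are assumptions standing in for the scraped article and for Output.txt.

```python
import io
import os
import tempfile

# Sample text standing in for the scraped article (hypothetical content).
result = u'\u201cIndia calls for result-oriented steps\u201d'

# Hypothetical output location; the original post writes to "Output.txt".
path = os.path.join(tempfile.mkdtemp(), "Output.txt")

# io.open takes an encoding on Python 2.7 and Python 3 alike,
# so the unicode text is encoded as UTF-8 while it is written.
with io.open(path, "w", encoding="utf-8") as text_file:
    text_file.write(result)

# Read the file back to confirm the curly quotes survived the round trip.
with io.open(path, "r", encoding="utf-8") as text_file:
    round_trip = text_file.read()

print(round_trip == result)
```

On Python 2.7 the text passed to write() has to be a unicode object, which is what the BeautifulSoup-based getArticleText in the question appears to return, judging by the u'\u201c' in the error message.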
