Correctly scraping and displaying Japanese characters with Python, Django, BeautifulSoup, and Curl


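I'm trying to scrape a Japanese-language page using Python, curl, and BeautifulSoup. I then save the text to a MySQL database that uses UTF-8 encoding and display the resulting data with Django.

Here is an example URL:

I have a function that fetches the HTML as a string: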

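# Imports assumed (Python 2); the original post did not show them.
from StringIO import StringIO
from pycurl import Curl

def get_html(url):
    c = Curl()
    storage = StringIO()  # collects the raw response body
    c.setopt(c.URL, str(url))
    cookie_file = 'cookie.txt'
    c.setopt(c.COOKIEFILE, cookie_file)  # read cookies from this file
    c.setopt(c.COOKIEJAR, cookie_file)   # write cookies back on close
    c.setopt(c.WRITEFUNCTION, storage.write)
    c.perform()
    c.close()
    return storage.getvalue()  # a byte string, not yet decoded to unicode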

I then pass it to BeautifulSoup:

html = get_html(str(scheduled_import.url))
soup = BeautifulSoup(html)

The result is then parsed and saved to the database. I then use Django to output the data as JSON. Here is the view I'm using:

def get_jobs(request):
    jobs = Job.objects.all().only(*fields)
    joblist = []
    for job in jobs:
        job_dict = {}
        for field in fields:
            job_dict[field] = getattr(job, field)
        joblist.append(job_dict)
    return HttpResponse(dumps(joblist), mimetype='application/javascript')
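For reference, the serialization step in the view above can be sketched with the standard library alone. This is a minimal Python 3 illustration, not the poster's code: the sample row and the helper name to_json_bytes are assumptions. By default json.dumps escapes non-ASCII characters as \uXXXX sequences, which is still valid JSON; with ensure_ascii=False the characters stay literal and the output has to be encoded explicitly.

```python
import json

# Hypothetical row standing in for the Job queryset; the value is text (str), not bytes.
rows = [{"title": "職務内容"}]

def to_json_bytes(data):
    """Serialize to UTF-8 encoded JSON, keeping non-ASCII characters literal."""
    return json.dumps(data, ensure_ascii=False).encode("utf-8")

escaped = json.dumps(rows)     # ASCII-only text with \uXXXX escapes
literal = to_json_bytes(rows)  # UTF-8 bytes containing the characters themselves

# Both forms parse back to the same data.
assert json.loads(escaped) == rows
assert json.loads(literal.decode("utf-8")) == rows
```

If the literal form is returned from a view, the response content type should declare the encoding, for example 'application/json; charset=utf-8'.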

The resulting page shows byte escape sequences such as:

\xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88

\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9
\xe3\x81\xaf\xe3\x80\x81\xe4\xba\xba\xe3\x82\x92\xe4\xb8\xad\xe5\xbf\x83\xe3\x81\xa8\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3...

instead of Japanese.

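I've been researching this all day: I converted my DB to UTF-8, and I tried decoding the text from ISO-8859-1 and then encoding it as UTF-8.

Basically, I don't know what I'm doing, and I'd appreciate any help or advice so I can avoid spending another day trying to figure this out.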
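The ISO-8859-1 round trip described in the question can be reproduced in a few lines. The following is a minimal Python 3 sketch, assuming the scraped bytes really are UTF-8; the sample string is taken from the escaped output in the question, and the variable names are illustrative. It shows why decoding as ISO-8859-1 and re-encoding as UTF-8 makes things worse, and how a wrong decode can be reversed.

```python
original = "職務内容"
utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 never raises, because every byte value
# maps to some character; the result is mojibake rather than an error.
mojibake = utf8_bytes.decode("iso-8859-1")

# Re-encoding the mojibake as UTF-8 doubles the damage: each of the
# non-ASCII bytes in this sample becomes a two-byte sequence.
double_encoded = mojibake.encode("utf-8")

# The wrong decode can be undone by reversing it, then decoding as UTF-8.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

assert mojibake != original
assert repaired == original
```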

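The examples you posted are somehow the ASCII representation of the string. You need to convert it to a Python unicode string. You can usually do this with the string's decode method, as in the line below. If you're not sure which codec is the right one, just experiment in the Python console.

Try

my_new_string = my_string.decode('utf-8')

to get a Python unicode string. This should display correctly in a Django template, can be saved to the DB, and so on. You can also try, for example,

print my_new_string

and see that it outputs the Japanese characters.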
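The same step in Python 3, where byte strings and text are separate types, might look like the following. This is a small sketch, not part of the original answer; the bytes are copied from the second sample line in the question.

```python
# Bytes copied from the escaped output shown in the question.
raw = b"\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9"

# bytes -> str (the Python 2 equivalent is str -> unicode).
text = raw.decode("utf-8")

assert text == "職務内容"
assert len(raw) == 12 and len(text) == 4  # three bytes per character here
```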

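You forgot to tell Beautiful Soup the encoding. Get it from the response headers.

I believe BeautifulSoup sets the encoding automatically from the page's meta tag; according to this, "a tag can specify the encoding of the document", and soup.originalEncoding outputs "iso-8859-1".

You are assuming the page has a meta tag to read.

In this case it does, which I should have mentioned.

No luck with this. When I print in the console, I can get the raw HTML string to display the Japanese characters, but the best I can get from the BeautifulSoup output is funky characters such as "ã€ãƒ¸‚·ãƒ§ãƒ€
ã€€ãƒ人äº‹äº‹å‹™æ‹…å½“
æ€æ‰€åžƒ署å??"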
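The suggestion to take the encoding from the response headers can be sketched with the standard library. This is a minimal Python 3 illustration under stated assumptions: the helper names charset_from_content_type and decode_body are hypothetical, the header value is a made-up example, and capturing the header from curl is left out. Decoding the body to text before parsing means the parser no longer has to guess the encoding.

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)

def decode_body(body, content_type):
    """Decode a raw response body using the charset the server declared."""
    return body.decode(charset_from_content_type(content_type))

# Hypothetical response: UTF-8 bytes plus the header a server might send.
body = "職務内容".encode("utf-8")
text = decode_body(body, "text/html; charset=UTF-8")

assert text == "職務内容"
```

The fallback to UTF-8 when the header carries no charset is a design choice for this sketch; a page may still declare a different encoding in its own markup.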