Python: reading web pages in various languages, such as Russian and Korean


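Hi all,

For my research project, I have collected some web pages.

For example,

As you can see on the page above, the committer's name is not in English.

Other pages also have committer names written in various languages other than English.

The following code processes the committer's name: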

import csv
import re
import urllib

def get_page (link):
    k = 1
    while k == 1:
        try:
            f = urllib.urlopen (link)
            htmlSource = f.read()
            return htmlSource
        except EnvironmentError:
            print ('Error occurred:', link)
        else:
            k = 2
    f.close()

def get_commit_info (commit_page):
    commit_page_string = str (commit_page)


    author_pattern = re.compile (r'<tr><th>author</th><td>(.*?)</td><td class=', re.DOTALL)
    t_author = author_pattern.findall (commit_page_string)

    t_author_string = str (t_author)
    author_point = re.search (" &lt;", t_author_string)
    author = t_author_string[:author_point.start()]

    print author

git_url = "http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3"
commit_page = get_page (git_url)
get_commit_info (commit_page)

The result of "print author" is as follows:

\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 \xd0\x9d\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b

How can I print the name correctly?

Hmm... this should do what you want:

author = 'Мирослав Николић'
print author.decode('utf8') # Мирослав Николић

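But if the encoding isn't UTF-8, that won't work.

Most things use UTF-8. Mostly.

Unicode is a complicated thing to get your head around. author is a string object that contains bytes. Nothing in those bytes tells you what they represent. Absolutely nothing. You have to tell Python that this byte string is text encoded as UTF-8. Python then looks up each byte sequence it encounters in the UTF-8 table to find the Unicode code point, and so the character, that it represents.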
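To make that concrete, here is a minimal sketch, assuming Python 3 (where bytes and text are separate types) and assuming the page really is UTF-8. The variable names are only illustrative. It also hints at where the \x.. escapes may come from: turning a list into a string shows the repr of its items, escapes and all.

```python
# The raw bytes from the question's output (UTF-8 encoded Cyrillic text).
raw = (b"\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 "
       b"\xd0\x9d\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b")

# Decoding tells Python how to map the byte sequences to Unicode code points.
name = raw.decode("utf-8")
print(name)

# Every Cyrillic letter here takes two bytes in UTF-8, so the lengths differ.
print(len(raw), len(name))

# str() on a list shows the repr of each item, with non-ASCII bytes escaped.
# Pick the item out of the list and decode it instead of stringifying the list.
as_list_text = str([raw])
print(as_list_text)
```

If the bytes were in some other encoding, the decode call would either raise UnicodeDecodeError or quietly produce the wrong characters, which is why the encoding has to be known rather than guessed.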

You can detect each page's encoding by looking at its meta tag. In HTML5 it looks like this:

<meta charset="utf-8">
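A rough sketch of that idea, assuming Python 3 and that the declaration sits near the top of the page. The helper names sniff_charset and decode_page are made up for this example, and the pattern is a simplification that only scans the raw bytes for a charset declaration, not a full HTML parse.

```python
import re

# Matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)",
                          re.IGNORECASE)


def sniff_charset(html_bytes, default="utf-8"):
    """Return the charset declared in a meta tag, or the default if none is found."""
    match = META_CHARSET.search(html_bytes[:2048])
    if match:
        return match.group(1).decode("ascii")
    return default


def decode_page(html_bytes, default="utf-8"):
    """Decode a page using its declared charset, falling back to the default."""
    encoding = sniff_charset(html_bytes, default)
    try:
        return html_bytes.decode(encoding, errors="replace")
    except LookupError:
        # The page named an encoding Python does not know about.
        return html_bytes.decode(default, errors="replace")
```

For instance, decode_page(page_bytes) on a page that declares koi8-r decodes it as KOI8-R rather than assuming UTF-8. A page can declare the wrong charset, so treat the result as a best guess.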

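You are parsing HTML with regular expressions, which is almost always a bad idea. Use a library like BeautifulSoup, which understands Unicode. This isn't codereview.SE, but... the business with k is probably a bad idea too. If you really can recover from an EnvironmentError by looping forever, use while True: with a break in the else: clause instead of introducing a new variable.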
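Both suggestions can be sketched roughly as follows, assuming Python 3. To keep the example free of third-party installs it uses the standard library's html.parser as a stand-in for BeautifulSoup. AuthorCellParser, extract_author, fetch_with_retry and the max_attempts cap are all names and choices invented for this sketch, and the table layout is assumed from the pattern in the question.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Collect the text of the first <td> that follows <th>author</th>."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True turns &lt; into <
        self._in_th = False
        self._saw_author_header = False
        self._in_author_td = False
        self.author_cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
        elif tag == "td" and self._saw_author_header and self.author_cell is None:
            self._in_author_td = True
            self.author_cell = ""

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False
        elif tag == "td":
            self._in_author_td = False

    def handle_data(self, data):
        if self._in_th and data.strip().lower() == "author":
            self._saw_author_header = True
        elif self._in_author_td:
            self.author_cell += data


def extract_author(html_text):
    """Return the author's name from already-decoded HTML, or None if absent."""
    parser = AuthorCellParser()
    parser.feed(html_text)
    parser.close()
    if parser.author_cell is None:
        return None
    # The cell is assumed to look like "Name <email>"; keep the part before "<".
    return parser.author_cell.split("<", 1)[0].strip()


def fetch_with_retry(fetch, max_attempts=3):
    """Call fetch() until it succeeds, re-raising after max_attempts failures."""
    attempt = 0
    while True:
        attempt += 1
        try:
            data = fetch()
        except EnvironmentError as exc:  # alias of OSError in Python 3
            if attempt >= max_attempts:
                raise
            print("Error occurred, retrying:", exc)
        else:
            break
    return data
```

Here fetch is any zero-argument callable that returns the page bytes, for example a small wrapper around urllib.request.urlopen(url).read(). Capping the attempts is a deliberate departure from an unbounded loop, so a permanently dead link cannot hang the script.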