Python: scraping a website with Unicode text


python unicode


I am trying to scrape a site with this code:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import urllib, urllib2

    # Build the request and send a browser-like User-Agent header
    req = urllib2.Request('http://some website')
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    body = f.read()  # raw bytes (a Python 2 str), not yet decoded
    f.close()

This is part of the document returned by the read() method.

How can I change the code above to get a result like this?

    Tóm lược diễn tiến Thượng Hội Đồng Giám Mục về Gia Đình

Thanks, everyone.

My problem was solved with mata's suggestion. Here is my code. Thanks to everyone for the help, especially mata.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import urllib, urllib2

    req = urllib2.Request('http://some website')
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    # Turn literal \uXXXX escapes into characters, then re-encode as UTF-8 bytes
    body = f.read().decode('unicode-escape').encode('utf-8')
    f.close()
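For readers on Python 3, here is a minimal sketch of what the `unicode-escape` step does. It assumes the server returned literal `\uXXXX` sequences as ASCII bytes; the sample bytes below are made up to resemble the expected title and are not taken from the actual site.

```python
# Hypothetical sample: the body holds literal backslash-u escapes as ASCII bytes.
raw = b"T\\u00f3m l\\u01b0\\u1ee3c"

# 'unicode-escape' turns each \uXXXX sequence into the character it names.
# Caveat: it reads every non-escape byte as Latin-1, so it only suits bodies
# that are plain ASCII plus escapes.
text = raw.decode("unicode-escape")

# Re-encode as UTF-8 when bytes are needed, e.g. before writing to a file.
utf8_bytes = text.encode("utf-8")
```

If the body already contains real UTF-8 bytes rather than escapes, this codec would garble them, so it is worth looking at the raw bytes first.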

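You need to detect the page's encoding and decode the body with it. Try the chardet library for encoding detection; see its usage help and examples.

Then use it like this: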

    import urllib, urllib2
    import chardet  # third-party encoding detector

    req = urllib2.Request('http://some website')
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    body = f.read()
    f.close()

    code = chardet.detect(body)           # guess the encoding from the raw bytes
    body = body.decode(code['encoding'])  # decode into a unicode string
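One thing to watch: `chardet.detect` returns a dict whose `'encoding'` value can be `None` when it cannot make a guess, and a guess can also be wrong. A rough Python 3 sketch of a more defensive wrapper follows. The helper name `decode_body` and its `detect` and `fallback` parameters are my own, not part of chardet, and the UTF-8 fallback is an assumption.

```python
def decode_body(body, detect=None, fallback="utf-8"):
    """Decode raw bytes with a detected encoding, else with `fallback`."""
    if detect is None:
        try:
            import chardet  # third-party; assumed to be installed
            detect = chardet.detect
        except ImportError:
            # No detector available: report "no guess" so the fallback is used
            detect = lambda _bytes: {"encoding": None}
    guess = detect(body).get("encoding") or fallback
    try:
        return body.decode(guess)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode leniently instead
        return body.decode(fallback, errors="replace")
```

Passing the detector in as an argument is just a convenience so the function can be tried without network access or chardet itself.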

You have to detect the encoding from the page. In most cases, this information is present in the headers of the response.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    import urllib2

    req = urllib2.Request("http://some website")
    req.add_header("User-agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
    f = urllib2.urlopen(req)
    # Read the charset parameter of the Content-Type response header.
    # getparam() returns None when the server sends no charset.
    encoding = f.headers.getparam('charset')
    body = f.read().decode(encoding)  # decode the raw bytes with that encoding
    f.close()
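On Python 3 the same idea might look like the sketch below. It relies on the fact that `response.headers` from `urllib.request.urlopen` is an `email.message.Message` subclass, so `get_content_charset()` is available. The helper name `decode_with_header_charset` and the UTF-8 default are my assumptions, and a hand-built `Message` stands in for a real response so the snippet runs without a network.

```python
from email.message import Message


def decode_with_header_charset(raw, headers, default="utf-8"):
    """Decode `raw` using the charset from the Content-Type header."""
    # get_content_charset() returns None when no charset parameter is present
    charset = headers.get_content_charset() or default
    try:
        return raw.decode(charset)
    except LookupError:
        # The server named a codec Python does not know
        return raw.decode(default, errors="replace")


# Stand-in for response.headers of a real request
headers = Message()
headers["Content-Type"] = "text/html; charset=UTF-8"
text = decode_with_header_charset("T\u00f3m".encode("utf-8"), headers)
```

With a real request you would pass `resp.read()` and `resp.headers` instead of the stand-ins.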

There are other ways to get the same result, but I simply adapted your approach.

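Have you checked similar questions to see whether they help?

Python 2 does not generate unicode escapes in a string unless they are already in the string returned by read(). Does the document itself contain unicode escapes (maybe it is JSON)? In that case the json module might help, or try body.decode('unicode-escape').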
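If the body does turn out to be JSON, a small Python 3 sketch of the json route could look like this. The payload and its `title` key are invented for illustration; the real page's structure is not shown in the question.

```python
import json

# Hypothetical JSON body: non-ASCII characters arrive as \uXXXX escapes
payload = b'{"title": "T\\u00f3m l\\u01b0\\u1ee3c"}'

# The JSON parser resolves the escapes itself, so no 'unicode-escape' step is needed
data = json.loads(payload.decode("ascii"))
title = data["title"]
```

This tends to be safer than `unicode-escape` for JSON, because the parser also handles the other JSON escape sequences correctly.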

BeautifulSoup is useful for this kind of thing.

Thanks for the help, everyone. My problem was solved with mata's suggestion.
