Can't parse JSON text with unicode codes using Python 3.7
This should be a piece of cake, but I'm new to Python and I can't figure out how to do it. I have a JSON file that I got by downloading my personal data from Facebook. This is just part of the file:
[
{
"timestamp": 1575826804,
"attachments": [
],
"data": [
{
"post": "This is a test line with character \u00c3\u00ad and \u00c3\u00b3"
},
{
"update_timestamp": 1575826804
}
],
"title": "My Name"
},
{
"timestamp": 1575826526,
"attachments": [
],
"data": [
{
"update_timestamp": 1575826526
}
],
"title": "My Name"
},
{
"timestamp": 1575638718,
"data": [
{
"post": "This is another test line with character \u00c3\u00ad and \u00c3\u00b3 and line breaks\n"
}
],
"title": "My Name escribi\u00c3\u00b3 en la biograf\u00c3\u00ada de Someone."
},
{
"timestamp": 1575561399,
"attachments": [
{
"data": [
{
"external_context": {
"url": "https://youtu.be/lalalalalalaaeeeE"
}
}
]
}
],
"data": [
{
"update_timestamp": 1575561399
}
],
"title": "My Name"
}
]
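The file contains many unicode codes such as "\u00c3\u00ad", which I need to convert to their ASCII representation. First, I tried parsing this JSON file and loading it as a Python object with the "json" library:
import json

with open("test.json") as fp:
    data = json.load(fp)
print(type(data))
print(data[0])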
# output:
# <class 'list'>
# {'timestamp': 1575826804, 'attachments': [], 'data': [{'post': 'This is a test line with
# character Ã\xad and ó'}, {'update_timestamp': 1575826804}], 'title': 'My Name'}
This second attempt only works when the JSON string does not contain characters such as the line break "\n" or ":" inside a JSON value. In a case like mine it throws:
JSONDecodeError: Invalid control character at: line 33 column 82 (char 560)
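Character 560 is the trailing "\n" in the JSON value of "post" in the third entry shown above.
How should I properly load this JSON with Unicode? Is replacing the unicode strings with ASCII characters the only way?
Thanks in advance for your help.
I think you need to use 'raw_unicode_escape':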
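import json

# Read the file with raw_unicode_escape so each \u00XX escape becomes one code point,
# then turn those code points back into bytes and decode the bytes as UTF-8.
with open("j.json", encoding='raw_unicode_escape') as f:
    data = json.loads(f.read().encode('raw_unicode_escape').decode())
print(data[0])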
OUT: {'timestamp': 1575826804, 'attachments': [], 'data': [{'post': 'This is a test line with character í and ó'}, {'update_timestamp': 1575826804}], 'title': 'My Name'}
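To make the answer above easier to follow, here is a minimal sketch that splits its one-liner into separate steps. It uses an in-memory bytes literal instead of j.json, and the variable names (raw_bytes, step1, step2, step3) are only illustrative. It assumes, as in the question, that every \u00XX escape in the file stands for one byte of UTF-8 text.

```python
import json

# File content as raw bytes: the escapes are literal backslash sequences.
raw_bytes = rb'{"post": "character \u00c3\u00ad and \u00c3\u00b3"}'

# Step 1: raw_unicode_escape turns each \u00XX escape into a single code point.
step1 = raw_bytes.decode("raw_unicode_escape")

# Step 2: encoding back writes every code point below 256 as one byte,
# so U+00C3 U+00AD becomes the byte pair C3 AD.
step2 = step1.encode("raw_unicode_escape")

# Step 3: those byte pairs are valid UTF-8, so decoding yields the real letters.
step3 = step2.decode("utf-8")

data = json.loads(step3)
print(data["post"])  # character í and ó
```

One caveat with this approach: because the escapes are expanded before the JSON parser runs, an escaped quote such as \u0022 inside a value would become a bare quote and break parsing.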
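Does this help?
Hmm, that's odd: \u00c3\u00b3 is actually Ã³ in Unicode.
@IvánC.: when parsed as UTF-8, it is the character ó.
That made it work! The unicode codes are now decoded correctly, thanks!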
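As the comments point out, the escapes are really UTF-8 bytes written out as individual code points. An alternative sketch, not taken from the answer above, is to let the json module parse the file normally and repair the strings afterwards. The helper name fix_mojibake is hypothetical, and the sketch assumes every mis-encoded string round-trips through Latin-1; strings that do not are left unchanged.

```python
import json

def fix_mojibake(value):
    """Recursively re-decode strings whose UTF-8 bytes were stored as code points."""
    if isinstance(value, str):
        try:
            # Map each code point back to one byte, then decode the bytes as UTF-8.
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # not mis-encoded in this way, keep as is
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None

# Sample entry modelled on the question's data, including the escaped line break.
sample = r'[{"post": "line with \u00c3\u00ad and line breaks\n", "timestamp": 1575638718}]'
data = fix_mojibake(json.loads(sample))
print(repr(data[0]["post"]))
```

Because the JSON parser handles "\n" and other escapes before the repair step runs, values that contain line breaks or quotes do not need special treatment here.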