Parsing a file in Python that contains single quotes, double quotes, and contractions

Tags: python, json, parsing, text-mining

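I'm trying to parse a file in which some lines may contain a combination of single quotes, double quotes, and contractions. Each observation includes a string like the one below. When I try to parse the data, I run into problems parsing the reviews. For example: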

\'text\' : \'This is the first time I've tried really "fancy food" at a...\' 


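Preprocess the string with a simple double replacement: first escape all double quotes, then replace every escaped apostrophe with a double quote. This simply inverts the escaping, for example: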

# we'll define it as an object to keep the validity
src = "{\\'text\\' : \\'This is the first time I've tried really \"fancy food\" at a...\\'}"
# The double escapes are just so we can type it properly in Python.
# It's still the same underneath:
# {\'text\' : \'This is the first time I've tried really "fancy food" at a...\'}

preprocessed = src.replace("\"", "\\\"").replace("\\'", "\"")
# Now it looks like:
# {"text" : "This is the first time I've tried really \"fancy food\" at a..."}
It is now valid JSON (and, incidentally, a valid Python dict literal as well), so you can go on to parse it:

import json

parsed = json.loads(preprocessed)
# {'text': 'This is the first time I\'ve tried really "fancy food" at a...'}
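
The double replacement can also be wrapped in a small helper and applied to one line at a time. This is a minimal sketch, assuming every string delimiter in the input is written as \' and every literal double quote is unescaped; the function name and the sample line are hypothetical:

```python
import json


def parse_escaped_line(line):
    """Hypothetical helper: parse one line that uses backslash-apostrophe delimiters."""
    # Step 1: escape every literal double quote so it survives inside a JSON string.
    escaped = line.replace('"', '\\"')
    # Step 2: turn every escaped apostrophe (the real delimiters) into a double quote.
    as_json = escaped.replace("\\'", '"')
    return json.loads(as_json)


# Hypothetical sample line; the raw string keeps the backslashes verbatim.
sample = r"""{\'text\' : \'I've tried really "fancy food" here\'}"""
record = parse_escaped_line(sample)
print(record["text"])  # I've tried really "fancy food" here
```

A line that does not follow the pattern, for instance one where an apostrophe inside the text is itself backslash-escaped, will most likely fail with json.JSONDecodeError or parse to the wrong value, so it is worth catching that exception and logging the offending line.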
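As an alternative to json.loads, ast.literal_eval(preprocessed) returns the same dict, because the preprocessed string is also a valid Python dict literal.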

Update

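Judging by the line you posted, what you actually have is a (valid) representation of a 7-element tuple whose third element is the string representation of a dict, so you don't need to preprocess the string at all. Instead, evaluate the tuple first, then post-process the inner dict with another level of evaluation, i.e.: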

import ast

# let's first read the data from an 'input.txt' file so we don't have to escape it manually
with open("input.txt", "r") as f:
    data = f.read()

data = ast.literal_eval(data)  # first evaluate the main structure
data = data[:2] + (ast.literal_eval(data[2]), ) + data[3:]  # .. and then the inner dict

# this gives you `data` containing your 'serialized' tuple, i.e.:
print(data[4])  # 31.328237,-85.811893
# and you can access the children of the inner dict as well, i.e.:
print(data[2]["types"])  # ['restaurant', 'food', 'point_of_interest', 'establishment']
print(data[2]["opening_hours"]["weekday_text"][3])  # Thursday: 7:00 AM – 9:00 PM
# etc.
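
The same two-level evaluation can be tried without an input file. A minimal sketch with a made-up 3-element record standing in for the real 7-element tuple (the values are hypothetical, not taken from the posted data):

```python
import ast

# Hypothetical record: the third element is the string form of a dict.
raw = """('abc123', 'Some Cafe', "{'types': ['restaurant', 'food'], 'rating': 4.5}")"""

outer = ast.literal_eval(raw)        # tuple whose third element is still a str
inner = ast.literal_eval(outer[2])   # second pass turns that str into a dict
record = outer[:2] + (inner,) + outer[3:]

print(type(outer[2]).__name__)  # str
print(record[2]["types"])       # ['restaurant', 'food']
```

ast.literal_eval only accepts Python literals (strings, numbers, tuples, lists, dicts, sets, booleans, None), so unlike eval it does not execute arbitrary code from the file.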

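That said, I'd still recommend tracking down whoever produces this data and convincing them to use a proper serialization format; even the most basic JSON would be better than this.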
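
To illustrate that recommendation, here is a minimal sketch of what the producer could do instead, assuming it has the record as ordinary Python objects (the record below is hypothetical). Nested quotes and apostrophes are escaped by json.dumps, and a single json.loads call restores the whole structure:

```python
import json

# Hypothetical record shaped like the data above: scalars plus a nested dict.
record = ("abc123", "Some Cafe", {"types": ["restaurant", "food"], "note": 'I\'ve tried "fancy food"'})

line = json.dumps(record)     # tuples are written as JSON arrays
restored = json.loads(line)   # one call restores the nested structure (as a list)

print(restored[2]["note"])  # I've tried "fancy food"
```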

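Comments:

Why not use the json module?

@ I am using the json module... the problem is that the "JSON" I receive isn't JSON, because it is formatted as "{\'address_components\': [{\'long_name\': \'Fairhope\', \'short_name\': … so I have to reformat it a bit to make json.loads work.

The string above seems to be JSON, though.

@mad_ It doesn't validate as JSON, because JSON requires double quotes rather than single quotes, so 'long_name': 'Fairhope' would have to be "long_name": "Fairhope"; at least that's the only way I could get Python to read it as JSON.

Then the json tag is as misleading as the title, since you don't have JSON at all. But if it isn't JSON, the question can't be answered, because nobody except you, and possibly not even you, knows what you have.

Thanks for the insight, but it doesn't quite work for me. When I tried it, I got parsing problems I didn't have before. For example, this part: \'adr_address\': \'913 Rucker Blvd \'34,Enter

@nbas- Can you post the actual string you're trying to parse, and what you expect to get from it? The excerpt should be fixable with the routine above, but some parts may not follow that pattern.

I've updated my original question with one of the lines that is giving me trouble. It's at the top of the page.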

The ast.literal_eval alternative to json.loads for the preprocessed string, as a snippet:

import ast

parsed = ast.literal_eval(preprocessed)
# {'text': 'This is the first time I\'ve tried really "fancy food" at a...'}