Python: parsing non-standard semicolon-separated "JSON"

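I need to parse a non-standard "JSON" file. Items are separated by semicolons instead of commas. I can't simply replace ; with , because some values may themselves contain a ; character, for example "hello; world". How can I parse this into the same structure that JSON would normally parse to?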
{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}
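
To make the problem concrete, here is a minimal sketch of the naive approach on a shortened, single-line version of the data above (the variable names are only for illustration). It parses without error, but the semicolon inside the string value is rewritten along with the separators:

```python
import json

# shortened, single-line version of the sample data (illustrative only)
raw = '{"client": "someone"; "content": "hello; world"}'

# naive fix: turn every semicolon into a comma, including those inside strings
naive = json.loads(raw.replace(';', ','))

print(naive["content"])  # prints: hello, world
```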

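Use Python's tokenize module to transform the text stream into one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input too, even with the semicolons. It presents each string as a single token, while "raw" semicolons appear in the stream as individual tokenize.OP tokens that you can replace: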
import tokenize
import json

corrected = []

with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            corrected.append(token[1])

data = json.loads(''.join(corrected))

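This assumes that the format becomes valid JSON once the semicolons have been replaced with commas; for example, a trailing comma before a closing ] or } is not allowed. If you need to, you can keep track of the last comma added and remove it again when the next non-newline token is a closing bracket or brace.

Whitespace between tokens is dropped, but it could be restored by paying attention to the tokenize.NL tokens and to the (lineno, start) and (lineno, end) position tuples that are part of each token. Since whitespace around tokens doesn't matter to a JSON parser, I haven't bothered with this.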
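
As a rough illustration of the trailing-separator handling described above, the following sketch delays each comma until the next meaningful token is known and drops it if that token closes a container. The function name load_semicolon_json is hypothetical, and reading from a string through io.StringIO instead of a file is an assumption made to keep the example self-contained:

```python
import io
import json
import tokenize

def load_semicolon_json(text):
    """Parse JSON-like text that uses ';' instead of ',' (hypothetical helper)."""
    out = []
    pending_comma = False  # a ';' was seen but its ',' has not been emitted yet
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type in (tokenize.NL, tokenize.NEWLINE):
            continue  # line breaks carry no meaning for the JSON parser
        if tok.type == tokenize.OP and tok.string == ';':
            pending_comma = True
            continue
        if pending_comma:
            # emit the comma unless the separator turned out to be trailing
            closes = tok.string in ('}', ']') or tok.type == tokenize.ENDMARKER
            if not closes:
                out.append(',')
            pending_comma = False
        out.append(tok.string)
    return json.loads(''.join(out))

sample = '''{
  "client" : "someone";
  "server" : ["s1"; "s2";];
  "timestamp" : 1000000;
  "content" : "hello; world";
}'''
print(load_semicolon_json(sample))
```

Semicolons inside strings are untouched because the tokenizer hands each string over as one STRING token.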
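You can do something a bit odd and (probably) get it right.
Because JSON strings cannot contain raw control characters such as \t, you can replace every ; with \t, (a tab followed by a comma). The file will then parse correctly as long as your JSON parser can load non-strict JSON (as Python's can).
After that, you only need to serialize the data back to JSON, replace every \t, in the result with ; again, and use a normal JSON parser to load the final, correct object.
Some sample code in Python: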
data = '''{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world"
}'''

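import json

# strict=False lets the decoder accept raw control characters (the tab) inside strings
dec = json.JSONDecoder(strict=False).decode(data.replace(';', '\t,'))
# dumps() writes each tab inside a string as the two-character escape \t
enc = json.dumps(dec)
# turn the escaped marker back into the original semicolon, then parse normally
out = json.loads(enc.replace('\\t,', ';'))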
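
The same trick can be wrapped in a small function. This is only a sketch: the name loads_via_tab_marker is hypothetical, and the guard at the top is an added assumption, namely that input which already contains an escaped tab followed by a comma should be rejected because the final replacement could not tell it apart from the marker. A trailing ; before a closing bracket is not handled here.

```python
import json

def loads_via_tab_marker(text):
    """Parse semicolon-separated JSON using the tab marker trick (hypothetical helper)."""
    # Assumption: refuse input whose text already contains backslash, 't', comma,
    # since that sequence would be turned into ';' by the final replacement.
    if '\\t,' in text:
        raise ValueError("input contains an escaped tab followed by a comma")
    # Outside strings the tab is plain whitespace; inside strings it is a raw
    # control character, which strict=False allows.
    marked = json.JSONDecoder(strict=False).decode(text.replace(';', '\t,'))
    # Re-serialising escapes the tab, so the marker becomes backslash, 't', comma.
    return json.loads(json.dumps(marked).replace('\\t,', ';'))

print(loads_via_tab_marker('{"a": "x; y"; "b": [1; 2]}'))
```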

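Using a simple character state machine, you can convert this text back to valid JSON. The basic thing we need to handle is determining the current "state" (whether we are escaping a character, or inside a string, list, dictionary, etc.) and replacing ';' only when we are in the right state.
I don't know whether this is the proper way to write it, and there is probably a way to make it shorter, but I don't have enough programming skill to produce an optimal version.
I have commented it as much as I could: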
def filter_characters(text):
    # we use this dictionary to match opening/closing tokens
    STATES = {
        '"': '"', "'": "'",
        "{": "}", "[": "]"
    }

    # these two variables represent the current state of the parser
    escaping = False
    state = list()

    # we iterate through each character
    for c in text:
        if escaping:
            # if we are currently escaping, no special treatment
            escaping = False
        else:
            if c == "\\":
                # character is a backslash, set the escaping flag for the next character
                escaping = True
            elif state and c == state[-1]:
                # character is expected closing token, update state
                state.pop()
            elif state and state[-1] in ('"', "'"):
                # inside a string every other character is literal text
                pass
            elif c in STATES:
                # character is known opening token, update state
                state.append(STATES[c])
            elif c == ';' and state and state[-1] in ('}', ']'):
                # this is a delimiter directly inside a dict or list; change it
                c = ','
        yield c

    assert not state, "unexpected end of file"

def filter_text(text):
    return ''.join(filter_characters(text))

Tested with the following:
{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}

Returns:
{
  "client" : "someone",
  "server" : ["s1", "s2"],
  "timestamp" : 1000000,
  "content" : "hello; world",
  ...
}

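Pyparsing makes it easy to write a string transformer. Write an expression for the text to be changed, and add a parse action (a parse-time callback) that replaces the matched text with what you want. If some cases need to be avoided (such as quoted strings or comments), include them in the scanner but leave them unchanged. Then, to actually transform the string, call scanner.transformString.

(It wasn't clear from your example whether a ';' might follow the last element of one of your bracketed lists, so I added a term to suppress those, since a trailing ',' in a bracketed list is also invalid JSON.)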
sample = """
{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
}"""


from pyparsing import Literal, replaceWith, Suppress, FollowedBy, quotedString
import json

SEMI = Literal(";")
repl_semi = SEMI.setParseAction(replaceWith(','))
term_semi = Suppress(SEMI + FollowedBy('}'))
qs = quotedString

scanner = (qs | term_semi | repl_semi)
fixed = scanner.transformString(sample)
print(fixed)
print(json.loads(fixed))

Prints (the key order of the dict depends on the Python version):
{
  "client" : "someone",
  "server" : ["s1", "s2"],
  "timestamp" : 1000000,
  "content" : "hello; world"}
{'content': 'hello; world', 'timestamp': 1000000, 'client': 'someone', 'server': ['s1', 's2']}

How did this abomination come about? Are the separators always at the end of a line? Is it only ever a single JSON object? This is not "non-standard JSON"; it is not JSON at all. Find out what it is and find a parser for that.
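
The scanner idea from the pyparsing answer (match quoted strings first and leave them alone, then rewrite any bare semicolon) can also be sketched with only the standard re module. This is a swapped-in technique, not the pyparsing API: the pattern, the group names and the function semicolons_to_commas are all hypothetical, and the sketch assumes that strings are always double-quoted.

```python
import json
import re

# Alternatives are tried left to right, so a quoted string is consumed whole
# before any semicolon inside it can be seen.
TOKEN = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\])*")    # complete double-quoted string, escapes included
  | (?P<trailing>;(?=\s*[}\]]))      # separator directly before a closing bracket
  | (?P<sep>;)                       # any other bare separator
''', re.VERBOSE)

def _fix(match):
    if match.lastgroup == 'string':
        return match.group()   # keep strings exactly as they are
    if match.lastgroup == 'trailing':
        return ''              # drop it: a trailing comma would be invalid JSON
    return ','                 # ordinary separator

def semicolons_to_commas(text):
    """Rewrite bare ';' separators as ',' (hypothetical helper)."""
    return TOKEN.sub(_fix, text)

sample = '''{
  "client" : "someone";
  "server" : ["s1"; "s2";];
  "timestamp" : 1000000;
  "content" : "hello; world";
}'''
print(json.loads(semicolons_to_commas(sample)))
```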