Python: reading a file of incorrectly quoted dictionaries


I have a file containing a series of dictionaries, most of which are improperly quoted. An example:

{game:Available,player:Available,location:"Chelsea, London, England",time:Available}
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}
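As you can see, the keys can also differ from one dictionary to the next.

I tried reading the file with the json module and with the csv module's DictReader, but ran into trouble each time, because the double quotes always appear around the location value but not always around the other keys or values. So far I see two possibilities:

  • Replace the commas inside the location value (for example with ";") and remove all the quotes
  • Add quotes to every key and value (except location)

  • PS: my end goal is to format all these dictionaries so that I can create an SQL table whose columns are the union of the keys of all the dictionaries and where each row is one of my dictionaries, left empty where a value is missing.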
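The second option above (quoting every bare key and value) can be sketched roughly as follows. This is only an illustration, not code from the thread: the names `TOKEN`, `quote_bare_tokens` and `parse_line` are made up for the example, and it assumes that unquoted tokens never contain a colon, comma, brace or quote and that no value is empty.

```python
import json
import re

# Either an already-quoted string (left untouched) or a bare token:
# a run of characters containing no brace, colon, comma or double quote.
TOKEN = re.compile(r'"[^"]*"|[^{}:,"\s][^{}:,"]*')


def quote_bare_tokens(line):
    """Wrap every unquoted key or value in double quotes."""
    def fix(match):
        token = match.group(0)
        if token.startswith('"'):
            return token                     # already quoted, keep as is
        return json.dumps(token.strip())     # add quotes, drop stray spaces
    return TOKEN.sub(fix, line)


def parse_line(line):
    """Turn one loosely quoted line into a dict via the json module."""
    return json.loads(quote_bare_tokens(line))


sample = '{game:Available,player:Available,location:"Chelsea, London, England",time:Available}'
print(parse_line(sample))
```

Once every token is quoted, each line is valid JSON, so the standard `json.loads` does the rest.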

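    If your data is more complicated than the example you gave, or if the parsing has to be faster, you should probably look into a dedicated parsing library.

    Otherwise, you can write something cruder, such as: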

    contentlines = ["""{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}""", """{game:Available,player:Available,location:"Chelsea, London, England",time:Available}"""]
    def get_dict(line):
        keys = []
        values = []
        line = line.replace("{", "").replace("}", "")
        contlist = line.split(":")
        keys.append(contlist[0].strip('"').strip("'"))
        for entry in contlist[1:-1]:
            entry = entry.strip()
            if entry[0] == "'" or entry[0] == '"':
                endpos = entry[1:].find(entry[0]) + 2
            else:
                endpos = entry.find(",")
            values.append(entry[0:endpos].strip('"').strip("'"))
            keys.append(entry[endpos + 1:].strip('"').strip("'"))
        values.append(contlist[-1].strip('"').strip("'"))
        return dict(zip(keys, values))
    
    
    for line in contentlines:
        print(get_dict(line))
    

    I think this is a fairly complete piece of code.

    First, I created the following file:

    {surprise : "perturbating at start  ", game:Available Universal Dices Game,
        player:FTROE875574,location
    :"Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada",time:15h18}
    
    {"game":"Available","   player":"LOI4531",
    "location":  "Perth, Australia","time":"08h13","date":"Available"}
    
    {"game":Available,player:PLLI874,location:"Chelsea, London, England",time:20h35}
    
    {special:"midnight happening",game:"Available","player":YTR44,
    "location":"Paris, France","time":"02h24"
    ,
    "date":"Available"}
    
    {game:Available,surprise:"  hretyuuhuhu  ",player:FT875,location
    :,"time":11h22}
    
    {"game":"Available","player":"LOI4531","location":
    "Damas,Syria","time":"unavailable","date":"Available"}
    
    {"surprise   " : GARAMANANATALA Tower ,  game:Available Dices,player  :
      PuLuLu874,location:"  Westminster, London, England  ",time:20h01}
    
    {"game":"Available",special:"overnight",   "player":YTR44,"location":
    "Madrid, Spain"    ,     "time":
    "12h33",
    date:"Available"
    }
    

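    Then the following code processes the content of the file in two phases:

    • first, it goes through the content and collects all the keys found in all the dictionaries

    • a dictionary posis is derived from them; for each key it gives the position that the corresponding value must occupy in a row

    • second, in another pass over the file, the rows are built one after another and collected in a list rows

    Note, by the way, that the condition on the value associated with the key location is respected.

    I wrote this code with a huge file of several GB in mind, one that cannot be read into memory in full: such a file has to be processed chunk by chunk. That is why the code contains these instructions: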

    while chunk:
        chunk = f.read(120)
        ss = ''.join((prec,chunk))
        ecr.append('\n\n------------------------------------------------------------\nss   == %r' %ss)
        mat_dic = None
        for mat_dic in dicreg.finditer(ss):
            ............
            ...............
        if mat_dic:
            prec = ss[mat_dic.end():]
        else:
            prec += chunk
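For readers who want a runnable version of the chunked loop above, here is a minimal sketch. It is not the answer's own code: the generator name `iter_records` and the simplified pattern `RECORD` are assumptions made for this example, and it assumes that no value contains a closing brace.

```python
import io
import re

# Simplified pattern for one complete '{...}' block (assumes no '}' inside values).
RECORD = re.compile(r'\{[^}]*\}')


def iter_records(fileobj, chunk_size=120):
    """Yield each '{...}' block from fileobj, reading chunk_size characters at a time."""
    pending = ''                          # text read so far that is not yet a complete block
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:                     # end of file
            break
        pending += chunk
        end = 0
        for match in RECORD.finditer(pending):
            yield match.group()
            end = match.end()
        pending = pending[end:]           # keep only the unfinished tail


# A small in-memory file stands in for the real one.
demo = io.StringIO('{a:1,b:2}\n{c:"x, y",\nd:4}\n{e:5}')
records = list(iter_records(demo, chunk_size=7))
print(records)
```

Because it is a generator, only the unfinished tail of the current chunk is kept in memory; nothing accumulates unless the caller stores the yielded blocks.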
    
    However, if the file is not too big and can therefore be read in one go, the code can obviously be simplified:

    import re
    
    dicreg = re.compile(r'(?<=\{)[^}]*}')
    
    kvregx = re.compile('[ \r\n]*'
                        '(" *)?((location)|[^:]+?)(?(1) *")'
                        '[ \r\n]*'
                        ':'
                        '[ \r\n]*'
                        '(?(3)|(" *)?)([^:]*?)(?(4) *")'
                        r'[ \r\n]*(?:,(?=[^,]+?:)|\})')
    
    
    checking_dict = {}
    checking_list = []
    
    filename = 'zzz.txt'
    
    with open(filename) as f:
        content = f.read()
    
    
    
    
    ######## First part: to gather all the keys in all the dictionaries
    
    ecr = []
    
    for mat_dic in dicreg.finditer(content):
        ecr.append('\nmmmmmmm dictionary found in ss mmmmmmmmmmmmmm')
        for mat_kv in kvregx.finditer(mat_dic.group()):
            k,v = mat_kv.group(2,5)
            ecr.append('%s  :  %s' % (k,v))
            if k in checking_list:
                checking_dict[k] += 1
            else:
                checking_list.append(k)
                checking_dict[k] = 1
    
    
    print('\n'.join(ecr))
    print('\n\n\nchecking_dict == %s\n\nchecking_list        == %s' % (checking_dict, checking_list))
    
    ######## The keys are sorted in order that the less frequent ones are at the end
    checking_list.sort(key=lambda k: checking_dict[k], reverse=True)
    posis = dict((k,i) for i,k in enumerate(checking_list))
    print('\nchecking_list sorted == %s\n\nposis == %s' % (checking_list, posis))
    
    
    
    ######## Now, the file is read again to build a list of rows 
    
    
    base = [''] * len(checking_list)
    rows = []
    
    for mat_dic in dicreg.finditer(content):
        li = base[:]
        for mat_kv in kvregx.finditer(mat_dic.group()):
            k,v = mat_kv.group(2,5)
            li[posis[k]] = v
        rows.append(li)
    
    
    print('\n\n%s\n%s' % (checking_list, 30*'___'))
    print('\n'.join(str(li) for li in rows))
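The list of columns and the rows printed above map naturally onto the SQL table mentioned in the question. The following is only a sketch of that last step, using the standard sqlite3 module with an in-memory database; the table name `games`, the helper `quote_identifier` and the sample `records` are invented for the illustration, and missing values are stored as NULL rather than as empty strings.

```python
import sqlite3

# Hypothetical sample data, already parsed into dicts.
records = [
    {'game': 'Available', 'player': 'PLLI874',
     'location': 'Chelsea, London, England', 'time': '20h35'},
    {'game': 'Available', 'player': 'LOI4531',
     'location': 'Perth, Australia', 'time': '08h13', 'date': 'Available'},
]

# Union of all keys, in order of first appearance.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)


def quote_identifier(name):
    """Quote a column name for SQLite, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


conn = sqlite3.connect(':memory:')
column_sql = ', '.join(quote_identifier(c) + ' TEXT' for c in columns)
conn.execute('CREATE TABLE games (%s)' % column_sql)

placeholders = ', '.join('?' for _ in columns)
conn.executemany(
    'INSERT INTO games VALUES (%s)' % placeholders,
    # dict.get returns None for a missing key, which SQLite stores as NULL
    [tuple(record.get(c) for c in columns) for record in records],
)
conn.commit()

table_rows = conn.execute('SELECT * FROM games').fetchall()
for row in table_rows:
    print(row)
```

The values go through `?` placeholders; only the column names, which cannot be parameterised, are interpolated into the statement after quoting.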
    
    Hopefully this pyparsing solution will be easier to follow and maintain over time:

    data = """\
    {game:Available,player:Available,location:"Chelsea, London, England",time:Available} 
    {"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}"""
    
    from pyparsing import Suppress, Word, alphas, alphanums, QuotedString, Group, Dict, delimitedList
    
    LBRACE,RBRACE,COLON = map(Suppress, "{}:")
    key = QuotedString('"') | Word(alphas) 
    value =  QuotedString('"') | Word(alphanums+"_")
    keyvalue = Group(key + COLON + value)
    
    dictExpr = LBRACE + Dict(delimitedList(keyvalue)) + RBRACE
    
    for d in dictExpr.searchString(data):
        print(d.asDict())
    
    Prints:

    {'player': 'Available', 'game': 'Available', 'location': 'Chelsea, London, England', 'time': 'Available'}
    {'date': 'Available', 'player': 'Available', 'game': 'Available', 'location': 'Chelsea, London, England', 'time': 'Available'}
    

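    Thanks a lot for your answer. After trying it on a very large file, I got a memory error: ecr.append('\n\n-----------------------------------------------------------------------------------\nss==%r'%ss) MemoryError. I'll try to figure it out!

    I used it to work around the memory problem in eyquem's answer.

    PyParsing looks very powerful. Edit: it works well, but unfortunately the parsing is very slow. Thanks.

    import re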
    
    text = """
    {game:Available,player:Available,location:"Chelsea, London, England",time:Available}
    {"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}
    """
    
    dicts = re.findall(r"{.+?}", text)                         # Split the dicts
    for dict_ in dicts:
        dict_ = dict(re.findall(r'(\w+|".*?"):(\w+|".*?")', dict_))    # Get the elements
        print(dict_)
    
    >>>{'player': 'Available', 'game': 'Available', 'location': '"Chelsea, London, England"', 'time': 'Available'}
    >>>{'"game"': '"Available"', '"time"': '"Available"', '"player"': '"Available"', '"date"': '"Available"', '"location"': '"Chelsea, London, England"'}
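As the output above shows, the double quotes are kept as part of the keys and values. A small variation, sketched here on the same sample text, strips them after matching; the variable names `block`, `record` and `parsed` are chosen for this example only.

```python
import re

text = """
{game:Available,player:Available,location:"Chelsea, London, England",time:Available}
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}
"""

parsed = []
for block in re.findall(r'\{.+?\}', text):                   # split the dicts
    pairs = re.findall(r'(\w+|".*?"):(\w+|".*?")', block)    # get the key/value pairs
    # Remove the surrounding double quotes, if any, from each key and value.
    record = {k.strip('"'): v.strip('"') for k, v in pairs}
    parsed.append(record)

for record in parsed:
    print(record)
```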
    