Python: reading a file of dictionaries with inconsistent quoting
I have a file containing a series of dictionaries, most of which are quoted inconsistently. An example:
{game:Available,player:Available,location:"Chelsea, London, England",time:Available}
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}
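As you can see, the keys can also differ from one dictionary to the next.
I tried to read the file with the json module and with the csv module's DictReader, but each attempt runs into trouble because the double quotes always appear around the location value, but not always around the other keys or values. So far I see two possibilities:
PS: My end goal is to format all these dictionaries into an SQL table whose columns are the union of the keys of all the dictionaries and where each row is one of my dictionaries, with an empty value wherever a key is missing.
If the data is more complex than the example you gave, or if the parsing has to be faster, you should perhaps investigate a more robust parsing approach. Otherwise, you can write something cruder, such as: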
contentlines = [
    '{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}',
    '{game:Available,player:Available,location:"Chelsea, London, England",time:Available}',
]

def get_dict(line):
    keys = []
    values = []
    line = line.replace("{", "").replace("}", "")
    # Splitting on ":" yields the first key, then "value,next key" chunks, then the last value.
    contlist = line.split(":")
    keys.append(contlist[0].strip('"').strip("'"))
    for entry in contlist[1:-1]:
        entry = entry.strip()
        if entry[0] == "'" or entry[0] == '"':
            # Quoted value: it ends just after the matching closing quote.
            endpos = entry[1:].find(entry[0]) + 2
        else:
            # Bare value: it ends at the first comma.
            endpos = entry.find(",")
        values.append(entry[0:endpos].strip('"').strip("'"))
        keys.append(entry[endpos + 1:].strip('"').strip("'"))
    values.append(contlist[-1].strip('"').strip("'"))
    return dict(zip(keys, values))

for line in contentlines:
    print(get_dict(line))
I think this is a fairly complete solution. First, I created the following file:
{surprise : "perturbating at start ", game:Available Universal Dices Game,
player:FTROE875574,location
:"Lakeview School, Kingsmere Boulevard, Saskatoon, Saskatchewan , Canada",time:15h18}
{"game":"Available"," player":"LOI4531",
"location": "Perth, Australia","time":"08h13","date":"Available"}
{"game":Available,player:PLLI874,location:"Chelsea, London, England",time:20h35}
{special:"midnight happening",game:"Available","player":YTR44,
"location":"Paris, France","time":"02h24"
,
"date":"Available"}
{game:Available,surprise:" hretyuuhuhu ",player:FT875,location
:,"time":11h22}
{"game":"Available","player":"LOI4531","location":
"Damas,Syria","time":"unavailable","date":"Available"}
{"surprise " : GARAMANANATALA Tower , game:Available Dices,player :
PuLuLu874,location:" Westminster, London, England ",time:20h01}
{"game":"Available",special:"overnight", "player":YTR44,"location":
"Madrid, Spain" , "time":
"12h33",
date:"Available"
}
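The following code then processes the file content in two phases:
- first, the content is traversed to collect every key that occurs in any of the dictionaries
- a dictionary posis is derived from them, giving for each key the position its value must occupy in a row
- second, in another pass over the file, the rows are built one by one and collected in a list rows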
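The two phases can be sketched independently of the parsing, assuming the dictionaries have already been extracted into ordinary Python dicts. This is a Python 3 illustration only: the function name build_rows and the sample data are hypothetical, and posis plays the role described in the list above.

```python
from collections import Counter

def build_rows(dicts, missing=""):
    """Lay out dictionaries with differing keys as equal-length rows."""
    # Phase 1: count how often each key occurs (Counter keeps first-seen order).
    counts = Counter()
    for d in dicts:
        counts.update(d.keys())
    # Most frequent keys first; the sort is stable, so ties keep first-seen order.
    columns = sorted(counts, key=lambda k: counts[k], reverse=True)
    # posis maps each key to the column index its value must occupy.
    posis = {k: i for i, k in enumerate(columns)}
    # Phase 2: build one row per dictionary, padding absent keys.
    rows = []
    for d in dicts:
        row = [missing] * len(columns)
        for k, v in d.items():
            row[posis[k]] = v
        rows.append(row)
    return columns, rows

# Hypothetical sample data, not taken from the file above.
sample = [
    {"game": "Available", "player": "PLLI874", "time": "20h35"},
    {"game": "Available", "date": "Available"},
]
columns, rows = build_rows(sample)
print(columns)
print(rows)
```

Putting the rarest keys last keeps the sparsest columns at the right-hand side of the table.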
while chunk:
    chunk = f.read(120)
    ss = ''.join((prec,chunk))
    ecr.append('\n\n------------------------------------------------------------\nss == %r' %ss)
    mat_dic = None
    for mat_dic in dicreg.finditer(ss):
        ............
        ...............
    if mat_dic:
        prec = ss[mat_dic.end():]
    else:
        prec += chunk
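The fragment above reads the file in chunks and carries the not-yet-parsed tail over to the next iteration. A self-contained Python 3 sketch of that idea follows. It is not the original code: the generator name iter_records, the RECORD pattern, and the demo input are assumptions, and it presumes that no record contains a brace inside a value.

```python
import io
import re

# Assumption: records never contain nested or quoted braces.
RECORD = re.compile(r"\{[^{}]*\}")

def iter_records(fileobj, chunk_size=120):
    """Yield each complete '{...}' record from a file read in fixed-size chunks."""
    pending = ""  # unparsed tail carried over from the previous chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer = pending + chunk
        last_end = 0
        for match in RECORD.finditer(buffer):
            yield match.group()
            last_end = match.end()
        # Keep only what follows the last complete record.
        pending = buffer[last_end:]

# Hypothetical input; a tiny chunk size forces records to span several chunks.
demo = io.StringIO('{game:Available,\nplayer:FT875}\n{"game":"Available"}')
records = list(iter_records(demo, chunk_size=10))
print(records)
```

Because records are yielded one at a time and nothing is accumulated for logging, memory use stays bounded by the chunk size plus the longest record.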
However, if the file is small enough to be read in one go, the code can obviously be simplified:
import re

# A dictionary body: everything after "{" up to and including the closing "}".
dicreg = re.compile(r'(?<=\{)[^}]*}')

# One key/value pair: group 2 captures the key and group 5 the value.
# Quotes around either are optional; the location value keeps its quotes.
kvregx = re.compile(r'[ \r\n]*'
                    r'(" *)?((location)|[^:]+?)(?(1) *")'
                    r'[ \r\n]*'
                    r':'
                    r'[ \r\n]*'
                    r'(?(3)|(" *)?)([^:]*?)(?(4) *")'
                    r'[ \r\n]*(?:,(?=[^,]+?:)|\})')

checking_dict = {}
checking_list = []

filename = 'zzz.txt'
with open(filename) as f:
    content = f.read()

######## First part: to gather all the keys in all the dictionaries
ecr = []
for mat_dic in dicreg.finditer(content):
    ecr.append('\nmmmmmmm dictionary found in ss mmmmmmmmmmmmmm')
    for mat_kv in kvregx.finditer(mat_dic.group()):
        k,v = mat_kv.group(2,5)
        ecr.append('%s : %s' % (k,v))
        if k in checking_list:
            checking_dict[k] += 1
        else:
            checking_list.append(k)
            checking_dict[k] = 1
print('\n'.join(ecr))
print('\n\n\nchecking_dict == %s\n\nchecking_list == %s' % (checking_dict,checking_list))

######## The keys are sorted so that the less frequent ones are at the end
checking_list.sort(key=lambda k: checking_dict[k], reverse=True)
posis = dict((k,i) for i,k in enumerate(checking_list))
print('\nchecking_list sorted == %s\n\nposis == %s' % (checking_list,posis))

######## Now, the content is scanned again to build a list of rows
base = ['' for i in range(len(checking_list))]
rows = []
for mat_dic in dicreg.finditer(content):
    li = base[:]
    for mat_kv in kvregx.finditer(mat_dic.group()):
        k,v = mat_kv.group(2,5)
        li[posis[k]] = v
    rows.append(li)
print('\n\n%s\n%s' % (checking_list,30*'___'))
print('\n'.join(str(li) for li in rows))
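For the SQL table mentioned in the question, a column list and a list of rows shaped like checking_list and rows could be loaded with the standard sqlite3 module. The following is only a Python 3 sketch: the table name games, the in-memory database, and the sample values are assumptions made for illustration, and empty strings are stored as NULL.

```python
import sqlite3

# Hypothetical inputs shaped like `checking_list` and `rows`.
columns = ["game", "player", "location", "time", "date"]
rows = [
    ["Available", "PLLI874", "Chelsea, London, England", "20h35", ""],
    ["Available", "LOI4531", "Perth, Australia", "08h13", "Available"],
]

conn = sqlite3.connect(":memory:")
# Column names come from the data, so quote them (doubling any embedded quote).
quoted = ['"%s"' % c.replace('"', '""') for c in columns]
conn.execute("CREATE TABLE games (%s)" % ", ".join("%s TEXT" % q for q in quoted))
placeholders = ", ".join("?" for _ in columns)
# Store missing values as NULL rather than as empty strings.
conn.executemany(
    "INSERT INTO games VALUES (%s)" % placeholders,
    [[v if v != "" else None for v in row] for row in rows],
)
conn.commit()

result = conn.execute(
    'SELECT "player", "date" FROM games ORDER BY "player"'
).fetchall()
print(result)
```

The values go through ? placeholders, so quotes or commas inside a location need no escaping.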
Hopefully this pyparsing solution is easier to follow and maintain over time:
data = """\
{game:Available,player:Available,location:"Chelsea, London, England",time:Available}
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}"""
from pyparsing import Suppress, Word, alphas, alphanums, QuotedString, Group, Dict, delimitedList
LBRACE,RBRACE,COLON = map(Suppress, "{}:")
key = QuotedString('"') | Word(alphas)
value = QuotedString('"') | Word(alphanums+"_")
keyvalue = Group(key + COLON + value)
dictExpr = LBRACE + Dict(delimitedList(keyvalue)) + RBRACE
for d in dictExpr.searchString(data):
    print(d.asDict())
Prints (key order may vary):
{'player': 'Available', 'game': 'Available', 'location': 'Chelsea, London, England', 'time': 'Available'}
{'date': 'Available', 'player': 'Available', 'game': 'Available', 'location': 'Chelsea, London, England', 'time': 'Available'}
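Thank you very much for your answers. After trying to process a very large file, I ran into a memory error: ecr.append('\n\n-----------------------------------------------------------------------------------\nss==%r'%ss) MemoryError. I will try to work it out!
I used it to get around the memory problem in eyquem's answer. PyParsing seems very powerful. Edit: it works well, but unfortunately the parsing is very slow. Thanks.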
import re
text = """
{game:Available,player:Available,location:"Chelsea, London, England",time:Available}
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}
"""
dicts = re.findall(r"{.+?}", text) # Split the dicts
for dict_ in dicts:
    dict_ = dict(re.findall(r'(\w+|".*?"):(\w+|".*?")', dict_)) # Get the elements
    print(dict_)
>>>{'player': 'Available', 'game': 'Available', 'location': '"Chelsea, London, England"', 'time': 'Available'}
>>>{'"game"': '"Available"', '"time"': '"Available"', '"player"': '"Available"', '"date"': '"Available"', '"location"': '"Chelsea, London, England"'}
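As the output shows, quoted keys and values keep their literal double quotes. One possible variation, sketched here for Python 3, captures the text inside the quotes instead. The names PAIR and parse_record are hypothetical, and the pattern assumes that unquoted tokens are single words and that quoted strings contain no escaped quotes.

```python
import re

# Each side of the colon is either a quoted string (captured without its
# quotes) or a bare word; findall returns '' for the alternative not taken.
PAIR = re.compile(r'(?:"([^"]*)"|(\w+))\s*:\s*(?:"([^"]*)"|(\w+))')

def parse_record(record):
    """Return a dict for one '{...}' record, with surrounding quotes removed."""
    return {
        (qk or bk): (qv or bv)
        for qk, bk, qv, bv in PAIR.findall(record)
    }

text = """
{game:Available,player:Available,location:"Chelsea, London, England",time:Available}
{"game":"Available","player":"Available","location":"Chelsea, London, England","time":"Available","date":"Available"}
"""

parsed = [parse_record(r) for r in re.findall(r"\{.+?\}", text)]
for d in parsed:
    print(d)
```

Both records then yield the same unquoted keys, which makes it straightforward to take the union of keys across dictionaries.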