Java: a regex for something that is almost JSON, but not quite
Hello all, I'm trying to parse a well-formed string into its component pieces. The string is very JSON-like, but strictly speaking it is not JSON. The strings are formed like this:
createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
The output should simply be chunks of text; nothing special needs to be done with them at this point:
createdAt=Fri Aug 24 09:48:51 EDT 2012
id=238996293417062401
text='Test Test'
source="Region"
entities=[foo, bar]
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
Using the expression below, I am able to separate out most of the fields:
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))
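It splits on every comma that is not inside quotation marks of either type, but I can't seem to make the leap to splitting only on commas that are also not inside brackets or braces.
Instead of splitting on the commas, you can use the following regular expression to match the chunks you want:
(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)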
Python:
import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)
>> [
('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'),
('id', '238996293417062401'),
('text', "'Test Test'"),
('source', '"Region"'),
('entities', '[foo, bar]'),
('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
]
I've set up the groups so that the "key" and the "value" are captured separately. The pattern behaves the same way in Java.
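Explanation of the regex:
- (?:^| ) is a non-capturing group that matches the start of the line or a space
- (.+?) matches the "key"
- = matches the equals sign
- (\{.+?\}|\[.+?\]|.+?) matches either a run of characters between { and }, a run of characters between [ and ], or, failing both, just a run of characters
- (?=,|$) is a lookahead that matches either a , or the end of the line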
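The breakdown above can also be written into the pattern itself. The following is a minimal sketch using Python's re.VERBOSE flag; the names FIELD_RE, sample and fields are made up for illustration, and the shortened sample string is an assumption rather than the full input from the question.

```python
import re

# The same pattern, spelled out with re.VERBOSE so each part carries a comment.
# In verbose mode a literal space has to be written as "[ ]".
FIELD_RE = re.compile(r"""
    (?:^|[ ])       # start of the string, or the space that follows a comma
    (.+?)           # group 1: the key (lazy, so it stops before the "=")
    =               # the equals sign
    (               # group 2: the value, one of:
        \{.+?\}     #   text between "{" and "}"
      | \[.+?\]     #   text between "[" and "]"
      | .+?         #   anything else (lazy)
    )
    (?=,|$)         # lookahead: a comma or the end of the string comes next
""", re.VERBOSE)

sample = "id=238996293417062401, text='Test Test', entities=[foo, bar]"

# findall returns (key, value) tuples, which dict() accepts directly.
fields = dict(FIELD_RE.findall(sample))
print(fields)
```

Building a dict is only one way to consume the two groups; note that it would silently keep the last value if a key were repeated.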
import re

# just comma
sep_re = re.compile(r',')
# open paren, open bracket, or open brace
# (braces are included so the {...} values from the question stay in one piece)
inc_re = re.compile(r'[\[({]')
# close paren, close bracket, or close brace
dec_re = re.compile(r'[)\]}]')
# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')

# This class could've been just a generator function, but I couldn't
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
    def __init__(self):
        self.pos = 0

    def _match(self, regex, s):
        # Try to match regex at the current position; advance past it on success.
        m = regex.match(s, self.pos)
        if m:
            self.pos += len(m.group(0))
            self.token = m.group(0)
        else:
            self.token = ''
        return self.token

    def tokenize(self, s):
        field = ''  # the field we're working on
        depth = 0   # how many parens/brackets/braces deep we are
        while self.pos < len(s):
            if not depth and self._match(sep_re, s):
                # In Java, change the "yields" to append to a List, and you'll
                # have something roughly equivalent (but non-lazy).
                yield field
                field = ''
            else:
                if self._match(inc_re, s):
                    depth += 1
                elif self._match(dec_re, s):
                    depth -= 1
                elif self._match(chunk_re, s):
                    pass
                else:
                    # everything else we just consume one character at a time
                    self.token = s[self.pos]
                    self.pos += 1
                field += self.token
        yield field
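This implementation takes a few shortcuts:
- The string escapes are really lazy: it only supports \" in double-quoted strings and \' in single-quoted strings. This is easy to fix.
- It only keeps track of the nesting level. It does not verify that parens are matched with parens (rather than brackets). If you care about that, you can change depth into some sort of stack and push/pop the parens/brackets onto it.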
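The stack idea from the second shortcut could look roughly like the sketch below. It is not the answer's tokenizer: split_top_level is a hypothetical standalone helper, it treats braces as a third bracket type, it strips whitespace around each field, and it assumes that any quote character opens a string literal (so an apostrophe in unquoted text would confuse it).

```python
def split_top_level(s):
    """Split s on commas that are outside quotes and outside (), [] and {}.

    Raises ValueError when a closing character does not match the most
    recent opener, or when a bracket or quote is left unclosed.
    """
    closer_for = {"(": ")", "[": "]", "{": "}"}
    stack = []      # expected closers, innermost last (replaces the depth counter)
    quote = None    # the quote character we are currently inside, if any
    fields, start, i = [], 0, 0
    while i < len(s):
        ch = s[i]
        if quote:
            if ch == "\\":
                i += 1              # skip the escaped character as well
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch in closer_for:
            stack.append(closer_for[ch])
        elif ch in ")]}":
            if not stack or stack.pop() != ch:
                raise ValueError(f"unexpected {ch!r} at index {i}")
        elif ch == "," and not stack:
            # a comma at nesting level zero ends the current field
            fields.append(s[start:i].strip())
            start = i + 1
        i += 1
    if stack or quote:
        raise ValueError("unclosed bracket or quote")
    fields.append(s[start:].strip())
    return fields


sample = ("createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, "
          "text='Test Test', source=\"Region\", entities=[foo, bar], "
          "user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}")
for chunk in split_top_level(sample):
    print(chunk)
```

On the string from the question this yields the six chunks listed there, while a mismatched input such as a=(1, 2] raises ValueError instead of being split silently.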
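[...], then it fails badly.
Considering that this shouldn't be solved with a regex, I've done my best. If you have any suggestions on how I could improve this, please let me know.
It can be solved by matching quoted strings (very common): "(?:[^"\]|\\)+". But you are right that this shouldn't be solved with a regex, because the case I pointed out is not the only one where the regex can fail.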
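As a rough illustration of the quoted-string suggestion, the sketch below adds two quoted-string alternatives to the value group of the earlier pattern. Everything here is an assumption: the failing input is taken to be a comma inside a quoted value, the quoted-string sub-patterns are a common formulation rather than the commenter's exact regex, and the names DQ, SQ, ORIGINAL and PATCHED are made up for the example.

```python
import re

# Common quoted-string patterns: any run of non-quote, non-backslash
# characters or backslash-escaped characters between matching quotes.
DQ = r'"(?:[^"\\]|\\.)*"'
SQ = r"'(?:[^'\\]|\\.)*'"

# The pattern from the answer, unchanged.
ORIGINAL = re.compile(r"(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)")

# Same pattern, but a quoted string is tried before the catch-all ".+?",
# so commas inside quotes no longer end the value early.
PATCHED = re.compile(
    r"(?:^| )(.+?)=(\{.+?\}|\[.+?\]|" + DQ + "|" + SQ + r"|.+?)(?=,|$)"
)

sample = "text='Test, Test', id=1"
print(ORIGINAL.findall(sample))  # the quoted value is cut at its inner comma
print(PATCHED.findall(sample))   # [('text', "'Test, Test'"), ('id', '1')]
```

This only patches one failure mode. Nested structures that contain commas followed by closing brackets, for instance, can still trip the lazy alternatives, which is consistent with the remark that a regex is not the right tool here.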
>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']