How can I identify and store variables and data from a formatted text file in Python 3.x?
I'm trying to use Python to identify and store data from a text file. It has been a while since I last used Python, so I may be rusty. Essentially, the text file contains data of the form:
<THING1> \
var1 = 0 \#
var2 = "0.0 100.0 0.0" \#
var3 = "IDENTIFYING_WORD" \#
var4 = 2 \#
</THING1>
<THING2> \
# something similar
</THING2>
But this isn't what I want.
Ideally, I'd like the data stored in Python like this:
var1 = 0
var2 = [0, 100, 0]
var3 = "IDENTIFYING_WORD"
var4 = 2
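where var1 and var4 are integers, var2 is an array, and var3 is a string. Does anyone have ideas on how to do this? I tried looking elsewhere on Stack Overflow but couldn't find anything. If this has already been answered, please point me in the right direction and I'll take this question down.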
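Thanks.

Usually, when people start coding in Python, they try to solve these typical parsing problems with simple approaches such as string manipulation or regular expressions. Both work well for simple problems, but for more complex ones there are better options.

For this particular problem, for instance, there's no real reason not to try one of the many parsing libraries that already exist. To demonstrate, let's see how to solve it with the lark library.

After installing it with pip, try the following snippet: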
import textwrap

from lark import Lark

if __name__ == "__main__":
    # Sample input; the raw string keeps the backslashes intact.
    content = textwrap.dedent(r"""
        <THING1> \
        var1 = 0 \#
        var2 = "0.0 100.0 0.0" \#
        var3 = "IDENTIFYING_WORD" \#
        # something similar
        var4 = 2 \#
        </THING1>
        <THING2> \
        # something similar
        var1 = 0 \#
        </THING2>
    """)

    # A file is a sequence of tagged blocks containing assignments and comments.
    grammar = r"""
        ?start: block*
        block: tag_start line* tag_end
        tag_start: "<" NAME ">" "\\"
        tag_end: "</" NAME ">"
        line: assignment
            | comment
        assignment: lhs "=" rhs "\#"
        comment: "#" NAME* NEWLINE
        lhs: NAME
        rhs: ESCAPED_STRING
            | NAME
            | NUMBER
        %import common.NEWLINE
        %import common.ESCAPED_STRING
        %import common.CNAME -> NAME
        %import common.NUMBER
        %import common.WS
        %ignore WS
    """

    parser = Lark(grammar)
    tree = parser.parse(content)

    # Print each block's tag name followed by its assignments.
    for block in tree.find_data("block"):
        tag_name = list(block.find_data("tag_start"))[0].children[0]
        print(tag_name.center(80, '-'))
        for assignment in block.find_data("assignment"):
            var_name = assignment.children[0].children[0]
            value = assignment.children[1].children[0]
            print(var_name, "=>", value)
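The example above isn't meant to be complete or to cover every detail; it's just a small demonstration of how easily a modern parsing library handles simple problems like this one. I'll leave it as an exercise for you to adapt the code and use lark to fit your needs. The snippet prints:

-------------------------------------THING1-------------------------------------
var1 => 0
var2 => "0.0 100.0 0.0"
var3 => "IDENTIFYING_WORD"
var4 => 2
-------------------------------------THING2-------------------------------------
var1 => 0

Maybe you could use a regular expression like this: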
import re

def get_value(y):
    # var1, var3 and var4 hold a single value: an int if possible, else the raw string.
    if 'var1' in y or 'var3' in y or 'var4' in y:
        return_value = y.split('=')[1].strip()
        try:
            return int(return_value)
        except ValueError:
            return return_value
    # var2 holds a quoted, space-separated list of floats.
    elif 'var2' in y:
        return_value = y.split('=')[1].strip().split(" ")
        return [float(i.replace('"','')) for i in return_value]

# Raw string so that the backslashes are kept literally, as they would be in a file.
string = r"""
<THING1> \
var1 = 0 \#
var2 = "0.0 100.0 0.0" \#
var3 = "IDENTIFYING_WORD" \#
var4 = 2 \#
</THING1>
<THING2> \
var1 = 5 \#
var2 = "0.0 100.0 0.0" \#
var3 = "IDENTIFYING_WORD" \#
var4 = 7 \#
</THING2>
"""

# Capture everything between each opening and closing THING tag.
pat = re.compile(r'<THING\d>(.*?)</THING\d>')
x = re.findall(pat, string.replace('\n',''))

mainlist = [['var1','var2','var3','var4']]
for i in x:
    mylist = []
    # Each assignment ends with the two characters \#.
    for j in i.split(r'\#'):
        if j.strip() != '':
            mylist.append(get_value(j))
    mainlist.append(mylist)
print(mainlist)
My mistake. They are forward slashes, yes. They always take the form "(space)\".
Output of the regex snippet:
[
    ['var1', 'var2', 'var3', 'var4'],
    [0, [0.0, 100.0, 0.0], '"IDENTIFYING_WORD"', 2],
    [5, [0.0, 100.0, 0.0], '"IDENTIFYING_WORD"', 7]
]
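
If you'd rather not hard-code the variable names, here is a minimal standard-library sketch of the same idea that returns one dictionary per block and converts each value to the type the question asks for. The helper names `convert_value` and `parse_blocks` are made up for this illustration, and the sketch assumes every assignment sits on its own line and that a quoted value made up only of numbers should become a list of floats (or a single number if there is just one). Treat it as a starting point rather than a finished parser.

```python
import re

# Hypothetical helpers for illustration; not part of either answer above.
BLOCK_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
ASSIGN_RE = re.compile(r'(\w+)\s*=\s*("[^"]*"|[^\s\\]+)')


def convert_value(raw):
    """Turn a raw right-hand side into an int, float, list of floats, or str."""
    text = raw.strip()
    quoted = len(text) >= 2 and text[0] == text[-1] == '"'
    if quoted:
        text = text[1:-1]
    # Try the scalar numeric types first.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    # Assumption: a quoted, whitespace-separated run of numbers is a list.
    parts = text.split()
    if quoted and parts:
        try:
            return [float(p) for p in parts]
        except ValueError:
            pass
    return text


def parse_blocks(text):
    """Return {tag_name: {variable_name: typed_value}} for every block."""
    result = {}
    for tag, body in BLOCK_RE.findall(text):
        values = {}
        for line in body.splitlines():
            line = line.strip()
            if line.startswith("#"):  # comment line, e.g. "# something similar"
                continue
            match = ASSIGN_RE.match(line)
            if match:
                values[match.group(1)] = convert_value(match.group(2))
        result[tag] = values
    return result


sample = r"""
<THING1> \
var1 = 0 \#
var2 = "0.0 100.0 0.0" \#
var3 = "IDENTIFYING_WORD" \#
var4 = 2 \#
</THING1>
<THING2> \
# something similar
var1 = 5 \#
</THING2>
"""

data = parse_blocks(sample)
print(data["THING1"])
# {'var1': 0, 'var2': [0.0, 100.0, 0.0], 'var3': 'IDENTIFYING_WORD', 'var4': 2}
```

Note that var2 comes back as floats rather than the integers shown in the question, because the file stores the values as 0.0 and 100.0. To read a real file, pass its contents to `parse_blocks` instead of `sample`.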