Reading .txt data with Python

I have a .txt file that looks like this:
# 经纬度
x1 = 11.21
x2 = 11.51
y1 = 27.84
y2 = 10.08
time: 201510010000
变量名: val1
[1.1,1.2,1.3]
变量名: va2
[1.0,1.01,1.02]
time: 201510010100
变量名: val1
[2.1,2.2,2.3]
变量名: va2
[2.01,2.02,2.03]
time: 2015020000
变量名: val1
[3.0,3.1,3.2]
变量名: val2
[3.01,3.02,3.03]
time: 2015020100
变量名: val1
[4.0,4.1,4.2]
变量名: val2
[401,4.02,4.03]
I would like to read it with Python and get the data in tabular form (the target layout was given as a picture).
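This is what I have done so far, but I do not know what to do next. How can I get there?

I suggest changing the .txt format and converting it to an .ini or .csv file. In any case, you can use a dictionary: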
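data = {}  # named "data" so the built-in dict is not shadowed
file = open("file.txt", encoding="utf-8")
text = file.readlines()  # readlines() returns every line; readline() returns only the first
file.close()
for i in range(len(text)):  # use len(); a list has no "lenght" attribute
    if text[i][0:5] == "time:":  # "time:" is five characters long
        key = text[i].strip()
        # the two value lists sit 2 and 4 lines below each "time:" line
        data[key] = []
        data[key].append(text[i + 2].strip())
        data[key].append(text[i + 4].strip())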
This code may work for your file, but if you change the format it will be easier to store the data in a dict.
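As a rough sketch of the dictionary idea, the blocks can be collected into a nested dict keyed first by timestamp and then by variable name. The function name `parse_blocks` and the shortened `SAMPLE` string are assumptions made for illustration; the labels are copied from the posted file, including the "va2" spelling.

```python
import ast

# A shortened copy of the sample file; labels are kept exactly as posted.
SAMPLE = """\
# 经纬度
x1 = 11.21
time: 201510010000
变量名: val1
[1.1,1.2,1.3]
变量名: va2
[1.0,1.01,1.02]
time: 201510010100
变量名: val1
[2.1,2.2,2.3]
变量名: va2
[2.01,2.02,2.03]
"""


def parse_blocks(text):
    """Return {time: {variable_name: [values]}} for the layout above."""
    result = {}
    current_time = None
    current_name = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("time:"):
            # Start a new block for this timestamp
            current_time = line.split(":", 1)[1].strip()
            result[current_time] = {}
            current_name = None
        elif line.startswith("变量名:"):
            # Remember which variable the next list line belongs to
            current_name = line.split(":", 1)[1].strip()
        elif line.startswith("[") and current_time and current_name:
            # Turn the textual list into a real Python list of numbers
            result[current_time][current_name] = ast.literal_eval(line)
    return result


parsed = parse_blocks(SAMPLE)
print(parsed["201510010000"]["val1"])
```

Because the variable name is read from the file rather than assumed by position, this version does not depend on the lists sitting exactly 2 and 4 lines below the time line.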
I hope this helps.

To get the data in the desired format, you can add the relevant parts to a dictionary and then convert it to a DataFrame:
import ast
import pandas as pd
with open('text.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
d = {"time": [],
     "val1": [],
     "val2": []}
for i, line in enumerate(lines):
    if line[:5] == "time:":
        time = line.strip().split()[-1]
        # Reading string representations of lists as lists
        v1 = ast.literal_eval(lines[i+2].strip())
        v2 = ast.literal_eval(lines[i+4].strip())
        # Counting number of vals per date
        n1 = len(v1)
        n2 = len(v2)
        # Padding values if any are missing (the count must be parenthesised)
        if n1 > n2:
            v2 += [None] * (n1 - n2)
        elif n2 > n1:
            v1 += [None] * (n2 - n1)
        d["time"].extend([time] * max(n1, n2))
        d["val1"].extend(v1)
        d["val2"].extend(v2)
df = pd.DataFrame(d)
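The padding step depends on list repetition, and `*` binds more tightly than `-`, so the repeat count needs its own parentheses. Here is a minimal sketch of that step on its own; the helper name `pad_pair` is hypothetical and is not part of the answer above.

```python
def pad_pair(v1, v2, fill=None):
    """Return copies of v1 and v2 padded with `fill` to the same length."""
    n = max(len(v1), len(v2))
    # (n - len(...)) is evaluated first, then the one-element list is repeated
    return v1 + [fill] * (n - len(v1)), v2 + [fill] * (n - len(v2))


a, b = pad_pair([1.1, 1.2, 1.3], [1.0])
print(a)  # [1.1, 1.2, 1.3]
print(b)  # [1.0, None, None]
```

Returning new lists instead of extending the inputs in place keeps the parsed values untouched, which may be easier to reason about when the same lists are reused.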
I am learning Python, and this is what I came up with :) If anyone reading this solution spots a mistake, please kindly point it out.
time = ""
val1 = []
val2 = []
final_list = []
process_val1 = False
process_val2 = False
with open('read.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
for line in lines:
    try:
        line = line.strip()
        # A complete record is pending: store it, then carry on so that
        # the current line (usually the next "time:" line) is not lost
        if val1 and val2 and time != '':
            for v1, v2 in zip(val1, val2):
                final_list.append([time, v1, v2])
            val1 = []
            val2 = []
            time = ''
        if process_val1:
            val1 = line.split('[')[1].split(']')[0].split(',')
            process_val1 = False
            continue
        if process_val2:
            val2 = line.split('[')[1].split(']')[0].split(',')
            process_val2 = False
            continue
        if 'time:' in line:
            time = line.split(":")[1].strip()
            continue
        elif 'val1' in line:
            process_val1 = True
            continue
        elif 'val2' in line or 'va2' in line:
            # the sample file spells the second variable both ways
            process_val2 = True
            continue
        else:
            continue
    except IndexError:
        # a line expected to hold a list or a time value is malformed; skip it
        pass
# Store the last record; no later line exists to trigger the check in the loop
if val1 and val2 and time != '':
    for v1, v2 in zip(val1, val2):
        final_list.append([time, v1, v2])
if final_list:
    with open('write.txt', 'w') as w:
        for row in final_list:
            w.write(", ".join(row) + '\n')
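If the rows are meant to be opened as a spreadsheet later, the standard `csv` module can take over the joining and line endings. This is only a sketch: the two sample rows are made up to match the `[time, v1, v2]` shape used above, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Rows shaped like the entries of final_list: [time, val1, val2] as strings
rows = [
    ["201510010000", "1.1", "1.0"],
    ["201510010000", "1.2", "1.01"],
]

# io.StringIO stands in for open('write.txt', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["time", "val1", "val2"])  # header row
writer.writerows(rows)                     # one CSV line per record

output = buffer.getvalue()
print(output)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''` so that line endings are not translated twice.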
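First, based on your description, I assume that x1, x2, y1 and y2 under "经纬度" (longitude and latitude) do not matter to you. Assuming that the data in the picture you showed us is what you want, and that the raw data is formatted as in the example (that is, there are only two data columns, val1 and val2; val1 and val2 always have 3 values per timestamp; val2 always comes after val1), the following solution should work: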
import re
# define 4 patterns
p1 = r'time:\s*(\d+)'  # for time: 201510010000
p2 = r'\[([\d\.]+),([\d\.]+),([\d\.]+)\]'  # for [1.1,2.1,3.1]
v1p = r'变量名:\s*val1'  # for val1
v2p = r'变量名:\s*val?2'  # for val2 (also accepts the "va2" spelling in the sample)
inV1 = False  # the flag to show if next list line is for val1
inV2 = False  # the flag to show if next list line is for val2
time_column = ''
with open('text.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
with open('output.csv', 'w', encoding='utf-8') as csv_f:  # open a csv file for writing
    csv_f.write('time,val1,val2')
    for line in lines:
        m = re.match(p1, line)
        if m and time_column != m.groups()[0]:
            time_column = m.groups()[0]
            # reset the flags
            inV1 = False
            inV2 = False
            continue
        if re.match(v1p, line):
            inV1, inV2 = True, False
            continue
        if re.match(v2p, line):
            # clear inV1 so the val2 list does not overwrite val1
            inV1, inV2 = False, True
            continue
        m = re.match(p2, line)
        if not m:
            continue
        if inV1:
            val1 = m.groups()
        if inV2:  # output all the values for a timestamp, since both val1 and val2 are ready now
            val2 = m.groups()
            for i in range(0, 3):
                l = "{0},{1},{2}".format(time_column, val1[i], val2[i])
                csv_f.write("\n" + l)
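The code above parses the given text and writes the formatted output to a csv file named "output.csv", located in the same folder as "text.txt". You can open it directly with MS Excel or any other spreadsheet editor or viewer.
I use regex here because it is the most flexible option: you can modify the patterns at any time to suit your needs without changing the rest of the logic. In addition, using flags has the advantage that the parser is not confused by lines that may be repeated in the text.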
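To illustrate how a pattern can be adapted without touching the surrounding logic, here is a sketch that extracts however many numbers a list line holds instead of exactly three. The names `NUMBER_RE` and `parse_list_line` are hypothetical, and the pattern assumes plain decimal numbers with an optional minus sign (no exponents).

```python
import re

# One number: optional sign, digits, optional fractional part
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def parse_list_line(line):
    """Return the numbers of a '[a,b,...]' line as floats, or None otherwise."""
    line = line.strip()
    if not (line.startswith("[") and line.endswith("]")):
        return None  # not a list line, e.g. a "time:" or label line
    return [float(x) for x in NUMBER_RE.findall(line)]


print(parse_list_line("[1.1,1.2,1.3]"))     # [1.1, 1.2, 1.3]
print(parse_list_line("[2.01,2.02]"))       # [2.01, 2.02]
print(parse_list_line("time: 2015020000"))  # None
```

With a helper like this, the output loop can iterate over `zip(val1, val2)` rather than a fixed `range(0, 3)`.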
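If you have further requirements, please leave a comment.

What is the next step? Personally, I would export the data to a .csv or .asc file. It is just a matter of parsing the format.
The data structure is complicated, so I am afraid exporting the data to .csv will not work.
You have a.txt, but you open text.txt to read the file. Can you clarify your question?
Thanks for pointing out the mistake; the file name is text.txt in my local folder.
Thank you for your help. Very pretty and simple code.
Thank you for your help. This is a magical method and it works well. Thanks again.
I have a question. Are the try and except blocks necessary? When I remove them, the code runs fine. Thanks :).
try and except blocks are used to handle exceptions in the code. It probably works fine now because I wrote the code based on the sample you provided. But I am not sure how it will react if the format of the data changes, so I added the try-except block.
Thanks again for your help. I have some questions. 1) Would it be better to use readlines on line 3? 2) What does the attribute "lenght" on line 5 mean? It raises AttributeError: 'list' object has no attribute 'lenght'.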
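The two questions in the last comment can be checked with a small experiment. This sketch uses an in-memory file (an assumption, so that it runs without text.txt) to show what `readline()` and `readlines()` return and why the misspelled attribute fails.

```python
import io

# An in-memory stand-in for the first lines of a block in the data file
sample = io.StringIO("time: 201510010000\n变量名: val1\n[1.1,1.2,1.3]\n")

first = sample.readline()   # a single line, returned as a str
rest = sample.readlines()   # all remaining lines, returned as a list of str
print(type(first).__name__, type(rest).__name__)  # str list
print(len(rest))  # 2

# Lists have no "lenght" (or "length") attribute; the built-in len() is used instead
try:
    rest.lenght
except AttributeError as exc:
    message = str(exc)
print(message)
```

Catching a specific exception such as `AttributeError` or `IndexError`, rather than using a bare `except:`, keeps unrelated errors visible while still guarding against malformed input.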