Cleanest way to parse this in Python
I have log lines in the format "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1" (the syntax is timerName:time/instances). This is how I parse it:
from collections import namedtuple

ServiceTimer = namedtuple("ServiceTimer", ["timerName", "time", "instances"])
timers = []
for entry in line.split(","):
    name, rest = entry.split(":")
    time, instances = rest.split("/")
    timers.append(ServiceTimer(name, float(time), int(instances)))
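For reference, here is the snippet above run end to end on the sample line (a self-contained sketch; note it needs the `namedtuple` import from `collections`):

```python
from collections import namedtuple

ServiceTimer = namedtuple("ServiceTimer", ["timerName", "time", "instances"])

line = "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1"
timers = []
for entry in line.split(","):
    name, rest = entry.split(":")       # e.g. "TimeA" and "0.216/1"
    time, instances = rest.split("/")   # e.g. "0.216" and "1"
    timers.append(ServiceTimer(name, float(time), int(instances)))

print(timers[1].timerName, timers[1].time)  # fields are readable by name
```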
If there is a better way, it also needs to be fast, since there are millions of log lines. Any pointers are appreciated. Fewer lines would be a plus.
Maybe in fewer lines:
for entry in line.split(','):
    split_line = entry.split(":")[1].split('/')
    timers.append(ServiceTimer(entry.split(':')[0], float(split_line[0]), int(split_line[1])))
Following @zaftcoAgeiha's suggestion, use a regular expression:
from re import finditer
line = "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1"
[ m.groups( ) for m in finditer( r'([^,:]*):([^/]*)/([^,]*)', line ) ]
You will get:
[('TimeA', '0.216', '1'),
('TimeB', '495.761', '1'),
('TimeC', '2.048', '2'),
('TimeD', '0.296', '1')]
For type conversion, you can use the group method:
[ ( m.group(1), float( m.group(2) ) , int( m.group(3) ))
for m in finditer( r'([^,:]*):([^/]*)/([^,]*)', line ) ]
Edit: to parse the whole file, you will want to compile the pattern first and use a list comprehension instead of append:
from re import compile
regex = compile( r'([^,:]*):([^/]*)/([^,]*)' )
with open( 'fname.txt', 'r' ) as fin:
    results = [ ( m.group(1), float( m.group(2) ), int( m.group(3) ) )
                for line in fin
                for m in regex.finditer( line ) ]
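If you still want `ServiceTimer` objects rather than bare tuples, the compiled pattern combines naturally with a generator, so the full result never has to sit in memory at once (a sketch; `iter_timers` and the filename are illustrative names, not from the original):

```python
import re
from collections import namedtuple

ServiceTimer = namedtuple("ServiceTimer", ["timerName", "time", "instances"])
regex = re.compile(r'([^,:]*):([^/]*)/([^,\s]*)')

def iter_timers(fname):
    """Yield one ServiceTimer per log entry, streaming the file line by line."""
    with open(fname) as fin:
        for line in fin:
            for m in regex.finditer(line):
                yield ServiceTimer(m.group(1), float(m.group(2)), int(m.group(3)))
```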
I tested three versions:
- your original code, without the namedtuple
- the regexp example with type conversion
- another regexp version with a few speed tricks
def process1():
    results = []
    with open('temp.txt') as fptr:
        for line in fptr:
            for entry in line.split(','):
                name, rest = entry.split(":")
                time, instances = rest.split("/")
                results.append((name, float(time), int(instances)))
    return len(results)

def process2():
    from re import finditer
    results = []
    with open('temp.txt') as fptr:
        for line in fptr:
            for match in finditer(r'([^,:]*):([^/]*)/([^,]*)', line):
                results.append(
                    (match.group(1), float(match.group(2)), int(match.group(3))))
    return len(results)

def process3():
    from re import finditer
    import mmap
    results = []
    # mmap needs a binary-mode file, and the pattern must be bytes to match it;
    # float() and int() accept bytes, but the names come back as bytes too.
    with open('temp.txt', 'r+b') as fptr:
        fmap = mmap.mmap(fptr.fileno(), 0)
        for match in finditer(rb'([^,:]*):([^/]*)/([^,\r\n]*)', fmap):
            results.append(
                (match.group(1), float(match.group(2)), int(match.group(3))))
    return len(results)
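These benchmarks need a `temp.txt` input; a file like the one described below can be generated in a few lines (a sketch, assuming the test data is just the sample line repeated):

```python
# Write one million copies of the sample log line to temp.txt.
sample = "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1\n"
with open("temp.txt", "w") as out:
    out.writelines(sample for _ in range(1_000_000))
```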
I tested these functions on a 'temp.txt' text file containing a million copies of the sample line. The results:
In [8]: %time temp.process1()
CPU times: user 10.24 s, sys: 0.00 s, total: 10.24 s
Wall time: 10.24 s
Out[8]: 4000000
In [9]: %time temp.process2()
CPU times: user 12.63 s, sys: 0.00 s, total: 12.63 s
Wall time: 12.63 s
Out[9]: 4000000
In [10]: %time temp.process3()
CPU times: user 9.43 s, sys: 0.00 s, total: 9.43 s
Wall time: 9.43 s
Out[10]: 4000000
So the regexp version that skips line-by-line processing and memory-maps the file is 7% faster than your sample code, while the plain regexp example is 23% slower than your sample.
Moral of the story: always benchmark.
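`%time` is IPython-specific; outside IPython, the standard `timeit` module gives comparable numbers. A minimal sketch, timing 100k parses of the single sample line rather than the whole file:

```python
import timeit

line = "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1"

def parse(line):
    out = []
    for entry in line.split(","):
        name, rest = entry.split(":")
        t, n = rest.split("/")
        out.append((name, float(t), int(n)))
    return out

# number= controls repetitions; timeit returns total elapsed seconds.
seconds = timeit.timeit(lambda: parse(line), number=100_000)
print(f"{seconds:.3f} s for 100k lines")
```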
Do you need the namedtuple? Do you need the regex groups? If you were using Perl, you could process this so much faster.
@allKid, not necessarily. I wonder whether using a dictionary would be faster.
Another moral of the story: you can't improve what you don't measure. If speed is a concern, I think a list comprehension is faster than append.
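On that last point, the append-free variant the comment alludes to would look something like this (a sketch; whether it actually beats the explicit loop is exactly the kind of thing to benchmark):

```python
import re

regex = re.compile(r'([^,:]*):([^/]*)/([^,\s]*)')
line = "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1"

# One list comprehension instead of a loop with .append().
results = [(m.group(1), float(m.group(2)), int(m.group(3)))
           for m in regex.finditer(line)]
```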