Python: how to parse a log text file, parse datetimes, and get the sum of the time deltas
Tags: python, regex, datetime, timedelta
I tried various ways to open the file and pass it in as a whole, but I could not get it to work: the output is either zero or empty. I have a log file of time entries. How can I calculate the total time spent by parsing this time log file? I am unable to analyze the file as a whole. I tried:
import re
import datetime
text="""5/1/12: 3:39am - 4:43am data file study
3:57pm - 5:06pm bg ui, combo boxes
7:44pm - 8:50pm bg ui with scaler; slider
10:30pm - 12:48am state texts; slider
5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
5/8/12: 11:00pm - 11:40pm mapMC,"""
total=re.findall("(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)",text)
print(sum([datetime.datetime.strptime(t[1],"%I:%M%p")-datetime.datetime.strptime(t[0],"%I:%M%p") for t in total],datetime.timedelta()))
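When I run this, I get the time as a negative value. How do I handle that?
To account for time spans that cross midnight, you have to calculate the duration on each of the two days separately and add them together.
Please refer to the code below: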
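import re
from datetime import datetime as dt, timedelta as td
strp = dt.strptime
with open("log.txt", "r") as f:
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", f.read())
# Same-day spans: end - start. Spans crossing midnight: time left until 11:59pm,
# plus time elapsed since 12:00am, plus the one minute between the two.
print(sum([strp(t[1], "%I:%M%p") - strp(t[0], "%I:%M%p")
           if strp(t[1], "%I:%M%p") > strp(t[0], "%I:%M%p")
           else (strp("11:59pm", "%I:%M%p") - strp(t[0], "%I:%M%p"))
                + (strp(t[1], "%I:%M%p") - strp("12:00am", "%I:%M%p"))
                + td(minutes=1)
           for t in total], td()))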
Output:
4 days, 9:13:00
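To make the cause of the negative result concrete, here is a minimal sketch using a single entry from the log, 10:30pm - 12:48am. Because strptime with only a time fills in the same default date for both values, the end time sorts before the start time. The variable names naive and fixed are illustrative, and adding one day is an equivalent shortcut to the two-part calculation above, not the answer's exact code.

```python
from datetime import datetime, timedelta

FMT = "%I:%M%p"

start = datetime.strptime("10:30pm", FMT)  # 1900-01-01 22:30
end = datetime.strptime("12:48am", FMT)    # 1900-01-01 00:48, same default date

# Naive subtraction goes backwards in time and yields a negative timedelta.
naive = end - start
print(naive)  # -1 day, 2:18:00

# If the end is earlier than the start, assume the span crossed midnight
# and push the end time to the next day.
fixed = naive + timedelta(days=1) if end < start else naive
print(fixed)  # 2:18:00
```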
You can parse the log file into a pandas DataFrame and then do the calculation easily:
import re
from datetime import timedelta
import pandas as pd
import dateparser
x="""5/1/12: 3:39am - 4:43am data file study
3:57pm - 5:06pm bg ui, combo boxes
7:44pm - 8:50pm bg ui with scaler; slider
10:30pm - 12:48am state texts; slider
5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
5/8/12: 11:00pm - 11:40pm mapMC,
5/9/12: 3:05pm - 3:42pm wholeMapMC, subMapMC, AS3 functions reading
10:35pm - 1:33am whole view data; scrollpane;
5/10/12: 6:10pm - 8:13pm blue slider
5/11/12: 8:45am - 12:10pm purple slider
1:30pm - 5:00pm Nitrate bar
11:18pm - 12:03am change NitrogenViewBase to static
5/12/12: 8:06am - 9:47am correct data and change NitrogenViewBase to static
5:45pm - 8:00pm costs bar, embed font
9:51pm - 12:31am costs bar
5/13/12: 7:45am - 8:45am read the Nitrogen Game doc
5/15/12: 2:07am - 5:09am corn
2:06pm - 5:11pm hypoxic zone
5/16/12: 2:53pm - 5:09pm data re-structure
7:00pm - 9:10pm sub sections watershed data
5/17/12: 12:30am - 2:32am sub sections sliders
10:30am - 11:45am meet with Dr. Lant and Blanca
3:09pm - 5:05pm crop yield and sub sections pink bar
7:00pm - 7:50pm sub sections nitrate to gulf bar
5/18/12: 3:15pm - 3:52pm sub sections slider legend
5/27/12: 5:46pm - 7:30pm feedback fixes
6/20/12: 2:57pm - 5:00pm Teachers' feedback fixes
7:30pm - 8:30pm
6/22/12: 3:40pm - 5:00pm
6/25/12: 3:24pm - 5:00pm
6/26/12: 11:24am - 12:35pm
7/4/12: 1:00pm - 10:00pm research on combobox with dropdown subitem - to no avail
7/5/12: 1:30am - 3:00am continue the research
9:31am - 12:45pm experiment on the combobox-subitem concept
3:45pm - 5:00pm
6:23pm - 8:14pm give up
8:18pm - 10:00pm zone change
11:07pm - 12:00am
7/10/12: 11:32am - 12:03pm added BASE_X and BASE_Y to the NitrogenSubView
4:15pm - 5:05pm fine-tune the whole view map
7:36pm - 8:46pm
7/11/12: 1:38am - 4:42am
7/31/12: 11:26am - 1:18pm study photoshop path shape
8/1/12: 2:00am - 3:41am collect the coordinates of wetland shapes
10:31am - 11:40am restorable wetlands implementation
4:00pm - 5:00pm
8/2/12: 12:20am - 4:42am
8/10/12: 2:30am - 4:55am sub watersheds color match; wetland color & size change
3/13/13: 6:06pm - 6:32pm Make the numbers in the triangle sliders bigger and bolder; Larger font on "Crop Yield Reduction"
"""
# We will store the records here
records = []
# Loop through the lines
for line in x.split("\n"):
    # Look for a date in the line
    match_date = re.search(r'(\d+/\d+/\d+)', line)
    if match_date is not None:
        # If a date exists, remember it; lines without a date reuse the last one
        date = match_date.group(1)
    # Extract the start and end times
    times = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", line)
    # If there is no valid time range in the line, skip it
    if len(times) == 0:
        continue
    # Parse date + time into datetime objects
    start = dateparser.parse(date + " " + times[0][0], languages=['en'])
    end = dateparser.parse(date + " " + times[0][1], languages=['en'])
    # Everything after the end time is the description
    content = line.split(times[0][1])[-1].strip()
    # Append the record
    records.append(dict(date=date, start=start, end=end, content=content))
df = pd.DataFrame(records)
# Correct the end time if it is earlier than the start time (span crosses midnight)
df.loc[df.start > df.end, "end"] = df[df.start > df.end].end + timedelta(days=1)
print("Total spent time :", (df.end - df.start).sum())
Output:
Total spent time : 4 days 09:13:00
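The same carry-the-date-forward idea can also be sketched with only the standard library, for readers who prefer not to install dateparser. This is an illustrative variant, not the answer's code: it assumes the mm/dd/yy date format, and the names parse_log, DATE_RE, TIME_RE and the four-line sample are hypothetical.

```python
import re
from datetime import datetime, timedelta

DATE_RE = re.compile(r"(\d{1,2}/\d{1,2}/\d{2}):")
TIME_RE = re.compile(r"(\d{1,2}:\d{2}[ap]m)\s*-\s*(\d{1,2}:\d{2}[ap]m)")
FMT = "%m/%d/%y %I:%M%p"  # assumption: dates are mm/dd/yy

def parse_log(text):
    """Return (start, end, description) tuples; undated lines reuse the last date."""
    records = []
    current_date = None
    for line in text.splitlines():
        date_match = DATE_RE.search(line)
        if date_match:
            current_date = date_match.group(1)
        time_match = TIME_RE.search(line)
        if time_match is None or current_date is None:
            continue  # nothing to parse on this line
        start = datetime.strptime(f"{current_date} {time_match.group(1)}", FMT)
        end = datetime.strptime(f"{current_date} {time_match.group(2)}", FMT)
        if end < start:
            end += timedelta(days=1)  # the span crossed midnight
        records.append((start, end, line[time_match.end():].strip()))
    return records

sample = """5/1/12: 3:39am - 4:43am data file study
3:57pm - 5:06pm bg ui, combo boxes
10:30pm - 12:48am state texts; slider
5/8/12: 11:00pm - 11:40pm mapMC,
"""
records = parse_log(sample)
total = sum((end - start for start, end, _ in records), timedelta())
print(total)  # 5:11:00
```

Keeping real dates on each record, rather than bare times, means the entries can later be grouped per day or per month if needed.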
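Liju and Sebastien D have already provided two interesting, working solutions. Here I propose two new variants which, while similar, have significant performance advantages.
The two existing solutions approach the problem as follows:
- The solution proposed by Liju (one_pass): gets the regex matches and sums over a list built with a list comprehension. Inside the comprehension, it parses the same two strings to datetime in three places (evaluating the if condition, producing the if output, or producing the else output).
- The solution proposed by Sebastien D (w_dateparser): takes each line of text and tries to regex a date out of it, then tries to find the start/end times in that line (this could be improved into a single regex, but the regex is not the bottleneck of this solution). It then uses dateparser to combine the date and time, and collects the text description. This is closer to a full-fledged parser, but for the purposes of the timing test I removed the description feature.
The two new variants:
- The two_pass solution: similar to one_pass, but in the first pass it only parses the strings to datetime, and in the second pass it evaluates start > end and sums the correct time deltas. Its main advantage is that each date is parsed only once; its drawback is that it has to iterate twice.
- The w_pure_pandas solution: similar to w_dateparser, but calls the regex only once and uses pandas' built-in to_datetime for parsing.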
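The difference in parsing work between the one-pass and two-pass styles can be made visible by counting calls. The sketch below is a simplified illustration rather than the benchmarked code: parse is a hypothetical counting wrapper around datetime.strptime, and the else branch of one_pass_style uses the shorter add-one-day formula instead of Liju's 11:59pm/12:00am arithmetic (which would parse even more strings).

```python
import re
from datetime import datetime, timedelta

PATTERN = r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)"
calls = 0

def parse(s):
    """Hypothetical helper: strptime plus a global call counter."""
    global calls
    calls += 1
    return datetime.strptime(s, "%I:%M%p")

def one_pass_style(text):
    # Parses each pair in the condition and again in the chosen branch.
    return sum(
        [
            parse(b) - parse(a)
            if parse(b) > parse(a)
            else parse(b) - parse(a) + timedelta(days=1)
            for a, b in re.findall(PATTERN, text)
        ],
        timedelta(),
    )

def two_pass_style(text):
    # First pass: parse every string exactly once.
    pairs = [(parse(a), parse(b)) for a, b in re.findall(PATTERN, text)]
    # Second pass: compare and sum the already-parsed values.
    return sum(
        (e - s if e > s else e - s + timedelta(days=1) for s, e in pairs),
        timedelta(),
    )

sample = "3:39am - 4:43am\n10:30pm - 12:48am"

calls = 0
result_one = one_pass_style(sample)
calls_one = calls

calls = 0
result_two = two_pass_style(sample)
calls_two = calls

print(result_one, calls_one)  # 3:22:00 8
print(result_two, calls_two)  # 3:22:00 4
```

Both styles return the same total; the two-pass style simply does half the parsing for these two entries.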
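w_dateparser is by far the slowest solution.
If we zoom in to compare the other three solutions, we see that w_pure_pandas is slightly slower than the others for short texts, but it excels on longer inputs because it leverages the NumPy C implementation (as opposed to the list comprehensions used by the other solutions). Second, two_pass is generally faster than one_pass, and the gap widens for longer texts.
Code for two_pass and w_pure_pandas: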
def two_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    total = [
        (datetime.datetime.strptime(t[0], '%I:%M%p'),
         datetime.datetime.strptime(t[1], '%I:%M%p'))
        for t in total
    ]
    return sum(
        (
            end - start if end > start
            else end - start + datetime.timedelta(days=1)
            for start, end in total
        )
        , datetime.timedelta()
    )

def w_pure_pandas(text):
    import pandas as pd
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    df = pd.DataFrame(total, columns=['start', 'end'])
    for col in df:
        # pandas.to_datetime has issues with date compatibility
        # but since we only care for time deltas,
        # we can just use the default behavior
        df[col] = pd.to_datetime(df[col])
    df.loc[df.start > df.end, 'end'] += datetime.timedelta(days=1)
    return df.diff(axis=1).sum()['end']
Full code for all solutions and the timing test:
import re
import datetime
import timeit
from matplotlib import pyplot as plt
text = '''
Time Log Nitrogen:
5/1/12: 3:39am - 4:43am data file study
3:57pm - 5:06pm bg ui, combo boxes
7:44pm - 8:50pm bg ui with scaler; slider
10:30pm - 12:48am state texts; slider
5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
5/8/12: 11:00pm - 11:40pm mapMC,
5/9/12: 3:05pm - 3:42pm wholeMapMC, subMapMC, AS3 functions reading
10:35pm - 1:33am whole view data; scrollpane;
5/10/12: 6:10pm - 8:13pm blue slider
5/11/12: 8:45am - 12:10pm purple slider
1:30pm - 5:00pm Nitrate bar
11:18pm - 12:03am change NitrogenViewBase to static
5/12/12: 8:06am - 9:47am correct data and change NitrogenViewBase to static
5:45pm - 8:00pm costs bar, embed font
9:51pm - 12:31am costs bar
5/13/12: 7:45am - 8:45am read the Nitrogen Game doc
5/15/12: 2:07am - 5:09am corn
2:06pm - 5:11pm hypoxic zone
5/16/12: 2:53pm - 5:09pm data re-structure
7:00pm - 9:10pm sub sections watershed data
5/17/12: 12:30am - 2:32am sub sections sliders
10:30am - 11:45am meet with Dr. Lant and Blanca
3:09pm - 5:05pm crop yield and sub sections pink bar
7:00pm - 7:50pm sub sections nitrate to gulf bar
5/18/12: 3:15pm - 3:52pm sub sections slider legend
5/27/12: 5:46pm - 7:30pm feedback fixes
6/20/12: 2:57pm - 5:00pm Teachers' feedback fixes
7:30pm - 8:30pm
6/22/12: 3:40pm - 5:00pm
6/25/12: 3:24pm - 5:00pm
6/26/12: 11:24am - 12:35pm
7/4/12: 1:00pm - 10:00pm research on combobox with dropdown subitem - to no avail
7/5/12: 1:30am - 3:00am continue the research
9:31am - 12:45pm experiment on the combobox-subitem concept
3:45pm - 5:00pm
6:23pm - 8:14pm give up
8:18pm - 10:00pm zone change
11:07pm - 12:00am
7/10/12: 11:32am - 12:03pm added BASE_X and BASE_Y to the NitrogenSubView
4:15pm - 5:05pm fine-tune the whole view map
7:36pm - 8:46pm
7/11/12: 1:38am - 4:42am
7/31/12: 11:26am - 1:18pm study photoshop path shape
8/1/12: 2:00am - 3:41am collect the coordinates of wetland shapes
10:31am - 11:40am restorable wetlands implementation
4:00pm - 5:00pm
8/2/12: 12:20am - 4:42am
8/10/12: 2:30am - 4:55am sub watersheds color match; wetland color & size change
3/13/13: 6:06pm - 6:32pm Make the numbers in the triangle sliders
bigger and bolder; Larger font on "Crop Yield Reduction"
'''
def one_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    return sum(
        [
            datetime.datetime.strptime(t[1], '%I:%M%p')
            - datetime.datetime.strptime(t[0], '%I:%M%p')
            if datetime.datetime.strptime(t[1], '%I:%M%p') >
            datetime.datetime.strptime(t[0], '%I:%M%p')
            else
            datetime.datetime.strptime('11:59pm', '%I:%M%p')
            - datetime.datetime.strptime(t[0], '%I:%M%p')
            + datetime.datetime.strptime(t[1], '%I:%M%p')
            - datetime.datetime.strptime('12:00am', '%I:%M%p')
            + datetime.timedelta(minutes=1)
            for t in total
        ]
        , start=datetime.timedelta()
    )

def w_dateparser(text):
    import pandas as pd
    import dateparser
    # We will store the records here
    records = []
    # Loop through the lines
    # t0 = t1 = t2 = 0
    for line in text.split("\n"):
        # Look for a date in the line
        # t0 = time() - t0
        match_date = re.search(r'(\d+/\d+/\d+)', line)
        if match_date is not None:
            # If a date exists, store it in a variable
            date = match_date.group(1)
        # Extract times
        times = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", line)
        # t0 = time() - t0
        # If there is no valid time in the line, skip it
        if len(times) == 0:
            continue
        # t1 = time() - t1
        # Parse dates
        start = dateparser.parse(date + " " + times[0][0], languages=['en'])
        end = dateparser.parse(date + " " + times[0][1], languages=['en'])
        # content = line.split(times[0][1])[-1].strip()
        # t1 = time() - t1
        # Append records
        # records.append(dict(date=date, start= start, end = end, content =content))
        records.append(dict(date=date, start=start, end=end))
    # t2 = time() - t2
    df = pd.DataFrame(records)
    # print(df)
    # Correct the end time if it is earlier than the start time
    df.loc[df.start > df.end, "end"] = df[df.start > df.end].end + datetime.timedelta(days=1)
    # t2 = time() - t2
    # print(t0, t1, t2)
    return (df.end - df.start).sum()

def two_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    total = [
        (datetime.datetime.strptime(t[0], '%I:%M%p'),
         datetime.datetime.strptime(t[1], '%I:%M%p'))
        for t in total
    ]
    return sum(
        (
            end - start if end > start
            else end - start + datetime.timedelta(days=1)
            for start, end in total
        )
        , datetime.timedelta()
    )

def w_pure_pandas(text):
    import pandas as pd
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    df = pd.DataFrame(total, columns=['start', 'end'])
    for col in df:
        # pandas.to_datetime has issues with date compatibility
        # but since we only care for time deltas,
        # we can just use the default behavior
        df[col] = pd.to_datetime(df[col])
    df.loc[df.start > df.end, 'end'] += datetime.timedelta(days=1)
    return df.diff(axis=1).sum()['end']

# pandas is only imported inside the functions above, so import it here too:
# the timing table below needs pd at module level.
import pandas as pd

timings = {}
for l in [1, 5, 10, 50, 100]:
    text_long = text * l
    n = 2
    timings[l] = {}
    for func in ['two_pass', 'one_pass', 'w_pure_pandas', 'w_dateparser']:
        # Average runtime over n runs of func(text_long)
        t = timeit.timeit(f"{func}(text_long)", number=n, globals=globals()) / n
        timings[l][func] = t
timings = pd.DataFrame(timings).T
timings.info()
print(timings)
# Plot all four solutions
timings.plot()
plt.xlabel('multiplier for lines of text')
plt.ylabel('runtime (s)')
plt.grid(True)
plt.show()
plt.close('all')
# Zoom in on the three faster solutions
timings[['two_pass', 'one_pass', 'w_pure_pandas']].plot()
plt.xlabel('multiplier for lines of text')
plt.ylabel('runtime (s)')
plt.grid(True)
plt.show()
plt.close('all')
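Comments:
Put an r in front of the regex pattern. Does that find matches?
How will you account for periods that span midnight...
I suggest creating a list of tuples containing datetime start/end objects, and then at end ...
I am a Python beginner.
What is your date format?
@SebastienD It is mm/dd/yy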
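As a small illustration of the comment about the r prefix: in a regular string literal Python interprets some backslash sequences itself before the regex engine sees them, while a raw string passes the backslashes through untouched. The sketch below uses one line from the log; the variable names are illustrative.

```python
import re

# "\b" in a normal string is a single backspace character, whereas r"\b" is
# backslash + "b", which the regex engine reads as a word boundary.
plain_len = len("\b")
raw_len = len(r"\b")
print(plain_len, raw_len)  # 1 2

# With a raw string the pattern reaches the regex engine exactly as written.
pattern = r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)"
matches = re.findall(pattern, "10:30pm - 12:48am state texts; slider")
print(matches)  # [('10:30pm', '12:48am')]
```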