Python: reading a log file with <key>=<value> pairs

python pandas

For some odd reason, I have to read a log file of the following form:

Tue Apr  3 08:51:05 2018 foo=123 bar=321 spam=eggs msg="String with spaces in it"
Tue Apr  3 10:31:46 2018 foo=111 bar=222 spam=eggs msg="Different string with spaces"
...

I would like to read it in as the following DataFrame:

   bar  foo                       msg  spam                      time
0  321  123  String with spaces in it  eggs  Tue Apr  3 08:51:05 2018
1  222  111          Different string  eggs  Tue Apr  3 10:31:46 2018
...

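where each <key>=<value> pair has its own column, and the date at the start gets its own column, time.

Is there a pandas way of handling this (or even just the <key>=<value> part)?

Or, at the very least, is there a better way than regex to split all of this into a form that pandas can accept?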
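
Thanks to @edouardtheron and the shlex module for the nudge in the right direction.
If you have a better solution, please feel free to answer.
In the meantime, here is what I came up with. First, import the libraries: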
import re
import shlex
import pandas as pd

Create some example data:
# Example data
test_string = """
Tue Apr  3 08:51:05 2018 foo=123 bar=321 spam=eggs msg="String with spaces in it"
Tue Apr  3 10:31:46 2018 foo=111 bar=222 spam=eggs msg="Different string"
"""

Create a regex that matches the whole line but splits it into two groups:
1: the date at the start: ((?:[a-zA-Z]{3,4} ){2} \d \d\d:\d\d:\d\d \d{4})

2: everything else: (.*)
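
The two groups can be checked on a single sample line. This is only a sketch: `patt` is an assumed reconstruction of the pattern described above, and `patt_any_day` is a hypothetical variant (not part of the answer) that also accepts two-digit days, which the first pattern rejects.

```python
import re

# Assumed reconstruction of the two-group pattern described above.
patt = re.compile(r'((?:[a-zA-Z]{3,4} ){2} \d \d\d:\d\d:\d\d \d{4}) (.*)')

# Hypothetical variant (not from the answer) that also accepts two-digit days.
patt_any_day = re.compile(r'((?:[a-zA-Z]{3,4} ){2} ?\d{1,2} \d\d:\d\d:\d\d \d{4}) (.*)')

line = 'Tue Apr  3 08:51:05 2018 foo=123 bar=321 spam=eggs msg="String with spaces in it"'
log_time, key_values = patt.match(line).groups()
print(log_time)    # Tue Apr  3 08:51:05 2018
print(key_values)  # foo=123 bar=321 spam=eggs msg="String with spaces in it"

# A day such as 13 has no padding space, so the first pattern does not match it.
two_digit_day = 'Fri Apr 13 10:31:46 2018 foo=111'
print(patt.match(two_digit_day))                   # None
print(patt_any_day.match(two_digit_day).groups())  # ('Fri Apr 13 10:31:46 2018', 'foo=111')
```

If the real log contains days 10 to 31, something like the second pattern is probably needed.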

Loop over the lines of the test string, apply the regex, then use shlex to split the key/value part into tokens:

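# Assumed reconstruction of the regex described above (group 1: date, group 2: the rest)
patt = re.compile(r'((?:[a-zA-Z]{3,4} ){2} \d \d\d:\d\d:\d\d \d{4}) (.*)')

sers = []
for line in test_string.split('\n'):
    matt = patt.match(line)
    if not matt:
        # skip the empty lines
        continue
    # Extract the two groups: the timestamp and the key=value remainder
    time, key_values = matt.groups()
    # shlex.split keeps quoted values together; split each token on the first '=' only
    ser = pd.Series(dict(token.split('=', 1) for token in shlex.split(key_values)))
    ser['log_time'] = time
    sers.append(ser)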
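
To see what the shlex step contributes, here is a small standalone sketch of just that part. The `expr="a=b"` pair is an invented example (not from the log above), added to show why each token is split on the first `=` only.

```python
import shlex

# 'expr' is a made-up key used only to illustrate a value containing '='.
key_values = 'foo=123 spam=eggs msg="String with spaces in it" expr="a=b"'

# shlex.split honours the quotes, so the quoted value stays in one token
# and the quote characters themselves are dropped.
tokens = shlex.split(key_values)
print(tokens)
# ['foo=123', 'spam=eggs', 'msg=String with spaces in it', 'expr=a=b']

# maxsplit=1 splits on the first '=' only, so '=' inside a value survives.
record = dict(token.split('=', 1) for token in tokens)
print(record['msg'])   # String with spaces in it
print(record['expr'])  # a=b
```

All values come back as strings; converting `foo` or `bar` to numbers would be a separate step.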

Finally, concatenate all the rows into a single DataFrame:
# Concatenate the Series into a DataFrame (one row per log line)
df = pd.concat(sers, axis=1).T
# Change the type of 'log_time' to an actual date
df['log_time'] = pd.to_datetime(df['log_time'], format='%a %b  %d %X %Y', exact=True)

This produces the following DataFrame:
   bar  foo                       msg  spam            log_time
0  321  123  String with spaces in it  eggs 2018-04-03 08:51:05
1  222  111          Different string  eggs 2018-04-03 10:31:46
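
For comparison, here is an alternative sketch that is not the method used above: it collects plain dicts and builds the DataFrame in one call instead of concatenating Series, and it parses the timestamp with `datetime.strptime` using an explicit `%H:%M:%S` rather than the locale-dependent `%X`. `LOG_TEXT`, `LINE_RE` and `parse_line` are hypothetical names, the regex is the same assumed reconstruction, and the day and month names are assumed to be in English.

```python
import re
import shlex
from datetime import datetime

import pandas as pd

LOG_TEXT = """
Tue Apr  3 08:51:05 2018 foo=123 bar=321 spam=eggs msg="String with spaces in it"
Tue Apr  3 10:31:46 2018 foo=111 bar=222 spam=eggs msg="Different string"
"""

# Same assumed pattern as above: timestamp first, key=value pairs after.
LINE_RE = re.compile(r'((?:[a-zA-Z]{3,4} ){2} \d \d\d:\d\d:\d\d \d{4}) (.*)')


def parse_line(line):
    """Hypothetical helper: return a dict for one log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, key_values = m.groups()
    record = dict(token.split('=', 1) for token in shlex.split(key_values))
    # Whitespace in a strptime format matches any run of whitespace,
    # so the padded day ("Apr  3") parses with a single space here.
    record['log_time'] = datetime.strptime(stamp, '%a %b %d %H:%M:%S %Y')
    return record


# Blank lines do not match the pattern and are filtered out.
records = [rec for rec in map(parse_line, LOG_TEXT.splitlines()) if rec is not None]
df = pd.DataFrame(records)
print(df)
```

Building from a list of dicts avoids the transpose, and lines with different keys simply end up with missing values in the columns they lack.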

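Did you try extracting all the key/value pairs into a dict? Note that the values must be of type list (or Series). Then you can simply do df = pd.DataFrame(data=my_dict)

@edouardtheron Is there an easy way to go from key/value to a dict without regex? (I'll google it.) Also, key=value is what I was looking for; I don't know why my brain went to property=value

Given that you are parsing a string that represents one line of the log file: pairs = line.split(' '); my_dict = {}; for pair in pairs: key = pair.split('=')[0]; value = pair.split('=')[1]; my_dict[key] = value
Would that work? Sorry for the awful formatting in the comment... but you get the idea. EDIT: actually, it won't work, since you specified that the strings contain spaces. Never mind

@edouardtheron That was my first thought, but the key=value pairs can have spaces and/or = signs in the values, so line.split(' ') doesn't quite work.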
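
The problem raised in these comments is easy to reproduce. The following is a minimal sketch with a made-up, shortened line; it contrasts a plain `split(' ')` with `shlex.split`.

```python
import shlex

line = 'foo=123 msg="String with spaces in it"'

# Naive approach from the comments: the quoted value is torn apart.
naive = line.split(' ')
print(naive)
# ['foo=123', 'msg="String', 'with', 'spaces', 'in', 'it"']

# Building a dict from those pieces fails, because 'with' has no '=' in it.
try:
    dict(piece.split('=') for piece in naive)
    naive_failed = False
except ValueError:
    naive_failed = True
print(naive_failed)  # True

# shlex.split treats the double quotes like a shell would.
quoted = shlex.split(line)
print(quoted)
# ['foo=123', 'msg=String with spaces in it']
```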