Python: left-greedy vs. right-greedy column assignment when reading a text file in Pandas
Suppose we have a log file that contains lines like this:
Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico
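This is not a CSV file. However, suppose I assume:
- sep=' '
- 3 columns
In theory, there are two ways to load this file into a dataframe in Pandas:
Left-greedy
- Columns 1 and 2 are assigned to the first two sep=' ' splits, and column 3 is assigned to whatever text remains at the end of each line.
- This would result in:
  - Col1=Mar-13-19:04:13
  - Col2=[error]
  - Col3=File does not exist: /var/www/favicon.ico
Right-greedy
- Columns 2 and 3 are assigned to the last two sep=' ' splits, and column 1 is assigned to whatever text remains at the beginning of each line.
- This would result in:
  - Col1=Mar-13-19:04:13 [error] File does not
  - Col2=exist:
  - Col3=/var/www/favicon.ico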
So my questions are:
- How can I load this file into Pandas using the left-greedy mode?
- If I specify error_bad_lines=False in read_csv, does Pandas follow the left-greedy mode? The right-greedy mode? Or neither?
from io import StringIO
import pandas as pd

data = """Mar-13-19:04:13 [error] client File does not exist: /var/www/favicon.ico"""
pd.read_csv(StringIO(data), sep=' ', names=['a', 'b', 'c'])
                                            a       b                     c
Mar-13-19:04:13 [error] client File does  not  exist:  /var/www/favicon.ico
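The output above suggests that read_csv is neither left-greedy nor right-greedy here: the line has eight space-separated fields and only three names, so the last three fields receive the names and the five leading fields appear to be moved into the index instead of being joined into one column. A minimal sketch to check this on your own pandas version (this is an observation based on the output above, not a documented guarantee, and it may differ between releases):

```python
from io import StringIO

import pandas as pd

data = "Mar-13-19:04:13 [error] client File does not exist: /var/www/favicon.ico"

# Eight space-separated fields, but only three column names.
df = pd.read_csv(StringIO(data), sep=' ', names=['a', 'b', 'c'])

# The named columns hold the LAST three fields of the line ...
print(df.iloc[0].tolist())

# ... and the remaining leading fields end up as index levels,
# not concatenated into a single text column.
print(df.index.nlevels)
```

Neither greedy mode falls out of read_csv with a single-character separator, so the line has to be split explicitly.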
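One approach is to read the data into a single column and use a regex to extract the required values:
# the sample line contains no comma, so the default sep=',' yields a single column
df = pd.read_csv(StringIO(data), names=['data'])
# raw string, so that \s, \[ and \] reach the regex engine unchanged
df['data'].str.extract(r'(?P<a>.*)\s\[(?P<b>.*)\]\s(?P<c>.*)')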
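To see what each named group in the pattern above captures, it can be tried on the sample line with Python's built-in re module. This is only an illustration of the pattern; Series.str.extract applies the same regex to every row. Note that the brackets around error are matched but not captured.

```python
import re

line = "Mar-13-19:04:13 [error] client File does not exist: /var/www/favicon.ico"

# Same pattern as above, written as a raw string.
pattern = re.compile(r'(?P<a>.*)\s\[(?P<b>.*)\]\s(?P<c>.*)')

match = pattern.match(line)
groups = match.groupdict()
print(groups)
# {'a': 'Mar-13-19:04:13', 'b': 'error', 'c': 'client File does not exist: /var/www/favicon.ico'}
```

Because .* is greedy, a message that itself contains a bracketed token followed by whitespace would move the split point to the last such token. Using the non-greedy .*? for groups a and b is one way to avoid that.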
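You can read the complete log file in as a dataframe with a single column. Then use str.split (or str.rsplit) with expand=True to expand each resulting list into its own columns: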
from io import StringIO
import pandas as pd

txt = StringIO('''Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico
Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico
Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico
Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico
Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico
''')
left_greedy = True
# read in the text as a one-column dataframe (one row per line, trailing '\n' kept)
df = pd.DataFrame(txt)
if left_greedy:
    # split on the first two spaces; the remainder goes to the last column
    df = df[0].str.split(pat=' ', n=2, expand=True)
else:
    # split on the last two spaces; the remainder goes to the first column
    df = df[0].str.rsplit(pat=' ', n=2, expand=True)
# assign correct column names
df.columns = ['Col1', 'Col2', 'Col3']
Output with left_greedy=True:
              Col1     Col2                                         Col3
0  Mar-13-19:04:13  [error]  File does not exist: /var/www/favicon.ico\n
1  Mar-13-19:04:13  [error]  File does not exist: /var/www/favicon.ico\n
2  Mar-13-19:04:13  [error]  File does not exist: /var/www/favicon.ico\n
3  Mar-13-19:04:13  [error]  File does not exist: /var/www/favicon.ico\n
4  Mar-13-19:04:13  [error]  File does not exist: /var/www/favicon.ico\n
Output with left_greedy=False:
                                    Col1    Col2                    Col3
0  Mar-13-19:04:13 [error] File does not  exist:  /var/www/favicon.ico\n
1  Mar-13-19:04:13 [error] File does not  exist:  /var/www/favicon.ico\n
2  Mar-13-19:04:13 [error] File does not  exist:  /var/www/favicon.ico\n
3  Mar-13-19:04:13 [error] File does not  exist:  /var/www/favicon.ico\n
4  Mar-13-19:04:13 [error] File does not  exist:  /var/www/favicon.ico\n
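The outputs above still carry the trailing \n in the last column, because each line is stored together with its line ending. A minimal sketch that strips line endings before splitting; the helper name split_log_lines and its parameters are hypothetical, and it assumes every non-blank line contains at least two spaces:

```python
from io import StringIO

import pandas as pd


def split_log_lines(lines, left_greedy=True):
    """Split each line into three columns on single spaces (hypothetical helper)."""
    # Drop line endings and skip blank lines.
    cleaned = pd.Series([ln.rstrip('\n') for ln in lines if ln.strip()])
    if left_greedy:
        # First two spaces delimit Col1 and Col2; the rest is Col3.
        parts = cleaned.str.split(' ', n=2, expand=True)
    else:
        # Last two spaces delimit Col2 and Col3; the rest is Col1.
        parts = cleaned.str.rsplit(' ', n=2, expand=True)
    parts.columns = ['Col1', 'Col2', 'Col3']
    return parts


sample = StringIO(
    "Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico\n"
    "Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico\n"
)
left = split_log_lines(sample, left_greedy=True)
sample.seek(0)  # rewind so the same buffer can be read again
right = split_log_lines(sample, left_greedy=False)
```

The helper only iterates over its argument, so an open file object should work in place of the StringIO buffer.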
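Fallback approach, which works for a single example line: you can use the vanilla str.split (or str.rsplit) method to split from the left or from the right, then use the pandas DataFrame constructor to build the dataframe: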
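import numpy as np
import pandas as pd

txt = "Mar-13-19:04:13 [error] File does not exist: /var/www/favicon.ico"
left_greedy = True
if left_greedy:
    txt = txt.split(' ', 2)   # at most 2 splits, counted from the left
else:
    txt = txt.rsplit(' ', 2)  # at most 2 splits, counted from the right
# column_stack turns the three strings into a single row with three columns
df = pd.DataFrame(np.column_stack(txt), columns=['Col1', 'Col2', 'Col3'])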
Output:
              Col1     Col2                                       Col3
0  Mar-13-19:04:13  [error]  File does not exist: /var/www/favicon.ico
If we set left_greedy=False:
                                    Col1    Col2                  Col3
0  Mar-13-19:04:13 [error] File does not  exist:  /var/www/favicon.ico
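It would be easier if you included what the example line should look like in a dataframe. Your explanation may not be clear to everyone, but showing it as a dataframe makes it obvious.
I've just updated the OP to show exactly how the two modes would work in this example. Thanks @Erfan
+1 Thanks, but this only works for one line, right? How would you do this for a text file with multiple lines?
Edited my approach, which now works for my example text file @AmelioVazquez Reina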