Python: parsing vertical text in a file with repeating blocks

What is the best way to parse the following file? The blocks repeat many times, and the expected result is output to a CSV file with rows like this:

{Place: REGION-1, Host: ABCD, Area: 44…}

I tried the code below, but it only iterates over the first block and then finishes:
import re

with open('/tmp/t2.txt', 'r') as input_data:
    for line in input_data:
        if re.findall('(.*_RV)\n', line):
            myDict = {}
            myDict['HOST'] = line[6:]
            continue
        elif re.findall('Interface(.*)\n', line):
            myDict['INTF'] = line[6:]
        elif len(line.strip()) == 0:
            print(myDict)
The text file is below:
Instance REGION-1:
  ABCD_RV
    Interface: fastethernet01/01
    Last state change: 0h54m44s ago
    Sysid: 01441
    Speaks: IPv4
    Topologies:
      ipv4-unicast
    SAPA: point-to-point
    Area Address(es):
      441
    IPv4 Address(es):
      1.1.1.1

  EFGH_RV
    Interface: fastethernet01/01
    Last state change: 0h54m44s ago
    Sysid: 01442
    Speaks: IPv4
    Topologies:
      ipv4-unicast
    SAPA: point-to-point
    Area Address(es):
      442
    IPv4 Address(es):
      1.1.1.2

Instance REGION-2:
  IJKL_RV
    Interface: fastethernet01/01
    Last state change: 0h54m44s ago
    Sysid: 01443
    Speaks: IPv4
    Topologies:
      ipv4-unicast
    SAPA: point-to-point
    Area Address(es):
      443
    IPv4 Address(es):
      1.1.1.3
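For reference, one pattern that avoids losing blocks is to append each block's dict to a list as soon as its `_RV` line starts it, so nothing depends on seeing a trailing blank line. A minimal sketch with a small hypothetical two-block sample (the field names and layout are assumptions based on the file above):

```python
import re

# Hypothetical two-block sample mirroring the layout described above.
sample = """\
Instance REGION-1:
  ABCD_RV
    Interface: fastethernet01/01
    Sysid: 01441

  EFGH_RV
    Interface: fastethernet01/01
    Sysid: 01442
"""

records = []
current = {}
for line in sample.splitlines():
    stripped = line.strip()
    if stripped.endswith('_RV'):       # start of a new host block
        current = {'HOST': stripped[:-3]}
        records.append(current)
    elif ':' in stripped and current:  # 'key: value' pairs inside a block
        key, _, value = stripped.partition(':')
        if value.strip():
            current[key.strip()] = value.strip()

print(records)
```

Keys whose value sits on the following line (like `IPv4 Address(es):`) would still need the look-ahead handling shown in the answers below.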
This works for me, but it isn't pretty:
import pandas as pd

with open('/tmp/t2.txt', 'r') as input_data:
    text = input_data.read()
text = text.rstrip(' ').rstrip('\n').strip('\n')
# first I get ready to create a csv by replacing the headers for the data
text = text.replace('Instance REGION-1:', ',')
text = text.replace('Instance REGION-2:', ',')
text = text.replace('Interface:', ',')
text = text.replace('Last state change:', ',')
text = text.replace('Sysid:', ',')
text = text.replace('Speaks:', ',')
text = text.replace('Topologies:', ',')
text = text.replace('SAPA:', ',')
text = text.replace('Area Address(es):', ',')
text = text.replace('IPv4 Address(es):', ',')
# now I strip out the leading whitespace, because it messes up the split on '\n\n'
lines = [x.lstrip(' ') for x in text.split('\n')]
clean_text = ''
# now that the leading whitespace is gone I recreate the text file
for line in lines:
    clean_text += line + '\n'
# now split the data into groups based on single entries
entries = clean_text.split('\n\n')
# create one-liners out of the entries so they can be split like csv
entry_lines = [x.replace('\n', ' ') for x in entries]
# create a dataframe to hold the data for each line
df = pd.DataFrame(columns=['Instance REGION', 'Interface',
                           'Last state change', 'Sysid', 'Speaks',
                           'Topologies', 'SAPA', 'Area Address(es)',
                           'IPv4 Address(es)']).T
# now the meat and potatoes
count = 0
for line in entry_lines:
    data = line[1:].split(',')  # split like a csv on commas
    data = [x.lstrip(' ').rstrip(' ') for x in data]  # get rid of extra leading/trailing whitespace
    df[count] = data  # create an entry for each split
    count += 1  # increment the count
df = df.T  # transpose back to normal so it doesn't look weird
The output looks like this for me:
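Since the target is a CSV file, the frame built this way can be written out with to_csv. A minimal sketch with a hypothetical two-row frame standing in for the real one:

```python
import pandas as pd

# Hypothetical frame shaped like the one built above.
df = pd.DataFrame([
    {'Instance REGION': 'REGION-1', 'Interface': 'fastethernet01/01'},
    {'Instance REGION': 'REGION-2', 'Interface': 'fastethernet01/01'},
])

# index=False drops the row-number column from the output.
csv_text = df.to_csv(index=False)
print(csv_text)
```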
EDIT: Also, since you have a variety of answers here, I benchmarked mine. It scales as a mild exponential, described by the equation y = 100.97e^(0.0003x).
Here are my timeit results:
Entries Milliseconds
18 49
270 106
1620 394
178420 28400
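The timings above come from timeit; a generic sketch of measuring a parser this way (the parse function and sample input here are placeholders, not the code above):

```python
import timeit

def parse(text):
    # placeholder for whichever parser is being timed
    return [block for block in text.split('\n\n') if block]

# 18 repeated two-field blocks, roughly mimicking the smallest test size
sample = "a: 1\nb: 2\n\nc: 3\nd: 4\n" * 18

# average milliseconds per run over 100 runs
ms = timeit.timeit(lambda: parse(sample), number=100) / 100 * 1000
print(f"{ms:.3f} ms per run")
```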
Or, if you prefer the ugly regex route:
import re

region_re = re.compile(r"^Instance\s+([^:]+):.*")
host_re = re.compile(r"^\s+(.*?)_RV.*")
interface_re = re.compile(r"^\s+Interface:\s+(.*?)\s+")
other_re = re.compile(r"^\s+([^\s]+).*?:\s+([^\s]*){0,1}")

myDict = {}
extra = None
with open('/tmp/t2.txt', 'r') as input_data:
    for line in input_data:
        if extra:  # value on next line from key
            myDict[extra] = line.strip()
            extra = None
            continue
        region = region_re.match(line)
        if region:
            if len(myDict) > 1:
                print(myDict)
            myDict = {'Place': region.group(1)}
            continue
        host = host_re.match(line)
        if host:
            if len(myDict) > 1:
                print(myDict)
            myDict = {'Place': myDict['Place'], 'Host': host.group(1)}
            continue
        interface = interface_re.match(line)
        if interface:
            myDict['INTF'] = interface.group(1)
            continue
        other = other_re.match(line)
        if other:
            groups = other.groups()
            if groups[1]:
                myDict[groups[0]] = groups[1]
            else:
                extra = groups[0]

# dump out the final one
if len(myDict) > 1:
    print(myDict)
Output:
{'Place': 'REGION-1', 'Host': 'ABCD', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01441', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '441', 'IPv4': '1.1.1.1'}
{'Place': 'REGION-1', 'Host': 'EFGH', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01442', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '442', 'IPv4': '1.1.1.2'}
{'Place': 'REGION-2', 'Host': 'IJKL', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01443', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '443', 'IPv4': '1.1.1.3'}
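Since the question asks for CSV output, the printed dicts can instead be collected and written with csv.DictWriter. A hedged sketch (the rows here are a shortened stand-in for the dicts printed above, and the output goes to an in-memory buffer instead of a file):

```python
import csv
import io

# Stand-in for the dicts the parser above prints.
rows = [
    {'Place': 'REGION-1', 'Host': 'ABCD', 'IPv4': '1.1.1.1'},
    {'Place': 'REGION-1', 'Host': 'EFGH', 'IPv4': '1.1.1.2'},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()   # emit the column names once
writer.writerows(rows)
print(buf.getvalue())
```

For a real file, replace the StringIO buffer with open('out.csv', 'w', newline='').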
This doesn't need much regex and can be optimized further. Hope it helps.
import re
import pandas as pd
from collections import defaultdict

_level_1 = re.compile(r'instance region.*', re.IGNORECASE)
with open('stack_formatting.txt') as f:
    data = f.readlines()

"""
Format data so that it could be split easily
"""
data_blocks = defaultdict(lambda: defaultdict(str))
header = None
instance = None
for line in data:
    line = line.strip()
    if _level_1.match(line):
        header = line
    else:
        if "_RV" in line:
            instance = line
        elif not line.endswith(":"):
            data_blocks[header][instance] += line + ";"
        else:
            data_blocks[header][instance] += line
def parse_text(data_blocks):
    """
    Generate a dict which could be converted easily to a pandas dataframe
    :param data_blocks: splittable data
    :return: dict with row values for every column
    """
    final_data = defaultdict(list)
    for key1 in data_blocks.keys():
        for key2 in data_blocks.get(key1):
            final_data['instance'].append(key1)
            final_data['sub_instance'].append(key2)
            for items in data_blocks[key1][key2].split(";"):
                print(items)
                if items.isspace() or len(items) == 0:
                    continue
                a, b = re.split(r':\s*', items)
                final_data[a].append(b)
    return final_data

print(pd.DataFrame(parse_text(data_blocks)))
Comments:

- Welcome to SO! Can you clarify your output format? It looks like fields are changing or being dropped, but the code snippet is too brief to really understand what you are trying to do. Please post the full version.
- Yes, not everything is in a key: value layout… some entries have the key on one line and the value on the next… for example, IPv4 Address(es): with its value 1.1.1.3 on the following line.
- Thanks, always nice without regex!… I'll give it a try.
- How is this working out? When I start increasing the number of entries I get an error. I wanted to compare speeds, but I can't get it to work.
- What error are you getting? It's basic code that runs when all fields are present in all instances. You can customize it to handle the variants.
- Could you explain what this means: data_blocks = defaultdict(lambda: defaultdict(str))
- I wanted a dict of instances and sub-instances to store the related data. You can read more here.
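Regarding the data_blocks = defaultdict(lambda: defaultdict(str)) question: it builds a two-level dict whose inner values default to an empty string, so += works on a key's first touch without any setup. A small illustration using keys shaped like the sample file's:

```python
from collections import defaultdict

data_blocks = defaultdict(lambda: defaultdict(str))

# Missing keys spring into existence with '' as the value,
# so concatenation works immediately at both levels.
data_blocks['Instance REGION-1:']['ABCD_RV'] += 'Interface: fastethernet01/01;'
data_blocks['Instance REGION-1:']['ABCD_RV'] += 'Sysid: 01441;'

print(dict(data_blocks['Instance REGION-1:']))
```

With a plain dict, the same code would raise KeyError on the first +=.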