Python: odd .txt report into a DataFrame
I have a .txt report containing account numbers, addresses and credit limits. The report has page breaks, but it generally looks like this:
Customer Address Credit Limit
A001 Wendy 20000
123 Main Street
City, State
Zip code
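I want my dataframe to look like this:
Customer Address Credit Limit
A001 Wendy 123 Main Street, New York City, State 20000
Below is a link to the sample csv I am working with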
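I tried skipping rows, but it did not work.

OK, this format is not difficult to handle, but it is not csv. So neither the Python csv module nor pandas read_csv can be used. We will have to parse it by hand.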
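The hardest decision is how to identify the first and last line of each customer. I will use:
- a first line starts with a word containing only uppercase letters and digits, ends with a word containing only digits, and is longer than 100 characters
- a block ends at the first empty line
Once that is done:
- the first line contains the account number, the name, the first line of the address and the account limit
- subsequent lines contain the other lines of the address
- fields are at fixed positions: [5,19], [23,49], [57,77], [90, end of line)
In Python, that gives: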
    import re
    import pandas as pd

    file = 'report.txt'  # placeholder: path to the .txt report
    fieldpos = [(5,19), (23,49), (57,77), (90, -1)]  # position of fields in the initial line (-1 drops the newline)
    inblock = False  # we do not start inside a block
    account_pat = re.compile(r'[A-Z]+\d+\s*$')  # regex patterns are compiled once for performance
    limit_pat = re.compile(r'\s*\d+$')
    data = []  # a list for the accounts
    with open(file) as fd:
        for line in fd:
            if not inblock:
                # only a long line can be the first line of a block
                if len(line) > 100:
                    row = [line[f[0]:f[1]].strip() for f in fieldpos]
                    if account_pat.match(row[0]) and limit_pat.match(row[-1]):
                        inblock = True
                        data.append(row)
            else:
                line = line.strip()
                if len(line) > 0:
                    row[2] += ', ' + line  # continuation of the address
                else:
                    inblock = False  # an empty line ends the block
    # we can now build a dataframe
    df = pd.DataFrame(data, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])
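To try the same logic without the real report, it may help to wrap it in a function that accepts any iterable of lines. The sketch below is only an illustration under stated assumptions: `parse_report`, `first_line` and the sample text are hypothetical names and made-up data laid out to match the field positions above, and the last field uses `(90, None)` instead of `(90, -1)` so that a line without a trailing newline does not lose its last digit.

```python
import io
import re

# Field positions from the answer; None means "up to the end of the line".
FIELDPOS = [(5, 19), (23, 49), (57, 77), (90, None)]
ACCOUNT_PAT = re.compile(r'[A-Z]+\d+$')   # e.g. "A001"
LIMIT_PAT = re.compile(r'\d+$')           # e.g. "20000"

def parse_report(lines):
    """Collect one [account, name, address, limit] row per customer block."""
    data = []
    row = None                       # None means "not inside a block"
    for line in lines:
        if row is None:
            if len(line) > 100:      # only long lines can open a block
                fields = [line[a:b].strip() for a, b in FIELDPOS]
                if ACCOUNT_PAT.match(fields[0]) and LIMIT_PAT.match(fields[-1]):
                    row = fields
                    data.append(row)
        else:
            extra = line.strip()
            if extra:
                row[2] += ', ' + extra   # continuation line of the address
            else:
                row = None               # an empty line closes the block
    return data

def first_line(account, name, address, limit):
    """Build a made-up first line that fits the fixed-width layout."""
    return (' ' * 5 + account.ljust(18) + name.ljust(34)
            + address.ljust(33) + limit.rjust(15) + '\n')

# Hypothetical report text: a short header line, two blocks, one blank line.
text = (
    'Customer report, page 1\n'      # too short to open a block, so skipped
    + first_line('A001', 'Dan Ackroyd', 'Audenshaw', '20000')
    + ' ' * 57 + '125 New Street\n'
    + ' ' * 57 + 'Montreal\n'
    + '\n'
    + first_line('A128', 'Ivan Aker', 'The Old House', '10000')
    + ' ' * 57 + 'Ottawa\n'
)
rows = parse_report(io.StringIO(text))
```

The returned list can be passed to `pd.DataFrame` exactly as above. Note that once a block is open, lines are never tested as first lines, so two blocks that are not separated by an empty line would be merged into one.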
This finally gives:
    Account Number  Name            Address                                            Credit Limit
 0  A001            Dan Ackroyd     Audenshaw, 125 New Street, Montreal, Quebec, H...         20000
 1  A123            Mike Atsil      The Vetinary House, 123 Dog Row, Thunder Bay, ...         20000
 2  A128            Ivan Aker       The Old House, Ottawa, Ontario, P1D 8D4                   10000
 3  B001            Kim Basinger    Mesh House, Fish Street, Rouyn, Quebec, J5V 2A9           12000
 4  B002            Richard Burton  Eagle Castle, Leafy Lane, Sudbury, Ontario, L3...          9000
 5  B004            Jeff Bridges    Arrow Road North, Lakeside, Kenora, Ontario, N...         20000
 6  B008            Denise Bent     The Dance Studio, Covent Garden, Montreal, Que...         20000
 7  B010            Carter Bout     Removals Close, No Fixed Abode Road, Toronto, ...         20000
 8  B022            Ronnie Biggs    Gotaway Cottage, Thunder Bay, Ontario, K3A 6F3             5000
 9  C001            Tom Cruise      The Firm, Gunnersbury, Waskaganish, Quebec, G1...         25000
10  C003            John Candy      The Sweet Shop, High Street, Trois Rivieres, Q...         15000
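Because every field is sliced out of a text line, the Credit Limit column holds strings, not numbers. If the limits need to be summed or compared, one option is `pd.to_numeric`. The sketch below is a small illustration, not part of the original answer: it rebuilds a two-row frame from rows shown in the output above, and `total` and `over_8000` are hypothetical names.

```python
import pandas as pd

# Two rows copied from the output above; every parsed field is still a string.
data = [
    ['A128', 'Ivan Aker', 'The Old House, Ottawa, Ontario, P1D 8D4', '10000'],
    ['B022', 'Ronnie Biggs', 'Gotaway Cottage, Thunder Bay, Ontario, K3A 6F3', '5000'],
]
df = pd.DataFrame(data, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])

# Convert the limit to a numeric column so arithmetic and comparisons work.
df['Credit Limit'] = pd.to_numeric(df['Credit Limit'])

total = df['Credit Limit'].sum()
over_8000 = df.loc[df['Credit Limit'] > 8000, 'Account Number'].tolist()
```

The addresses in the printed output are only truncated for display; the full strings are still stored in the frame.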