Python: odd .txt report into a DataFrame

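I have a .txt report containing account numbers, addresses and credit limits.

It has page breaks, but in general it looks like this:

Customer Address Credit Limit
A001 Wendy 20000
123 Main Street
City, State
Zip Code

I want my DataFrame to look like this:

Customer Address Credit Limit
A001 Wendy 123 Main Street, New York City, State 20000

Below is a link to the sample csv I am working with.


I tried skipping rows, but that did not work.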

OK, there is nothing difficult about this format, but it is not csv. So neither the Python csv module nor pandas read_csv can be used. We will have to parse it by hand.

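The trickiest decision is identifying the first and last line of each customer. I will use:

  • the first line starts with a word containing only uppercase letters and digits, ends with a word containing only digits, and is more than 100 characters long
  • a block ends at the first empty line
Once this is done:

  • the first line contains the account number, the name, the first line of the address and the credit limit
  • subsequent lines contain the additional lines of the address
  • the fields are at fixed positions: [5,19], [23,49], [57,77], [90, end of line)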
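To make these rules concrete, here is a minimal sketch that applies them to a single line. The helper name split_first_line, the two patterns and the sample line are illustrative assumptions, not part of the original report; only the column positions come from the list above.

```python
import re

# Field positions from the list above; None means "up to the end of the line".
FIELD_POS = [(5, 19), (23, 49), (57, 77), (90, None)]

ACCOUNT_PAT = re.compile(r'[A-Z0-9]+')   # uppercase letters and digits only
LIMIT_PAT = re.compile(r'\d+')           # digits only


def split_first_line(line):
    """Return the four fields of a customer's first line, or None if the line does not qualify."""
    if len(line) <= 100:                 # first lines are more than 100 characters long
        return None
    fields = [line[start:end].strip() for start, end in FIELD_POS]
    if ACCOUNT_PAT.fullmatch(fields[0]) and LIMIT_PAT.fullmatch(fields[-1]):
        return fields
    return None


# Made-up line, padded so that each value starts at its documented column.
sample = ((' ' * 5 + 'A001').ljust(23)
          + 'Wendy'.ljust(34)
          + '123 Main Street'.ljust(33)
          + '20000'.rjust(15))

print(split_first_line(sample))
```

A page header or a column title line fails one of the two patterns, so it is ignored even when it is long enough.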
In Python, this gives:

import re
import pandas as pd

file = 'report.txt'                              # placeholder path, replace with your report

# position of fields in the initial line; None runs to the end of the line,
# so the last digit is kept even when the final line has no trailing newline
fieldpos = [(5,19), (23,49), (57,77), (90, None)]

inblock = False                                  # we do not start inside a block

account_pat = re.compile(r'[A-Z]+\d+\s*$')       # regex patterns are compiled once for performance
limit_pat = re.compile(r'\s*\d+$')

data = []                                        # a list for the accounts

with open(file) as fd:
    for line in fd:
        if not inblock:
            if len(line) > 100:                  # only long lines can start a customer block
                row = [line[f[0]:f[1]].strip() for f in fieldpos]
                if account_pat.match(row[0]) and limit_pat.match(row[-1]):
                    inblock = True
                    data.append(row)
        else:
            line = line.strip()
            if len(line) > 0:
                row[2] += ', ' + line            # continuation line: extend the address
            else:
                inblock = False                  # an empty line ends the block

# we can now build a dataframe
df = pd.DataFrame(data, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])
It finally gives:

   Account Number                 Name                                            Address Credit Limit
0            A001          Dan Ackroyd  Audenshaw, 125 New Street, Montreal, Quebec, H...        20000
1            A123           Mike Atsil  The Vetinary House, 123 Dog Row, Thunder Bay, ...        20000
2            A128            Ivan Aker            The Old House, Ottawa, Ontario, P1D 8D4        10000
3            B001         Kim Basinger    Mesh House, Fish Street, Rouyn, Quebec, J5V 2A9        12000
4            B002       Richard Burton  Eagle Castle, Leafy Lane, Sudbury, Ontario, L3...         9000
5            B004         Jeff Bridges  Arrow Road North, Lakeside, Kenora, Ontario, N...        20000
6            B008          Denise Bent  The Dance Studio, Covent Garden, Montreal, Que...        20000
7            B010          Carter Bout  Removals Close, No Fixed Abode Road, Toronto, ...        20000
8            B022         Ronnie Biggs     Gotaway Cottage, Thunder Bay, Ontario, K3A 6F3         5000
9            C001           Tom Cruise  The Firm, Gunnersbury, Waskaganish, Quebec, G1...        25000
10           C003           John Candy  The Sweet Shop, High Street, Trois Rivieres, Q...        15000
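Two things in this output may be worth a follow-up: the parser stores the credit limit as text, and the default display cuts long addresses short. The sketch below shows one possible way to handle both. The two rows are made-up stand-ins for the parsed data list, not rows from the real report.

```python
import pandas as pd

# Two made-up rows standing in for the parsed `data` list.
data = [
    ['A001', 'Wendy', '123 Main Street, City, State, Zip Code', '20000'],
    ['A002', 'Sam', '9 Side Road, City, State, Zip Code', '9000'],
]
df = pd.DataFrame(data, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])

# Every parsed field is a string; convert the limit so it can be summed or compared.
df['Credit Limit'] = pd.to_numeric(df['Credit Limit'])

# Render the frame without truncating the long Address column.
with pd.option_context('display.max_colwidth', None):
    print(df.to_string())
```

pd.to_numeric raises an error if a value is not a valid number, which also works as a quick check that the fixed column positions match the report.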
