Python: reading n tables from one CSV file into separate DataFrames

python pandas file csv dataframe

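I have a single .csv file containing four tables, each a different financial statement for Southwest Airlines covering 1986 to 2001. I know I could split each table into its own file, but they were originally downloaded as one file.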

I would like to read each table into its own DataFrame for analysis. Here is a subset of the data:

Balance Sheet               
Report Date               12/31/2001    12/31/2000  12/31/1999  12/31/1998
Cash & cash equivalents   2279861       522995      418819      378511
Short-term investments    -             -           -            -
Accounts & other receivables    71283   138070      73448       88799
Inventories of parts...   70561          80564        65152     50035

Income Statement                
Report Date               12/31/2001    12/31/2000  12/31/1999  12/31/1998
Passenger revenues        5378702       5467965     4499360     3963781
Freight revenues          91270         110742      102990      98500
Charter & other           -              -           -           -
Special revenue adjustment  -            -           -           -

Statement of Retained Earnings              
Report Date              12/31/2001    12/31/2000   12/31/1999  12/31/1998
Previous ret earn...     2902007       2385854      2044975     1632115
Cumulative effect of..    -              -            -          -
Three-for-two stock split   117885  -   78076   -
Issuance of common..     52753           75952       45134       10184

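Each table has 17 columns, the first being the line-item description, but the number of rows differs: the balance sheet has 100 rows, for example, while the cash flow statement has 65.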

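What I have done: I have seen similar posts that mention nrows and skiprows. I used skiprows to read the whole file and then created the individual financial statements by indexing.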
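For reference, the skiprows/nrows approach mentioned above might look like the sketch below. The miniature comma-separated text and the row counts are made up for illustration; on the real file the offsets would have to be looked up by hand, which is exactly the weakness of this approach.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: two tables separated by a blank line.
text = (
    "Balance Sheet\n"
    "Report Date,12/31/2001,12/31/2000\n"
    "Cash & cash equivalents,2279861,522995\n"
    "Inventories,70561,80564\n"
    "\n"
    "Income Statement\n"
    "Report Date,12/31/2001,12/31/2000\n"
    "Passenger revenues,5378702,5467965\n"
)

# skiprows=1 skips the title line; nrows=2 stops after two data rows.
balance = pd.read_csv(io.StringIO(text), skiprows=1, nrows=2, index_col=0)

# The second header sits on file line 6 (0-based), so skip file lines 0-5.
income = pd.read_csv(io.StringIO(text), skiprows=6, nrows=1, index_col=0)

print(balance)
print(income)
```

With a real file, pass the path instead of the io.StringIO object.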

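I am looking for comments and constructive criticism on how to create a DataFrame for each table in a more Pythonic style that follows best practices.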

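Here is my solution. My assumption is that each statement starts with an indicator ("Balance Sheet", "Income Statement", "Statement of Retained Earnings"), and that we can split the table on that indicator to get the individual DataFrames. The code below is based on that premise. Let me know if it is a flawed assumption.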

import pandas as pd
import numpy as np

#i copied your data above and created a csv with it

df = pd.read_csv('csvtable_stackoverflow',header=None)

        0
0   Balance Sheet
1   Report Date 12/31/2001 12/31/...
2   Cash & cash equivalents 2279861 522995...
3   Short-term investments - - ...
4   Accounts & other receivables 71283 138070...
5   Inventories of parts... 70561 80564...
6   Income Statement
7   Report Date 12/31/2001 12/31/...
8   Passenger revenues 5378702 546796...
9   Freight revenues 91270 110742...
10  Charter & other - - ...
11  Special revenue adjustment - - ...
12  Statement of Retained Earnings
13  Report Date 12/31/2001 12/31/2...
14  Previous ret earn... 2902007 2385854...
15  Cumulative effect of.. - - ...
16  Three-for-two stock split 117885 - 78076 -
17  Issuance of common.. 52753 75952...

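The first step uses numpy.select to flag which rows hold a statement title (Balance Sheet, Income Statement or cash flow); condlist and choicelist are the conditions and the labels passed to it.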
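The definitions of condlist and choicelist are not shown in this post. A minimal sketch of how they might be built, assuming the title is matched exactly against the stripped first column; `make_conditions` and `demo` are hypothetical names introduced only for this illustration:

```python
import numpy as np
import pandas as pd

choicelist = ["Balance Sheet", "Income Statement", "Statement of Retained Earnings"]

def make_conditions(frame, titles):
    """Return one boolean mask per title, True where column 0 equals that title."""
    first_col = frame[0].str.strip()
    return [first_col == title for title in titles]

# Hypothetical stand-in for the single-column frame shown above.
demo = pd.DataFrame({0: ["Balance Sheet", "Report Date ...", "Cash ...",
                         "Income Statement", "Report Date ...", "Passenger revenues ..."]})

condlist = make_conditions(demo, choicelist)   # pass df instead of demo on the real data

# Rows that match no title get the placeholder string '0'.
labels = np.select(condlist, choicelist, default="0")
print(labels)
```

Passing a string default keeps the labels and the placeholder in one dtype; recent NumPy versions may refuse to mix string choices with the integer default of 0.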

The next piece of code creates a column indicating the sheet type, converts '0' to null, and then forward-fills:

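df = (df.assign(sheet_type = np.select(condlist, choicelist, default='0'))  # '0' marks non-title rows
      .assign(sheet_type = lambda x: x.sheet_type.replace('0', np.nan))
      .ffill()   # same effect as fillna(method='ffill'), which is deprecated in recent pandas
      )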

The final step is to pull out the individual DataFrames:

df_bal_sheet = df.copy().query('sheet_type=="Balance Sheet"')
df_income_sheet = df.copy().query('sheet_type=="Income Statement"')
df_cash_flow = df.copy().query('sheet_type=="Statement of Retained Earnings"')

df_bal_sheet :     
         0                                            sheet_type
0   Balance Sheet                                    Balance Sheet
1   Report Date 12/31/2001 12/31/...                 Balance Sheet
2   Cash & cash equivalents 2279861 522995...        Balance Sheet
3   Short-term investments - - ...                   Balance Sheet
4   Accounts & other receivables 71283 138070...     Balance Sheet
5   Inventories of parts... 70561 80564...           Balance Sheet

df_income_sheet : 
           0                                     sheet_type
6   Income Statement                           Income Statement
7   Report Date 12/31/2001 12/31/...           Income Statement
8   Passenger revenues 5378702 546796...       Income Statement
9   Freight revenues 91270 110742...           Income Statement
10  Charter & other - - ...                    Income Statement
11  Special revenue adjustment - - ...         Income Statement

df_cash_flow:
              0                                         sheet_type
12  Statement of Retained Earnings              Statement of Retained Earnings
13  Report Date 12/31/2001 12/31/2...           Statement of Retained Earnings
14  Previous ret earn... 2902007 2385854...     Statement of Retained Earnings
15  Cumulative effect of.. - - ...              Statement of Retained Earnings
16  Three-for-two stock split 117885 - 78076 -  Statement of Retained Earnings
17  Issuance of common.. 52753 75952...         Statement of Retained Earnings

You can then do further processing by fixing the column names and dropping the rows you do not need.
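One way that cleanup might look, as a rough sketch only. It assumes each section still holds whole lines in column 0, as in the output above, with whitespace-separated values and exactly n_values value columns (4 in the sample subset; the real file would need 16). `tidy_section` and `raw` are hypothetical names.

```python
import pandas as pd

def tidy_section(section, n_values=4):
    """Turn one raw section (title row, header row, data rows) into a numeric frame."""
    lines = section[0].str.strip().iloc[1:]             # drop the title row
    parts = lines.str.rsplit(n=n_values, expand=True)   # label + n_values fields
    header = list(parts.iloc[0])                        # 'Report Date' plus the dates
    body = parts.iloc[1:].copy()
    body.columns = header
    body = body.set_index(header[0])
    # '-' placeholders cannot be parsed as numbers, so they become NaN.
    return body.apply(pd.to_numeric, errors="coerce")

# Hypothetical miniature of df_bal_sheet.
raw = pd.DataFrame({0: [
    "Balance Sheet",
    "Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998",
    "Cash & cash equivalents 2279861 522995 418819 378511",
    "Short-term investments - - - -",
]})
tidy = tidy_section(raw)
print(tidy)
```

Splitting from the right keeps multi-word labels such as "Cash & cash equivalents" in one piece, at the cost of having to know the number of value columns in advance.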

What you want to do is far beyond the capabilities of read_csv. If the structure of your input file can be modeled as:

REPEAT:
    DataFrame name
    header line
    REPEAT:
        data line
    blank line or end of file

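then IMHO the simplest way is to parse the file by hand, line by line, feeding a temporary csv file for each DataFrame and then loading the DataFrame from it. The code could be: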

import os
import tempfile

import pandas as pd

df = {}        # dictionary of dataframes, keyed by table name

def process(tmp, df_name):
    '''Process the temporary file corresponding to one dataframe'''
    # print("Process", df_name, tmp.name)  # uncomment for debugging
    if tmp is not None:
        tmp.close()
        df[df_name] = pd.read_csv(tmp.name)
        os.remove(tmp.name)                # do not forget to remove the temp file

with open('LUV.csv') as file:
    df_name = "NONAME"                     # should never be in resulting dict...
    tmp = None
    for line in file:
        # print(line)                      # uncomment for debugging
        if len(line.strip()) == 0:         # close temp file on empty line
            process(tmp, df_name)          # and process it
            tmp = None
        elif tmp is None:                  # a new part: store the name
            df_name = line.strip()
            tmp = tempfile.NamedTemporaryFile("w", delete=False)
        else:
            tmp.write(line)                # just feed the temp file

    # process the last part if no empty line was present...
    process(tmp, df_name)

This is not very efficient, because every line is written to a temporary file and then read back, but it is simple and robust.

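A possible improvement would be to parse the parts with the csv module first (it can parse a stream, whereas pandas wants a file). The downside is that the csv module only parses into strings, so you lose pandas' automatic conversion to numbers. In my opinion this is only worth doing if the file is large and the whole operation has to be repeated.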
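A variant that may be worth considering, offered as a sketch rather than as part of the answer above: pd.read_csv also accepts file-like objects, so each section can be collected in memory and passed through io.StringIO, which avoids the temporary files while keeping the numeric conversion. `read_sections` is a hypothetical name, and the sketch assumes the same layout as above (name line, header line, data lines, truly empty separator line).

```python
import io

import pandas as pd

def read_sections(lines):
    """Split an iterable of text lines into a dict {section name: DataFrame}."""
    frames = {}
    name, buffer = None, []

    def flush():
        # Parse the lines collected so far, if any, as one CSV table.
        if name is not None and buffer:
            frames[name] = pd.read_csv(io.StringIO("".join(buffer)))

    for line in lines:
        if not line.strip():           # blank line closes the current section
            flush()
            name, buffer = None, []
        elif name is None:             # first non-blank line names the section
            name = line.strip()
        else:                          # header or data line
            buffer.append(line if line.endswith("\n") else line + "\n")
    flush()                            # last section may lack a trailing blank line
    return frames

# Usage on a real file (path taken from the answer above):
# with open('LUV.csv') as file:
#     frames = read_sections(file)
```

If the export pads title or separator lines with commas (for example "Balance Sheet,,,"), the name and blank-line checks would need to strip those commas first.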

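Do you already know the starting line number of each table?

What is the delimiter in the csv file? Can you post it as raw text?

You could write a function that splits this file into separate files, which you can then read normally. It looks like you can use the empty lines to identify the end of each table.

@AMC Yes, I know the starting line number of each one. This is not a standard format, and the layout may differ each time I download the financial reports of a different company.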
