Python: reading n tables from one csv file into separate DataFrames
I have a single .csv file that contains four tables, each of which is a different financial statement for Southwest Airlines covering 1986-2001. I know I could split each table into its own file, but they were originally downloaded as one file. I would like to read each table into its own DataFrame for analysis. Here is a subset of the data:
Balance Sheet
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Cash & cash equivalents 2279861 522995 418819 378511
Short-term investments - - - -
Accounts & other receivables 71283 138070 73448 88799
Inventories of parts... 70561 80564 65152 50035
Income Statement
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Passenger revenues 5378702 5467965 4499360 3963781
Freight revenues 91270 110742 102990 98500
Charter & other - - - -
Special revenue adjustment - - - -
Statement of Retained Earnings
Report Date 12/31/2001 12/31/2000 12/31/1999 12/31/1998
Previous ret earn... 2902007 2385854 2044975 1632115
Cumulative effect of.. - - - -
Three-for-two stock split 117885 - 78076 -
Issuance of common.. 52753 75952 45134 10184
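Each table has 17 columns, the first being the line item description, but the number of rows differs: the balance sheet has 100 rows, for example, while the cash flow statement has 65.
What I did
I have seen similar posts that mention nrows and skiprows. I read the whole file using skiprows and then built the individual financial statements by indexing.
I am looking for comments and constructive criticism on a more Pythonic, best-practice way of creating a DataFrame for each table.
Here is my solution:
My assumption is that each statement starts with an indicator ("Balance Sheet", "Income Statement", "Statement of Retained Earnings"), and that we can split on that indicator to get the individual DataFrames. The code below is built on that premise. Let me know if this is a flawed assumption.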
import pandas as pd
import numpy as np
#i copied your data above and created a csv with it
df = pd.read_csv('csvtable_stackoverflow',header=None)
0
0 Balance Sheet
1 Report Date 12/31/2001 12/31/...
2 Cash & cash equivalents 2279861 522995...
3 Short-term investments - - ...
4 Accounts & other receivables 71283 138070...
5 Inventories of parts... 70561 80564...
6 Income Statement
7 Report Date 12/31/2001 12/31/...
8 Passenger revenues 5378702 546796...
9 Freight revenues 91270 110742...
10 Charter & other - - ...
11 Special revenue adjustment - - ...
12 Statement of Retained Earnings
13 Report Date 12/31/2001 12/31/2...
14 Previous ret earn... 2902007 2385854...
15 Cumulative effect of.. - - ...
16 Three-for-two stock split 117885 - 78076 -
17 Issuance of common.. 52753 75952...
The next step uses numpy select to flag which rows contain the balance sheet, income statement, or cash flow title.
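The definitions of condlist and choicelist are not shown at this point. A minimal sketch of what they might look like, assuming the single text column is named 0 as in the output above and that each table title sits alone on its row. The mask names and the small stand-in frame are hypothetical, used only so the block runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the single-column frame shown above
df = pd.DataFrame({0: [
    "Balance Sheet",
    "Report Date 12/31/2001 12/31/2000",
    "Cash & cash equivalents 2279861 522995",
    "Income Statement",
    "Report Date 12/31/2001 12/31/2000",
    "Passenger revenues 5378702 5467965",
    "Statement of Retained Earnings",
    "Report Date 12/31/2001 12/31/2000",
]})

titles = df[0].str.strip()  # guard against stray whitespace around titles

# One boolean mask per table title (assumed names)
is_bal_sheet = titles == "Balance Sheet"
is_income = titles == "Income Statement"
is_ret_earn = titles == "Statement of Retained Earnings"

# np.select pairs each mask with the label to emit where the mask is True
condlist = [is_bal_sheet, is_income, is_ret_earn]
choicelist = ["Balance Sheet", "Income Statement", "Statement of Retained Earnings"]
```

Rows that match none of the masks receive the default value from np.select, which is what the forward fill in the next step relies on.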
The next piece of code creates a column indicating the sheet type, converts '0' to null, and then forward-fills:
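df = (df.assign(sheet_type = np.select(condlist, choicelist, default='0'))  # explicit string default; newer NumPy may reject the implicit integer 0
        .assign(sheet_type = lambda x: x.sheet_type.replace('0', np.nan))
        .ffill()  # replaces fillna(method='ffill'), which is deprecated in recent pandas
     )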
The last step is to pull out the individual DataFrames:
df_bal_sheet = df.copy().query('sheet_type=="Balance Sheet"')
df_income_sheet = df.copy().query('sheet_type=="Income Statement"')
df_cash_flow = df.copy().query('sheet_type=="Statement of Retained Earnings"')
df_bal_sheet :
0 sheet_type
0 Balance Sheet Balance Sheet
1 Report Date 12/31/2001 12/31/... Balance Sheet
2 Cash & cash equivalents 2279861 522995... Balance Sheet
3 Short-term investments - - ... Balance Sheet
4 Accounts & other receivables 71283 138070... Balance Sheet
5 Inventories of parts... 70561 80564... Balance Sheet
df_income_sheet :
0 sheet_type
6 Income Statement Income Statement
7 Report Date 12/31/2001 12/31/... Income Statement
8 Passenger revenues 5378702 546796... Income Statement
9 Freight revenues 91270 110742... Income Statement
10 Charter & other - - ... Income Statement
11 Special revenue adjustment - - ... Income Statement
df_cash_flow:
0 sheet_type
12 Statement of Retained Earnings Statement of Retained Earnings
13 Report Date 12/31/2001 12/31/2... Statement of Retained Earnings
14 Previous ret earn... 2902007 2385854... Statement of Retained Earnings
15 Cumulative effect of.. - - ... Statement of Retained Earnings
16 Three-for-two stock split 117885 - 78076 - Statement of Retained Earnings
17 Issuance of common.. 52753 75952... Statement of Retained Earnings
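Since the number of tables can vary from file to file, the three query calls could be generalized. A minimal sketch, assuming a sheet_type column built as above; the name tables and the tiny stand-in frame are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the frame after the forward fill above
df = pd.DataFrame({
    0: ["Balance Sheet", "Report Date ...", "Income Statement", "Report Date ..."],
    "sheet_type": ["Balance Sheet", "Balance Sheet",
                   "Income Statement", "Income Statement"],
})

# One DataFrame per distinct sheet_type; sort=False keeps the order of the file
tables = {name: part.reset_index(drop=True)
          for name, part in df.groupby("sheet_type", sort=False)}

print(list(tables))
```

Each statement is then available as tables["Balance Sheet"], tables["Income Statement"], and so on, without hard-coding one variable per table.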
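You can do further processing by fixing the column names and dropping the rows you do not need.
What you want to do is far beyond the capabilities of read_csv. If your input file structure can be modeled as: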
REPEAT:
    Dataframe name
    Header line
    REPEAT:
        Data line
    BLANK LINE OR END OF FILE
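IMHO, the simplest way is to parse the file line by line by hand, feeding a temporary csv file for each dataframe and then loading the dataframe from it. The code could be: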
import os
import tempfile
import pandas as pd

df = {}        # dictionary of dataframes

def process(tmp, df_name):
    '''Process the temporary file corresponding to one dataframe'''
    # print("Process", df_name, tmp.name)  # uncomment for debugging
    if tmp is not None:
        tmp.close()
        df[df_name] = pd.read_csv(tmp.name)
        os.remove(tmp.name)          # do not forget to remove the temp file

with open('LUV.csv') as file:
    df_name = "NONAME"               # should never be in resulting dict...
    tmp = None
    for line in file:
        # print(line)                # uncomment for debugging
        if len(line.strip()) == 0:   # close temp file on empty line
            process(tmp, df_name)    # and process it
            tmp = None
        elif tmp is None:            # a new part: store the name
            df_name = line.strip()
            tmp = tempfile.NamedTemporaryFile("w", delete=False)
        else:
            tmp.write(line)          # just feed the temp file
    # process the last part if no empty line was present...
    process(tmp, df_name)
This is not very efficient, because every line is written to a temporary file and then read back again, but it is simple and robust.
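If the temporary-file round trip is a concern, the same line-by-line idea can be kept in memory. A rough sketch, assuming the same layout (title line, header line, data lines, blank line between tables) and relying on pandas.read_csv accepting a file-like object. The function name split_tables and the sample text are made up for illustration:

```python
import io
import pandas as pd

def split_tables(lines):
    """Return {table name: DataFrame} from blank-line separated tables."""
    tables, name, buf = {}, None, []

    def flush():
        # Parse the buffered header + data lines, if there are any
        if name is not None and buf:
            tables[name] = pd.read_csv(io.StringIO("".join(buf)))

    for line in lines:
        if not line.strip():          # blank line closes the current table
            flush()
            name, buf = None, []
        elif name is None:            # first line of a part is its title
            name = line.strip()
        else:                         # header and data lines
            buf.append(line if line.endswith("\n") else line + "\n")
    flush()                           # last table without a trailing blank line
    return tables

# Made-up sample in the assumed layout
sample = """Balance Sheet
Report Date,12/31/2001,12/31/2000
Inventories,70561,80564

Income Statement
Report Date,12/31/2001,12/31/2000
Freight revenues,91270,110742
"""
tables = split_tables(io.StringIO(sample))
```

With a real file, the same function could be called as split_tables(file) inside a with open(...) block, since it only needs an iterable of lines.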
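A possible improvement would be to parse the parts with the csv module first (it can parse any iterable of lines, whereas pandas wants a file path or file-like object). The downside is that the csv module only parses into strings, so you lose pandas' automatic conversion to numbers. In my opinion it is only worth it if the file is large and the whole operation has to be repeated.
Do you already know the starting row number of each table?
What is the delimiter in the csv file? Can you post it as raw text?
You could create a function that splits this file into separate files, and then read them normally. It looks like you could use the blank lines to identify the end of a table.
@AMC Yes, I know the starting row number of each one. This is not a standard format, and it may differ each time I download the financial reports of a different company.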