Python: restructuring a DataFrame
I have a DataFrame with 262,800 rows and 3 columns. It currently looks like this:
Currency Maturity value
0 GBP 0.08333333 4.709456
1 GBP 0.08333333 4.713099
2 GBP 0.08333333 4.707237
3 GBP 0.08333333 4.705043
4 GBP 0.08333333 4.697150
5 GBP 0.08333333 4.710647
6 GBP 0.08333333 4.701150
7 GBP 0.08333333 4.694639
8 GBP 0.08333333 4.686111
9 GBP 0.08333333 4.714750
......
262770 GBP 25 2.432869
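I would like the DataFrame to be in the format shown below. I have taken some steps toward this, including using melt in the code below, but for some reason I lose the Date column and end up with the DataFrame above. I am not sure how to get the Date column back and obtain the following DataFrame: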
Maturity Date Currency Yield_pct
0 0.08333333 2005-01-04 GBP 4.709456
1 0.08333333 2005-01-05 GBP 4.713099
2 0.08333333 2005-01-06 GBP 4.707237
....
9 25 2005-01-04 GBP 2.432869
My code is as follows:
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
# check the sheet number, spot: 9/9, short end 7/9
# note: recent pandas versions spell this keyword sheet_name (older ones used sheetname)
spot_curve = read_excel(url, sheet_name=8)
short_end_spot_curve = read_excel(url, sheet_name=6)
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'Maturity'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'Maturity'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
idx = 0
values = ['GBP'] * len(combined_data.index)
combined_data.insert(idx, 'Currency', values)
# print(combined_data.columns.values)
# I had to do the melt. I melted on 'Currency' arbitrarily: for some reason, printing
# combined_data.columns.values shows that 'Currency' corresponds to 0.08333333, etc.
combined_data = pd.melt(combined_data, id_vars=['Currency'])
print(combined_data)
Can't you just add the currency identifier after the melt?
# Copy up to this stage
combined_data = combined_data.groupby(level=0).filter(filter_func)
# My code from here
combined_data.reset_index(inplace=True, drop=False)
combined_data.rename(columns={'index': 'Date'}, inplace=True)
# This line assumes you want datetime, ignore if you don't
combined_data['Date'] = pd.to_datetime(combined_data['Date'])
result = pd.melt(combined_data, id_vars=['Date'])
result['Currency'] = 'GBP'
[Output of result.head()]
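As a rough illustration of why the dates disappeared in the question: pd.melt does not carry the index into its output by default, so a Date held only in the index is dropped unless it is first moved into a column. The sketch below uses a tiny made-up frame named wide (an assumption standing in for combined_data, with invented yields) and passes var_name explicitly so the column label does not depend on columns.name.

```python
import pandas as pd

# Toy stand-in for combined_data: dates in the index, maturities as columns.
# The numbers are invented for illustration only.
wide = pd.DataFrame(
    {0.5: [4.55, 4.56], 1.0: [4.46, 4.47]},
    index=pd.to_datetime(["2005-01-04", "2005-01-05"]),
)

# Melting directly discards the index, so the dates are lost.
lost = pd.melt(wide, var_name="Maturity")
print(lost.columns.tolist())

# Moving the index into a column first lets melt keep it as an id column.
flat = wide.reset_index().rename(columns={"index": "Date"})
kept = pd.melt(flat, id_vars=["Date"], var_name="Maturity")
kept["Currency"] = "GBP"  # constant column added after the melt
print(kept)
```

Adding Currency after the melt, as the answer suggests, also avoids having to list it among the id columns.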
After first resetting the index so that it includes Currency, try stacking the result:
cd = combined_data.reset_index().set_index(['index', 'Currency'])
cd_new = cd.stack()
>>> cd_new
index Currency Maturity
2005-01-04 GBP 0.083333 4.709456
0.166667 4.633861
0.250000 4.586271
0.333333 4.567017
0.416667 4.559578
0.500000 4.553227
0.583333 4.543976
0.666667 4.530881
0.750000 4.514742
0.833333 4.497187
0.916667 4.479690
1.000000 4.463105
1.083333 4.447843
1.166667 4.434076
1.250000 4.421868
...
2015-05-29 GBP 18.0 2.453898
18.5 2.475052
19.0 2.494679
19.5 2.512787
20.0 2.529393
20.5 2.544519
21.0 2.558198
21.5 2.570467
22.0 2.581368
22.5 2.590947
23.0 2.599250
23.5 2.606327
24.0 2.612229
24.5 2.617008
25.0 2.620715
Length: 259457, dtype: float64
cd_new.xs('2015-05-29')
Currency Maturity
GBP 0.333333 0.452339
0.416667 0.441134
0.500000 0.430168
0.583333 0.419990
0.666667 0.411208
0.750000 0.404424
0.833333 0.400017
0.916667 0.398140
1.000000 0.398806
1.083333 0.401943
1.166667 0.407427
1.250000 0.415095
1.333333 0.424762
1.416667 0.436233
1.500000 0.449322
...
GBP 18.0 2.453898
18.5 2.475052
19.0 2.494679
19.5 2.512787
20.0 2.529393
20.5 2.544519
21.0 2.558198
21.5 2.570467
22.0 2.581368
22.5 2.590947
23.0 2.599250
23.5 2.606327
24.0 2.612229
24.5 2.617008
25.0 2.620715
Length: 97, dtype: float64
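If a flat table is wanted rather than a Series with a MultiIndex, the stacked result can be turned back into columns with reset_index. The following is a minimal sketch on made-up data; the names wide, stacked and tidy are hypothetical, and the index level names are set explicitly as an assumption rather than taken from the real spreadsheet.

```python
import pandas as pd

# Toy stand-in for combined_data with invented yields.
wide = pd.DataFrame(
    {0.5: [4.55, 4.56], 1.0: [4.46, 4.47]},
    index=pd.to_datetime(["2005-01-04", "2005-01-05"]),
)
wide["Currency"] = "GBP"

# Put Currency into the index next to the dates, then stack the maturity columns.
stacked = wide.set_index("Currency", append=True).stack()
stacked.index.names = ["Date", "Currency", "Maturity"]

# Name the values and flatten the MultiIndex back into ordinary columns.
tidy = stacked.rename("Yield_pct").reset_index()
print(tidy)
```

Compared with melt, this route keeps the date in the index throughout, so nothing has to be restored afterwards.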
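Great. May I ask one more question? Is there a way to change the column name value to Yield_pct?
Sure. I personally like using a dictionary, because it makes it easy to see what the previous name was: result.rename(columns={'value': 'Yield_pct'}, inplace=True)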
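For completeness, here is a small sketch of the rename followed by a column reorder to match the layout requested in the question. The frame result is rebuilt by hand here from the first two rows of the desired output, so it is only an assumed stand-in for the real melted data.

```python
import pandas as pd

# Hand-built stand-in for the melted result (first two rows of the desired output).
result = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-04", "2005-01-05"]),
    "Maturity": [0.08333333, 0.08333333],
    "value": [4.709456, 4.713099],
    "Currency": ["GBP", "GBP"],
})

# Rename with a dict mapping old name -> new name.
result = result.rename(columns={"value": "Yield_pct"})

# Reorder the columns into the layout asked for in the question.
result = result[["Maturity", "Date", "Currency", "Yield_pct"]]
print(result)
```

Another option is to pass value_name='Yield_pct' to pd.melt, which names the value column at the melt step and makes the later rename unnecessary.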