Restructuring a DataFrame in Python


I have collected the data from the second-to-last worksheet, together with all the data from the last worksheet starting from the 5.5 "years to maturity" column. I have the code below for this. However, I am now restructuring the dataframe so that it has the columns listed further down, and I am struggling to do that:

My code is as follows:

import urllib2
import pandas as pd
import os
import xlrd 

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)

xd = pd.ExcelFile(socket)

#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)

bigdata = df1.append(df2, ignore_index=True)
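(A side note: `DataFrame.append` was deprecated and later removed in newer pandas versions; the equivalent with `pd.concat`, sketched here on toy frames since the actual spreadsheet layout is assumed, would be:)

```python
import pandas as pd

# toy stand-ins for the two parsed sheets
df1 = pd.DataFrame({0: [5.5, 6.0], 1: [0.9, 1.0]})
df2 = pd.DataFrame({0: [0.5, 1.0], 1: [0.3, 0.4]})

# pd.concat([...], ignore_index=True) replaces df1.append(df2, ignore_index=True)
bigdata = pd.concat([df1, df2], ignore_index=True)
```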
Edit: the dataframe currently displays as shown below. Unfortunately, it is quite messy at the moment:

                       0    1   2   3         4         5         6   \
0                     NaN  NaN NaN NaN       NaN       NaN       NaN   
1                     NaN  NaN NaN NaN       NaN       NaN       NaN   
2                Maturity  NaN NaN NaN       NaN       NaN       NaN   
3                  years:  NaN NaN NaN       NaN       NaN       NaN   
4                     NaN  NaN NaN NaN       NaN       NaN       NaN   
5     2005-01-03 00:00:00  NaN NaN NaN       NaN       NaN       NaN   
6     2005-01-04 00:00:00  NaN NaN NaN       NaN       NaN       NaN
...                   ...  ...  ..  ..       ...       ...       ...   
5410  2015-04-20 00:00:00  NaN NaN NaN       NaN  0.367987  0.357069   
5411  2015-04-21 00:00:00  NaN NaN NaN       NaN  0.362478  0.352581
It has 5440 rows and 61 columns.

However, I would like the dataframe to be in the following format:

I believe columns 1, 2, 3, 4, 5 and 6 contain the yield-curve data. However, I am not sure where the data relating to "years to maturity" sits in the current dataframe.

- Date (the 2nd column in the current dataframe)
- Update time (a column that would just hold datetime.datetime.now())
- Currency (a column that would just hold 'GBP')
- Maturity Date
- Yield data from the spreadsheet
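For what it's worth, here is a minimal sketch of building that target layout from a toy wide frame (the frame, dates and yields are all invented for the example; `stack` moves the maturity columns into rows):

```python
import datetime
import pandas as pd

# toy wide frame: dates down the rows, maturities (in years) across the columns
wide = pd.DataFrame(
    {0.5: [0.36, 0.35], 1.0: [0.40, 0.39]},
    index=pd.to_datetime(['2015-04-20', '2015-04-21']),
)
wide.index.name = 'Date'
wide.columns.name = 'Maturity'

# one row per (Date, Maturity) pair
long_df = wide.stack().reset_index(name='Yield')
long_df['Update time'] = datetime.datetime.now()
long_df['Currency'] = 'GBP'
# approximate the maturity date by shifting the observation date forward
long_df['Maturity Date'] = long_df['Date'] + pd.to_timedelta(
    long_df['Maturity'] * 365.25, unit='D'
)
```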

I used the pandas.io.excel.read_excel function to read the xls from the url. Below is one way to clean up this UK yield-curve dataset.

Note: the cubic-spline interpolation performed through the apply function takes quite a while (about 2 minutes on my machine). It works row by row, interpolating from roughly 100 points up to 300 points (2,628 rows in total).
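To show the interpolation step in isolation, here is a minimal sketch on a made-up curve (all maturities and yields are invented for the example; only the `interp1d` call mirrors the full code below):

```python
import numpy as np
from scipy.interpolate import interp1d

# made-up coarse yield curve: yields quoted at a handful of maturities (years)
maturities = np.array([0.5, 1.0, 2.0, 5.0, 10.0, 25.0])
yields = np.array([0.35, 0.40, 0.55, 1.10, 1.80, 2.40])

# cubic spline through the observed points
f = interp1d(maturities, yields, kind='cubic', bounds_error=False)

# dense monthly grid from 6 months out to 25 years
grid = np.arange(6, 12 * 25 + 1) / 12.0
dense = f(grid)
```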

Comments:

"I tried to run your code, but I don't have xlrd installed. I think others would be happier to answer if you created a small dataframe that illustrates the problem, without having to pull content off the internet."

"@AmiTavory I have added more information about the current dataframe and the dataframe I want. Let me know if more information is needed."

"Well done! This is very helpful. Cheers. May I ask how I can convert the years, which currently increase from left to right, into years that increase from top to bottom (i.e., could they be made into a column)? Many thanks."
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'

# check the sheet number, spot: 9/9, short end 7/9
# (note: in recent pandas versions this keyword is sheet_name rather than sheetname)
spot_curve = read_excel(url, sheetname=8)
short_end_spot_curve = read_excel(url, sheetname=6)

# preprocessing spot_curve
# ==============================================
# do a few inspection on the table
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]

# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape

spot_curve.shape
Out[184]: (2715, 40)

# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape

short_end_spot_curve.shape
Out[185]: (2715, 60)

# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)

combined_data.head()
combined_data.tail()
combined_data.shape

# deal with NaN: the most sound approach is fit the non-arbitrage NSS curve
# however, this is not currently supported in python.
# do a cubic spline instead
# ==============================================

# if more than half of the maturity points are NaN, interpolation is likely to be
# unstable, so remove all rows with a NaN count greater than 50
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50

combined_data = combined_data.groupby(level=0).filter(filter_func)
# no. of rows down from 2715 to 2628
combined_data.shape

combined_data.shape
Out[186]: (2628, 100)


from scipy.interpolate import interp1d

# mapping points, monthly frequency, 1 mon to 25 years
maturity = pd.Series((np.arange(12 * 25) + 1) / 12.0)  # float division, safe on Python 2 as well
# do the interpolation day by day
by_day = combined_data.groupby(level=0)

# write out apply function
def interpolate_maturities(group):
    # transpose row vector to column vector and drops all nans
    a = group.T.dropna().reset_index()
    f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
    return pd.Series(maturity.apply(f).values, index=maturity.values)

# this may take a while ... apply provides flexibility but speed is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)

# a quick look on the data
cleaned_spot_curve.iloc[[1,1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')
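On the follow-up comment about turning the left-to-right maturities into a single column: stack does exactly that. A sketch on a toy stand-in for cleaned_spot_curve (the dates and values are invented):

```python
import pandas as pd

# toy stand-in for cleaned_spot_curve: dates in rows, maturities in columns
curve = pd.DataFrame(
    {1 / 12.0: [0.30, 0.31], 0.5: [0.36, 0.37]},
    index=pd.to_datetime(['2015-04-20', '2015-04-21']),
)
curve.index.name = 'date'
curve.columns.name = 'maturity'

# stack() moves the maturity axis into the row index: one (date, maturity) pair per row
long_form = curve.stack().rename('yield').reset_index()
```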