Python: Reshaping a DataFrame (pandas)

python pandas

I have a DataFrame with 262,800 rows and 3 columns. It currently looks like this:

       Currency    Maturity     value
0           GBP  0.08333333  4.709456
1           GBP  0.08333333  4.713099
2           GBP  0.08333333  4.707237
3           GBP  0.08333333  4.705043
4           GBP  0.08333333  4.697150
5           GBP  0.08333333  4.710647
6           GBP  0.08333333  4.701150
7           GBP  0.08333333  4.694639
8           GBP  0.08333333  4.686111
9           GBP  0.08333333  4.714750
......
262770      GBP          25  2.432869

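I would like the DataFrame to be in the format shown below. I have taken some steps toward this, including using melt in the code below, but for some reason I lose the Date column and end up with the DataFrame above. I am not sure how to get the Date column back and produce the following DataFrame: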

   Maturity     Date            Currency  Yield_pct
0  0.08333333   2005-01-04      GBP       4.709456              
1  0.08333333   2005-01-05      GBP       4.713099               
2  0.08333333   2005-01-06      GBP       4.707237
....
9  25           2005-01-04      GBP       2.432869

My code is as follows:

from pandas.io.excel import read_excel
import pandas as pd
import numpy as np

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'

# check the sheet number, spot: 9/9, short end 7/9
# note: recent pandas versions spell this keyword sheet_name (formerly sheetname)
spot_curve = read_excel(url, sheet_name=8)
short_end_spot_curve = read_excel(url, sheet_name=6)

# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'Maturity'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]


short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'Maturity'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]

# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)

def filter_func(group):
    return group.isnull().sum(axis=1) <= 50

combined_data = combined_data.groupby(level=0).filter(filter_func)

idx = 0
values = ['GBP'] * len(combined_data.index)
combined_data.insert(idx, 'Currency', values) 

#print combined_data.columns.values

# I had to do the melt. I melted on 'Currency' arbitrarily: for some reason, when I
# print combined_data.columns.values I see 'Currency' alongside 0.08333333, etc.
combined_data = pd.melt(combined_data, id_vars=['Currency'])
print(combined_data)

Can't you add the currency identifier after the melt?

# Copy up to this stage
combined_data = combined_data.groupby(level=0).filter(filter_func)

# My code from here
combined_data.reset_index(inplace=True, drop=False)
combined_data.rename(columns={'index': 'Date'}, inplace=True)

# This line assumes you want datetime, ignore if you don't
combined_data['Date'] = pd.to_datetime(combined_data['Date'])

result = pd.melt(combined_data, id_vars=['Date'])

result['Currency'] = 'GBP'
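
To see the shape this produces without downloading the spreadsheet, here is a minimal sketch of the same steps on a tiny made-up frame. The name `wide`, the name `tidy` and the numbers are assumptions for illustration, not the Bank of England data; `var_name` is passed explicitly so the result does not depend on the columns having a name.

```python
import pandas as pd

# Hypothetical stand-in for combined_data: dates on the index, maturities as columns.
wide = pd.DataFrame(
    {0.25: [4.58, 4.59], 0.5: [4.55, 4.56]},
    index=pd.to_datetime(["2005-01-04", "2005-01-05"]),
)

# melt discards the index by default, so move the dates into a regular column first.
tidy = wide.reset_index().rename(columns={"index": "Date"})
tidy = pd.melt(tidy, id_vars=["Date"], var_name="Maturity")

# Add the constant currency label after melting.
tidy["Currency"] = "GBP"

print(tidy)
```

Each (Date, Maturity) pair ends up on its own row, with the dates kept because they were an identifier column rather than the index.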

Output of result.head():
Try stacking the result after first resetting the index so that it includes the currency:
cd = combined_data.reset_index().set_index(['index', 'Currency'])
cd_new = cd.stack()
>>> cd_new
index       Currency  Maturity
2005-01-04  GBP       0.083333    4.709456
                      0.166667    4.633861
                      0.250000    4.586271
                      0.333333    4.567017
                      0.416667    4.559578
                      0.500000    4.553227
                      0.583333    4.543976
                      0.666667    4.530881
                      0.750000    4.514742
                      0.833333    4.497187
                      0.916667    4.479690
                      1.000000    4.463105
                      1.083333    4.447843
                      1.166667    4.434076
                      1.250000    4.421868
...
2015-05-29  GBP       18.0        2.453898
                      18.5        2.475052
                      19.0        2.494679
                      19.5        2.512787
                      20.0        2.529393
                      20.5        2.544519
                      21.0        2.558198
                      21.5        2.570467
                      22.0        2.581368
                      22.5        2.590947
                      23.0        2.599250
                      23.5        2.606327
                      24.0        2.612229
                      24.5        2.617008
                      25.0        2.620715
Length: 259457, dtype: float64

cd_new.xs('2015-05-29')
Currency  Maturity
GBP       0.333333    0.452339
          0.416667    0.441134
          0.500000    0.430168
          0.583333    0.419990
          0.666667    0.411208
          0.750000    0.404424
          0.833333    0.400017
          0.916667    0.398140
          1.000000    0.398806
          1.083333    0.401943
          1.166667    0.407427
          1.250000    0.415095
          1.333333    0.424762
          1.416667    0.436233
          1.500000    0.449322
...
GBP       18.0        2.453898
          18.5        2.475052
          19.0        2.494679
          19.5        2.512787
          20.0        2.529393
          20.5        2.544519
          21.0        2.558198
          21.5        2.570467
          22.0        2.581368
          22.5        2.590947
          23.0        2.599250
          23.5        2.606327
          24.0        2.612229
          24.5        2.617008
          25.0        2.620715
Length: 97, dtype: float64
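
If a flat table is wanted rather than a Series with a three-level index, the stacked result can be turned back into columns. The following is only a sketch on made-up numbers; `wide`, `stacked` and `flat` are hypothetical names, and the index level names are set explicitly rather than relying on what the real data carries.

```python
import pandas as pd

# Hypothetical stand-in for combined_data with the Currency column already inserted.
wide = pd.DataFrame(
    {0.25: [4.58, 4.59], 0.5: [4.55, 4.56]},
    index=pd.to_datetime(["2005-01-04", "2005-01-05"]),
)
wide.insert(0, "Currency", "GBP")

# Put Currency into the index next to the dates, then stack the maturity columns.
stacked = wide.set_index("Currency", append=True).stack()
stacked.index = stacked.index.set_names(["Date", "Currency", "Maturity"])

# Turn the three index levels back into ordinary columns.
flat = stacked.rename("Yield_pct").reset_index()

print(flat)
```

Unlike the melt approach, stacking groups the rows by date first, so all maturities for one date appear together.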

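Great, thanks. May I ask one more question: is there a way to change the column name value to Yield_pct?

Sure. Personally I like using a dictionary, because it makes it easy to see what the name was before: result.rename(columns={'value': 'Yield_pct'}, inplace=True)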
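
The rename can be tried on a small made-up frame, as sketched below. The frame contents are assumptions for illustration. The second option uses the `value_name` argument of `pd.melt`, which names the column at melt time so that no rename is needed afterwards.

```python
import pandas as pd

# Hypothetical melted frame that still has the default 'value' column name.
result = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-04", "2005-01-05"]),
    "Maturity": [0.25, 0.25],
    "value": [4.58, 4.59],
})

# Option 1: rename afterwards; the dict maps old name -> new name.
renamed = result.rename(columns={"value": "Yield_pct"})

# Option 2: name the value column while melting.
wide = pd.DataFrame({"Date": result["Date"], 0.25: [4.58, 4.59]})
melted = pd.melt(wide, id_vars=["Date"], var_name="Maturity", value_name="Yield_pct")

print(renamed.columns.tolist())
print(melted.columns.tolist())
```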