Python pandas: selecting by date from a MultiIndex

python pandas

Suppose I have a Series with a MultiIndex:

date        foo
2006-01-01  1         12931926.310
            3         11084049.460
            5         10812205.359
            7          9031510.239
            9          5324054.903
2007-01-01  1         11086082.624
            3         12028419.560
            5         11957253.031
            7         10643307.061
            9          6034854.915

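If it were not a MultiIndex, I could select the rows for that year with df.loc['2007']. How do I do that here? My natural guess was df.loc['2007', :], but that gives me an empty result:

Series([], Name: FINLWT21, dtype: float64)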

Ultimate goal: in the end, I also want to replace all rows whose date is not in 2007 with the rows from 2007.

That is, my expected output is

date        foo
2006-01-01  1         11086082.624
            3         12028419.560
            5         11957253.031
            7         10643307.061
            9          6034854.915
2007-01-01  1         11086082.624
            3         12028419.560
            5         11957253.031
            7         10643307.061
            9          6034854.915

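I tried to implement @unutbu's solution, but

mySeries.loc[dateIndex.year != 2007] = mySeries.loc[dateIndex.year == 2007]

naturally sets the values to NaN, because the labels on the left-hand side do not exist on the right-hand side. Usually this kind of problem is solved with

mySeries.loc[dateIndex.year != 2007] = mySeries.loc[dateIndex.year == 2007].values

but since the left-hand side has more values than the right-hand side (and the difference is even larger in my real data set), I get

ValueError: cannot set using a list-like indexer with a different length than the value

The only alternative I can think of right now is to iterate over the first index level and apply the previous command to each subgroup, but that does not seem like the most efficient solution.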

Given the series

In [207]: series
Out[212]: 
date        foo
2006-01-01  1      12931926.310
            3      11084049.460
            5      10812205.359
            7       9031510.239
            9       5324054.903
2007-01-01  1      11086082.624
            3      12028419.560
            5      11957253.031
            7      10643307.061
            9       6034854.915
Name: val, dtype: float64

you can extract the date level of the index using

dateindex = series.index.get_level_values('date')
# Ensure the dateindex is a DatetimeIndex (as opposed to a plain Index)
dateindex = pd.DatetimeIndex(dateindex)

Now you can select the rows whose year equals 2007 with a boolean condition:

# select rows where year equals 2007
series2007 = series.loc[dateindex.year == 2007]
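Putting the two steps above together, here is a minimal self-contained sketch. It rebuilds the example series from the question (the construction with MultiIndex.from_product is my assumption about how the data is laid out) and then applies the same boolean mask:

```python
import pandas as pd

# Rebuild the example series; the values are copied from the question.
dates = pd.to_datetime(["2006-01-01", "2007-01-01"])
index = pd.MultiIndex.from_product([dates, [1, 3, 5, 7, 9]], names=["date", "foo"])
series = pd.Series(
    [12931926.310, 11084049.460, 10812205.359, 9031510.239, 5324054.903,
     11086082.624, 12028419.560, 11957253.031, 10643307.061, 6034854.915],
    index=index,
    name="val",
)

# Extract the 'date' level and make sure it is a DatetimeIndex.
dateindex = pd.DatetimeIndex(series.index.get_level_values("date"))

# Boolean mask with one entry per row: True where the year is 2007.
mask = dateindex.year == 2007
series2007 = series.loc[mask]
print(series2007)
```

The mask has the same length as the series, so it can be passed to .loc no matter how many levels the index has.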

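If the foo values cycle through the same values in the same order for each date, then you can replace all the values in the series with the 2007 values using

# integer division: np.tile needs an integer repeat count
N = len(series) // len(series2007)
series[:] = np.tile(series2007.values, N)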
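As a runnable illustration of the tiling idea, here is a small sketch on made-up data (three dates, three foo values, and the numbers 0 to 8 are all hypothetical). It also checks that the series length is an exact multiple of the 2007 block, which is the assumption this approach relies on:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three dates, the same three foo values per date.
dates = pd.to_datetime(["2006-01-01", "2007-01-01", "2008-01-01"])
index = pd.MultiIndex.from_product([dates, [1, 3, 5]], names=["date", "foo"])
series = pd.Series(np.arange(9, dtype=float), index=index, name="val")

dateindex = pd.DatetimeIndex(series.index.get_level_values("date"))
values_2007 = series.loc[dateindex.year == 2007].to_numpy()

# np.tile needs an integer repeat count, so use integer division
# and refuse to continue if the lengths do not divide evenly.
n_repeats, remainder = divmod(len(series), len(values_2007))
if remainder != 0:
    raise ValueError("series length is not a multiple of the 2007 block")

# Overwrite every date's block with the 2007 values (positional, ignores the index).
series[:] = np.tile(values_2007, n_repeats)
print(series)
```

Because the assignment is purely positional, it only gives the right answer when every date has the same foo values in the same order.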

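One advantage of using np.tile and .values is that it generates the desired array of values relatively quickly. One (possible) disadvantage is that it ignores the index, so it relies on the assumption that the foo values cycle through the same values in the same order for each date.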

A more robust (but slower) approach is to use a join:

df = series.reset_index('date')
df2007 = df.loc[dateindex.year==2007]
df = df.join(df2007, rsuffix='_2007')
df = df[['date', 'val_2007']]
df = df.set_index(['date'], append=True)
df = df.swaplevel(0,1).sort_index()

which yields

In [304]: df
Out[304]: 
                    val_2007
date       foo              
2006-01-01 1    11086082.624
           3    12028419.560
           5    11957253.031
           7    10643307.061
           9     6034854.915
2007-01-01 1    11086082.624
           3    12028419.560
           5    11957253.031
           7    10643307.061
           9     6034854.915
2008-01-01 1    11086082.624
           3    12028419.560
           5    11957253.031
           7    10643307.061
           9     6034854.915
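The same index-aware idea can also be sketched without a join, by looking up each row's 2007 value through its foo label. This is an alternative I am suggesting, not part of the code above; the toy data is hypothetical and deliberately lists the 2006 rows in a different foo order to show that the result does not depend on row order:

```python
import pandas as pd

# Hypothetical toy data; the 2006 rows are intentionally out of order.
dates = pd.to_datetime(["2006-01-01"] * 3 + ["2007-01-01"] * 3)
foo = [5, 1, 3, 1, 3, 5]
index = pd.MultiIndex.from_arrays([dates, foo], names=["date", "foo"])
series = pd.Series([10.0, 20.0, 30.0, 100.0, 300.0, 500.0], index=index, name="val")

dateindex = pd.DatetimeIndex(series.index.get_level_values("date"))

# 2007 values keyed by 'foo' alone (the 'date' level is dropped).
lookup = series.loc[dateindex.year == 2007].droplevel("date")

# For every row, fetch the 2007 value that has the same 'foo' label.
foo_labels = series.index.get_level_values("foo")
replaced = pd.Series(
    lookup.reindex(foo_labels).to_numpy(),
    index=series.index,
    name=series.name,
)
print(replaced)
```

This assumes that each foo label occurs only once within 2007; a foo label that is missing in 2007 would come back as NaN rather than raise an error.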

To select the values for a given year (such as 2007) from a MultiIndex, you can use:

target_year = 2007
df[[ts.year == target_year for ts in df.index.get_level_values(0)]]

If the date index does not hold Timestamps, you need to convert first:

df[[pd.Timestamp(ts).year == target_year for ts in df.index.get_level_values(0)]]
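For illustration, here is a small self-contained sketch of the list-comprehension mask on a hypothetical DataFrame whose date level holds plain strings. It also shows a vectorized variant using pd.to_datetime, which should produce the same mask:

```python
import pandas as pd

# Hypothetical toy frame; the 'date' level holds strings, not Timestamps.
index = pd.MultiIndex.from_product(
    [["2006-01-01", "2007-01-01"], [1, 3]], names=["date", "foo"]
)
df = pd.DataFrame({"val": [1.0, 2.0, 3.0, 4.0]}, index=index)

target_year = 2007
date_labels = df.index.get_level_values("date")

# Row-by-row mask, converting each label to a Timestamp first.
mask = [pd.Timestamp(ts).year == target_year for ts in date_labels]
selected = df[mask]

# Vectorized variant: convert the whole level at once.
mask_vectorized = pd.to_datetime(date_labels).year == target_year
selected_vectorized = df[mask_vectorized]
print(selected)
```

On a large index the vectorized variant is likely to be faster, since it avoids building one Timestamp per row in Python.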

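What if the series contains a leap day, such as 2008-02-29? How would it map to a date in the replacement year?

@unutbu It doesn't: the data is quarterly or annual, and in any case the day is always the same. Sorry, you misunderstood my ultimate goal. But given your first answer, this is straightforward: series.loc[dateindex.year != 2007] = series.loc[dateindex.year == 2007] should do what I want.

I think that would assign NaNs. Did you mean series.loc[dateindex.year != 2007] = series.loc[dateindex.year == 2007].values?

Yes, that is what I meant. I have now tried both, but the latter also raises an error; see the updated question.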