Python Pandas: transforming an upper-triangular DataFrame by shifting rows to the left
I have a DataFrame that looks "upper triangular", and I would like to transform it by shifting the i-th row to the left by i-1 positions:
31-May-11 30-Jun-11 31-Jul-11 31-Aug-11 30-Sep-11 31-Oct-11
OpenDate
2011-05-31 68.432797 81.696071 75.083249 66.659008 68.898034 72.622304
2011-06-30 1.711097 1.501082 1.625213 1.774645 1.661183 NaN
2011-07-31 0.422364 0.263561 0.203572 0.234376 NaN NaN
2011-08-31 1.077009 1.226946 1.520701 NaN NaN NaN
2011-09-30 0.667091 0.495993 NaN NaN NaN NaN
Edit:
I cannot rule out NaNs in the upper half of the matrix, so we might see something like this:
31-May-11 30-Jun-11 31-Jul-11 31-Aug-11 30-Sep-11 31-Oct-11
OpenDate
2011-05-31 68.432797 81.696071 75.083249 66.659008 68.898034 72.622304
2011-06-30 NaN NaN 1.501082 1.625213 1.774645 1.661183
2011-07-31 NaN NaN 0.422364 0.263561 0.203572 0.234376
2011-08-31 NaN NaN NaN 1.077009 1.226946 1.520701
2011-09-30 NaN NaN NaN NaN 0.667091 0.495993
which should become:
31-May-11 30-Jun-11 31-Jul-11 31-Aug-11 30-Sep-11 31-Oct-11
OpenDate
2011-05-31 68.432797 81.696071 75.083249 66.659008 68.898034 72.622304
2011-06-30 NaN 1.501082 1.625213 1.774645 1.661183 NaN
2011-07-31 0.422364 0.263561 0.203572 0.234376 NaN NaN
2011-08-31 1.077009 1.226946 1.520701 NaN NaN NaN
2011-09-30 0.667091 0.495993 NaN NaN NaN NaN
Is there any way to do this?
Thanks,
Anne
You can count the NaN values, drop them, and then append the same number of NaNs at the end. For example:
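import numpy as np
import pandas as pd

def shift_df(row):
    n = len(row)
    # keep only the non-NaN values, in their original order
    new_row = row.dropna().tolist()
    # pad the right-hand side with NaN back to the original length
    new_row += [np.nan] * (n - len(new_row))
    return pd.Series(new_row, index=row.index)
df.apply(shift_df, axis=1)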
where df is your DataFrame. This only works if there are no NaN values in between the "normal" data.
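To see what that caveat means in practice, here is a minimal sketch on a made-up toy frame. The names compact_left and toy are hypothetical and not part of the answer above; the helper does the same drop-and-pad step. The second row contains a genuinely missing value, so it should only move one place to the left, but dropping NaNs moves it two.

```python
import numpy as np
import pandas as pd


def compact_left(row):
    """Move all non-NaN values of a row to the left and pad with NaN."""
    values = row.dropna().to_numpy()
    padded = np.full(len(row), np.nan)
    padded[:len(values)] = values
    return pd.Series(padded, index=row.index)


# Hypothetical data: in row "b" the NaN in column 1 is real missing data,
# only the NaN in column 0 is triangular padding.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [np.nan, np.nan, 5.0],
     [np.nan, np.nan, 6.0]],
    index=["a", "b", "c"],
)

compacted = toy.apply(compact_left, axis=1)
print(compacted)
```

Row "b" ends up starting with 5.0, whereas a shift by one position would have left NaN in the first column and 5.0 in the second. For strictly triangular data with no interior NaNs the two results coincide.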
df.apply(lambda x: x.shift(-x.notnull().argmax()), 1)
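The lambda function finds the position of the first non-null value and shifts the row accordingly. This has two problems: it does not exploit the known (upper-triangular) structure, so it may sacrifice some speed, and it can be fooled by extra NaNs in the data.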
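As a rough illustration of the second problem, the sketch below applies the same idea to a made-up frame in which one row has an extra leading NaN. The helper name shift_to_first_valid and the data are assumptions for the example; the position is taken from the underlying NumPy array so that it is always an integer offset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 should be shifted by one place only,
# because the NaN in column 1 is a real missing value.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [np.nan, np.nan, 6.0, 7.0],
     [np.nan, np.nan, 8.0, 9.0]],
)


def shift_to_first_valid(row):
    """Shift a row left so that its first non-NaN value lands in column 0."""
    first = int(row.notna().to_numpy().argmax())
    return row.shift(-first)


shifted = toy.apply(shift_to_first_valid, axis=1)
print(shifted)
```

Row 1 comes out as 6.0, 7.0, NaN, NaN, although the intended result for that row is NaN, 6.0, 7.0, NaN. Row 2 is shifted correctly because its leading NaNs are all padding.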
Update
A more robust solution, using a counter from itertools:
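from itertools import count
c = count()
# next(c) is the Python 3 spelling of c.next(); the + 1 compensates for apply
# calling the function twice on the first row in the pandas version used here
df.apply(lambda x: x.shift(-next(c) + 1), axis=1)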
As expected, this is a bit faster:
In [47]: %timeit df.apply(lambda x: x.shift(-c.next() + 1), 1)
1000 loops, best of 3: 766 us per loop
In [49]: %timeit df.apply(lambda x: x.shift(-x.notnull().argmax()), 1)
1000 loops, best of 3: 1.08 ms per loop
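Because the counter version depends on how many times apply happens to call the function, a variant that takes the row position from enumerate may be easier to reason about. This is only a sketch under that assumption, not the answer's own code; shift_rows_left and the toy data are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd


def shift_rows_left(df):
    """Shift the row at position i left by i places (row 0 stays as it is)."""
    rows = [row.shift(-i) for i, (_, row) in enumerate(df.iterrows())]
    # Each shifted row keeps its original label as its name, so the
    # rebuilt frame has the same index and columns as the input.
    return pd.DataFrame(rows)


# Hypothetical data with an extra NaN inside the triangle (row 1, column 1).
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [np.nan, np.nan, 6.0, 7.0],
     [np.nan, np.nan, 8.0, 9.0]],
    index=pd.to_datetime(["2011-05-31", "2011-06-30", "2011-07-31"]),
)

result = shift_rows_left(toy)
print(result)
```

Since the shift depends only on the row position, the extra NaN in row 1 is carried along rather than skipped, which matches the behaviour asked for in the edit to the question.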
Setup
I don't know how fast this would be:
In [21]: def f(i,x):
   ....:     return x.shift(-i+1)
   ....:
In [31]: DataFrame([ f(i,x) for i,x in df.iterrows() ])
Out[31]:
0 1 2 3
0 NaN 0 1 2
1 4 5 6 7
2 9 10 11 NaN
3 14 15 NaN NaN
4 19 NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
Here is one way to do it using numpy.
Input:
In [96]: df
Out[96]:
1 2 3 4 5 6
0
2011-05-31 68.433 81.696 75.083 66.659 68.898 72.622
2011-06-30 NaN 1.711 1.501 1.625 1.775 1.661
2011-07-31 NaN NaN 0.422 0.264 0.204 0.234
2011-08-31 NaN NaN NaN 1.077 1.227 1.521
2011-09-30 NaN NaN NaN NaN 0.667 0.496
Code (as timed further down)
Output:
1 2 3 4 5 6
0
2011-05-31 68.433 81.696 75.083 66.659 68.898 72.622
2011-06-30 1.711 1.501 1.625 1.775 1.661 NaN
2011-07-31 0.422 0.264 0.204 0.234 NaN NaN
2011-08-31 1.077 1.227 1.521 NaN NaN NaN
2011-09-30 0.667 0.496 NaN NaN NaN NaN
Let's timeit:
In [95]: %%timeit
....: roller = lambda (i, x): np.roll(x, -i)
....: row_terator = enumerate(df.values)
....: rolled = map(roller, row_terator)
....: result = DataFrame(np.vstack(rolled), index=df.index, columns=df.columns)
....:
10000 loops, best of 3: 101 us per loop
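Note that np.roll is the important part here. It takes an array, an integer number of positions to shift by, and an axis parameter, so you can shift an ndarray along any of its axes. For future reference, this is how I would do it in numpy.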
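The timed cell uses a tuple-unpacking lambda and a list-returning map, both of which are Python 2 idioms. A rough Python 3 equivalent might look like the sketch below; roll_rows_left and the toy frame are hypothetical names for this example. Keep in mind that np.roll wraps around: whatever leaves on the left reappears on the right, so this matches a shift only when the first i entries of row i are NaN padding.

```python
import numpy as np
import pandas as pd


def roll_rows_left(df):
    """Roll the row at position i left by i places using np.roll."""
    values = df.to_numpy()
    # np.roll is circular: the first i entries of row i move to the end.
    rolled = [np.roll(row, -i) for i, row in enumerate(values)]
    return pd.DataFrame(np.vstack(rolled), index=df.index, columns=df.columns)


# Hypothetical strictly upper-triangular data.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [np.nan, 4.0, 5.0],
     [np.nan, np.nan, 6.0]],
    index=["a", "b", "c"],
)

result = roll_rows_left(toy)
print(result)
```

Working on the raw array avoids building one Series per row, which is probably where most of the speed-up over the apply-based versions comes from.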
If your data is a numpy array, another possibility is:
In [75]: m
Out[75]:
array([[-0.69269313, -1.83256202, -0.61047484, 2.22505336, 0.65253538],
[ 0. , 0.21960176, 1.82940845, -1.94429684, -0.42096599],
[ 0. , 0. , 0.44483682, -0.56272361, 0.15877905],
[ 0. , 0. , 0. , -0.54694672, 0.20022243],
[ 0. , 0. , 0. , 0. , 1.82054127]])
In [76]: i = np.triu_indices(len(m))
In [77]: m2 = np.zeros_like(m)
In [78]: m2[i[0], i[1]-i[0]] = m[i]
In [79]: m2
Out[79]:
array([[-0.69269313, -1.83256202, -0.61047484, 2.22505336, 0.65253538],
[ 0.21960176, 1.82940845, -1.94429684, -0.42096599, 0. ],
[ 0.44483682, -0.56272361, 0.15877905, 0. , 0. ],
[-0.54694672, 0.20022243, 0. , 0. , 0. ],
[ 1.82054127, 0. , 0. , 0. , 0. ]])
Of course, if you want to pad with NaN, you can initialize the m2 matrix with that value instead of zeros.
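A minimal sketch of that NaN-filled variant, on a small made-up matrix, could look like this. It uses np.full for the target array and passes the column count to np.triu_indices so that a non-square matrix (such as the 5 x 6 frame in the question) is also covered; the matrix values are assumptions for the example.

```python
import numpy as np

# Hypothetical 3 x 4 upper-triangular matrix (zeros below the diagonal).
m = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.0, 5.0, 6.0, 7.0],
              [0.0, 0.0, 8.0, 9.0]])

# Row and column indices of the upper triangle, including the diagonal.
rows, cols = np.triu_indices(m.shape[0], m=m.shape[1])

# Start from an all-NaN float array instead of zeros.
m2 = np.full(m.shape, np.nan)

# Element (i, j) moves to (i, j - i), i.e. row i shifts left by i places.
m2[rows, cols - rows] = m[rows, cols]
print(m2)
```

The target has to be a float array for the NaN fill to work, which is why np.full is used here rather than zeros_like on a possibly integer input.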
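I'm not sure which approach is more efficient, though.
Thanks, Rutger. Unfortunately, I can't make any assumptions about the NaNs. I've edited the question to clarify.
Thanks, Dan. Unfortunately, I can't make any assumptions about the NaNs. I've edited the question to clarify.
Hi Dan, I implemented your solution, but I ran into a strange problem: x.shift(-c.next()) is applied to the first row twice, which advances the counter to 2, but it works fine for the remaining rows. This doesn't make much sense to me... Is it because of the datetimes in the index? When I change the function to print the name of the row it is processing, I get 2011-05-31T01:00:00.000000000+0100, 2011-05-31 00:00:00, 2011-06-30 00:00:00 and so on, so you can see that 2011-05-31 is processed twice. Strangely, the result contains only one row for 2011-05-31. Any ideas?
That's why I have x.shift(-c.next()+1). I believe pandas calls the lambda function once before actually using it to generate the result. When pandas begins a large operation, it sometimes explores separate "paths", that is, ways of executing the function, looking for the fastest one. I haven't dug into the code to check whether that is what happens here, but that's my guess. Since c is a stateful iterator, the path exploration is a problem because it advances c by 1. I took a (questionable!) tweak-it approach and said, "Good enough; it works."
@DanAllan is exactly right; apply does (deliberately) call the function twice, to see whether there are in-place modifications (in which case the slow path is taken); otherwise it can take a faster path.
Per the question, there is no need to do any conversion. The index in the output is the same as the index in the input. See my answer, which uses numpy and is about 7x faster.
I'm not sure about the i=0 case, but you could do a conditional shift inside shift if needed.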