Python: fastest way to convert a 2D DataFrame to a Series
I have a large DataFrame of stock returns, with one row per symbol and one column per date. It looks roughly like this:
      2018-10-06  2018-11-17  2018-12-29  ...  2020-09-19  2020-10-31  2020-12-12
BIOL      -15.33      -22.05       84.85  ...      -10.37       11.20      274.15
SRDX      -11.67      -16.84       12.06  ...       -4.66        4.43       17.36
LPTH       -2.65      -19.02        2.68  ...       -1.63       21.58       32.08
VHI        -4.91       -8.50       55.96  ...       -4.18       25.68        0.12
THMO       21.21      -41.98       30.01  ...      -33.89        2.99       39.29
I need to convert it into a single DataFrame column, with exactly one data point per row. Like this:
(2018-10-06 00:00:00, BIOL) -15.33
(2018-10-06 00:00:00, SRDX) -11.67
(2018-10-06 00:00:00, LPTH) -2.65
(2018-10-06 00:00:00, VHI) -4.91
(2018-10-06 00:00:00, THMO) 21.21
... ...
(2020-12-12 00:00:00, BIOL) 274.15
(2020-12-12 00:00:00, SRDX) 17.36
(2020-12-12 00:00:00, LPTH) 32.08
(2020-12-12 00:00:00, VHI) 0.12
(2020-12-12 00:00:00, THMO) 39.29
My code works, but it is slow. What is the fastest way to do this?
Sample code:
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame.from_dict({
Timestamp('2018-10-06 00:00:00') : {
'BIOL': -15.33, 'SRDX': -11.67, 'LPTH': -2.65, 'VHI': -4.91, 'THMO': 21.21
}, Timestamp('2018-11-17 00:00:00'): {
'BIOL': -22.05, 'SRDX': -16.84, 'LPTH': -19.02, 'VHI': -8.5, 'THMO': -41.98
}, Timestamp('2018-12-29 00:00:00'): {
'BIOL': 84.85, 'SRDX': 12.06, 'LPTH': 2.68, 'VHI': 55.96, 'THMO': 30.01
}, Timestamp('2019-02-09 00:00:00'): {
'BIOL': 31.15, 'SRDX': -22.09, 'LPTH': -0.65, 'VHI': -23.89, 'THMO': -13.54
}, Timestamp('2019-03-23 00:00:00'): {
'BIOL': -11.25, 'SRDX': 8.56, 'LPTH': 1.97, 'VHI': 5.26, 'THMO': -12.0
}, Timestamp('2019-05-04 00:00:00'): {
'BIOL': -26.29, 'SRDX': -8.73, 'LPTH': -40.7, 'VHI': -6.99, 'THMO': 5.68
}, Timestamp('2019-06-15 00:00:00'): {
'BIOL': -2.55, 'SRDX': -2.47, 'LPTH': -17.32, 'VHI': -5.88, 'THMO': 3.58
}, Timestamp('2019-07-27 00:00:00'): {
'BIOL': -37.91, 'SRDX': 11.61, 'LPTH': -1.32, 'VHI': 1.98, 'THMO': 41.87
}, Timestamp('2019-09-07 00:00:00'): {
'BIOL': -27.24, 'SRDX': 0.45, 'LPTH': -3.47, 'VHI': -4.29, 'THMO': 29.51
}, Timestamp('2019-10-19 00:00:00'): {
'BIOL': -20.43, 'SRDX': -8.95, 'LPTH': -24.03, 'VHI': -7.5, 'THMO': -39.17
}, Timestamp('2019-11-30 00:00:00'): {
'BIOL': 5.47, 'SRDX': -0.74, 'LPTH': 20.0, 'VHI': -4.89, 'THMO': 21.98
}, Timestamp('2020-01-11 00:00:00'): {
'BIOL': 24.12, 'SRDX': -9.33, 'LPTH': 110.61, 'VHI': -14.29, 'THMO': 15.74
}, Timestamp('2020-02-22 00:00:00'): {
'BIOL': -68.06, 'SRDX': -8.63, 'LPTH': -25.18, 'VHI': -28.96, 'THMO': -19.74
}, Timestamp('2020-04-04 00:00:00'): {
'BIOL': 65.43, 'SRDX': 5.53, 'LPTH': 106.73, 'VHI': -20.26, 'THMO': 105.19
}, Timestamp('2020-05-16 00:00:00'): {
'BIOL': 25.47, 'SRDX': 22.79, 'LPTH': 40.93, 'VHI': 2.14, 'THMO': -27.56
}, Timestamp('2020-06-27 00:00:00'): {
'BIOL': -7.96, 'SRDX': 8.95, 'LPTH': -3.63, 'VHI': 7.95, 'THMO': 17.65
}, Timestamp('2020-08-08 00:00:00'): {
'BIOL': -32.86, 'SRDX': -18.31, 'LPTH': -16.1, 'VHI': 31.41, 'THMO': -48.59
}, Timestamp('2020-09-19 00:00:00'): {
'BIOL': -10.37, 'SRDX': -4.66, 'LPTH': -1.63, 'VHI': -4.18, 'THMO': -33.89
}, Timestamp('2020-10-31 00:00:00'): {
'BIOL': 11.2, 'SRDX': 4.43, 'LPTH': 21.58, 'VHI': 25.68, 'THMO': 2.99
}, Timestamp('2020-12-12 00:00:00'): {
'BIOL': 274.15, 'SRDX': 17.36, 'LPTH': 32.08, 'VHI': 0.12, 'THMO': 39.29
}
})
print(df)
d = {}
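# DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement.
for _, col in df.items():
    # col.name is the column's date, col.index holds the symbols.
    d.update({(date, symbol): pct
              for date, symbol, pct in zip([col.name] * len(col), col.index, col)})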
df2 = pd.DataFrame.from_dict(d, orient='index')
print(df2)
For a MultiIndex Series with the datetimes in the first level, use DataFrame.unstack:
s = df.unstack()
print(s)
2018-10-06  BIOL    -15.33
            SRDX    -11.67
            LPTH     -2.65
            VHI      -4.91
            THMO     21.21
                     ...
2020-12-12  BIOL    274.15
            SRDX     17.36
            LPTH     32.08
            VHI       0.12
            THMO     39.29
Length: 100, dtype: float64
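The question asks for a single DataFrame column rather than a Series, so the unstacked result can be wrapped with Series.to_frame. This is a minimal sketch on a small stand-in frame. The column name "pct" and the flattened copy df2_flat are illustrative choices, not part of the original post.

```python
import pandas as pd

# Small stand-in for the frame in the question: symbols as rows, dates as columns.
df = pd.DataFrame(
    {
        pd.Timestamp("2018-10-06"): {"BIOL": -15.33, "SRDX": -11.67},
        pd.Timestamp("2018-11-17"): {"BIOL": -22.05, "SRDX": -16.84},
    }
)

# unstack() moves the column labels into the outer index level: (date, symbol).
s = df.unstack()

# One-column DataFrame; "pct" is an arbitrary, illustrative column name.
df2 = s.to_frame(name="pct")

# Optional: flatten the MultiIndex into plain (date, symbol) tuples,
# which resembles the index shown in the question's desired output.
df2_flat = df2.copy()
df2_flat.index = df2.index.to_flat_index()

print(df2)
print(df2_flat)
```

Keeping the MultiIndex is usually more convenient for later selection by date or symbol. The flat tuple index is only worth it if downstream code expects tuples.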
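Or, if the datetimes need to be in the second level, use DataFrame.stack:
s = df.stack()
print(s)
BIOL  2018-10-06    -15.33
      2018-11-17    -22.05
      2018-12-29     84.85
      2019-02-09     31.15
      2019-03-23    -11.25
                     ...
THMO  2020-06-27     17.65
      2020-08-08    -48.59
      2020-09-19    -33.89
      2020-10-31      2.99
      2020-12-12     39.29
Length: 100, dtype: float64
NumPy alternative, using the Series constructor with np.repeat, np.tile and np.ravel:
import numpy as np
# Row labels repeated once per column, column labels cycled once per row.
r = np.repeat(df.index, len(df.columns))
c = np.tile(df.columns, len(df))
# Row-major (C order) flattening matches the label order above;
# order='F' would pair the values with the wrong labels.
v = np.ravel(df.to_numpy(), order='C')
s = pd.Series(v, index=pd.MultiIndex.from_arrays([r, c]))
print(s)
BIOL  2018-10-06    -15.33
      2018-11-17    -22.05
      2018-12-29     84.85
      2019-02-09     31.15
      2019-03-23    -11.25
                     ...
THMO  2020-06-27     17.65
      2020-08-08    -48.59
      2020-09-19    -33.89
      2020-10-31      2.99
      2020-12-12     39.29
Length: 100, dtype: float64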
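With the NumPy approach, the flattening order passed to ravel has to agree with how the two label arrays are built. Otherwise the values silently end up under the wrong labels. The sketch below uses a small stand-in frame to build both layouts and compare them with stack() and unstack(). The helper names stack_numpy and unstack_numpy are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in frame: symbols as rows, dates as columns.
df = pd.DataFrame(
    {
        pd.Timestamp("2018-10-06"): {"BIOL": -15.33, "SRDX": -11.67, "LPTH": -2.65},
        pd.Timestamp("2018-11-17"): {"BIOL": -22.05, "SRDX": -16.84, "LPTH": -19.02},
    }
)


def stack_numpy(frame):
    """(row, column) label order with a row-major ravel, like frame.stack()."""
    r = np.repeat(frame.index, len(frame.columns))
    c = np.tile(frame.columns, len(frame))
    v = frame.to_numpy().ravel(order="C")
    return pd.Series(v, index=pd.MultiIndex.from_arrays([r, c]))


def unstack_numpy(frame):
    """(column, row) label order with a column-major ravel, like frame.unstack()."""
    c = np.repeat(frame.columns, len(frame.index))
    r = np.tile(frame.index, len(frame.columns))
    v = frame.to_numpy().ravel(order="F")
    return pd.Series(v, index=pd.MultiIndex.from_arrays([c, r]))


s_rows_first = stack_numpy(df)
s_dates_first = unstack_numpy(df)

# Compare with the built-in methods (this frame has no missing values).
print(s_rows_first.equals(df.stack()))
print(s_dates_first.equals(df.unstack()))
```

Which variant is actually fastest depends on the frame's shape and the pandas version, so it is worth timing the candidates on the real data, for example with timeit.timeit(lambda: df.unstack(), number=100) from the standard library. Also note that older pandas versions drop missing values in stack() by default, while the NumPy construction keeps them.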