Python: why is rolling apply so slow?
I have a large DataFrame, 100,000,000 × 50 (about 4 GB). I want to compute a weighted average over a rolling window like this:
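# df shape is (100,000,000 * 50)
from functools import partial
import numpy as np

window_size = [1, 2, 3, 4, 5, 6]
for i in window_size:
    # Weights 1..i: range(i) starts at 0, so for i=1 the weights sum to zero and np.average raises
    df['triangle_mv_%d' % i] = df['mid'].diff(1).rolling(i).apply(
        partial(np.average, weights=range(1, i + 1)))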
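I found this quite slow: a single iteration of the loop takes more than 15 minutes.
I can't understand this, because rolling with mean is quite fast. I am only calling apply with a weighted average, so how can it be this slow?
I also searched a lot, and some references suggested rewriting the weighted mean function so that it works with rolling, like this: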
import pandas as pd
import numpy as np
# Note: these are private pandas internals, so they may differ or be missing in other pandas versions
from pandas.core.window.rolling import _flex_binary_moment, _Rolling_and_Expanding

def weighted_mean(self, weights, **kwargs):
    weights = self._shallow_copy(weights)
    window = self._get_window(weights)

    def _get_weighted_mean(X, Y):
        X = X.astype('float64')
        Y = Y.astype('float64')
        sum_f = lambda x: x.rolling(window, self.min_periods, center=self.center).sum(**kwargs)
        print(X)  # debug output
        print(Y)  # debug output
        return sum_f(X * Y) / sum_f(Y)

    return _flex_binary_moment(self._selected_obj, weights._selected_obj,
                               _get_weighted_mean, pairwise=True)

# Attach the method to the rolling class
_Rolling_and_Expanding.weighted_mean = weighted_mean

df = pd.DataFrame(np.reshape(range(25), (5, 5)))
# This is wrong: the result should have 4 valid values, but the output
# has only one, like this: [NaN, 4.333, NaN, NaN, NaN]
print(df[1].rolling(2).weighted_mean(pd.Series([1, 2])))
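For comparison, here is a minimal sketch that yields the four expected values on the same toy frame using only public pandas calls. It is not the approach from the references: it simply adds up weighted, shifted copies of the column. The helper name `rolling_weighted_mean` is hypothetical, and the sketch assumes the last weight applies to the newest value in each window (the same ordering `np.average` would see inside `rolling.apply`).

```python
import numpy as np
import pandas as pd

def rolling_weighted_mean(series, weights):
    """Trailing weighted mean; weights[-1] applies to the newest value (hypothetical helper)."""
    w = np.asarray(weights, dtype="float64")
    n = len(w)
    total = pd.Series(0.0, index=series.index)
    for k in range(n):
        # shift(k) brings the value from k rows back onto the current row;
        # rows without enough history become NaN automatically
        total = total + w[n - 1 - k] * series.shift(k)
    return total / w.sum()

df = pd.DataFrame(np.arange(25).reshape(5, 5))
result = rolling_weighted_mean(df[1], [1, 2])
print(result)
```

The Python loop here runs once per weight (at most 6 times for the window sizes in the question), not once per row, so each pass is a vectorized operation over the whole column.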
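Can anyone help? How can I implement this so that it runs fast? Why is the apply method so slow?

apply is really just a convenience function... it is basically as slow as a for loop

@JoranBeasley OK... I thought it would be efficient enough, but rolling.mean is quite fast, which confuses me. Reading the pandas source code is too hard for me.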
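To illustrate the "as slow as a for loop" comment, the sketch below counts how often `rolling.apply` calls back into Python and compares the result with a vectorized version built on `np.convolve`. The names `rolling_weighted_mean_np` and `counted_average` are hypothetical, the data is random, and the weights are assumed to run from oldest to newest within each window.

```python
import numpy as np
import pandas as pd

def rolling_weighted_mean_np(values, weights):
    """Trailing weighted mean via np.convolve; weights[-1] applies to the newest value."""
    x = np.asarray(values, dtype="float64")
    w = np.asarray(weights, dtype="float64")
    n = len(w)
    out = np.full(len(x), np.nan)  # the first n-1 rows have no full window
    if len(x) >= n:
        # Convolution flips its second argument, so pass the weights reversed
        out[n - 1:] = np.convolve(x, w[::-1], mode="valid") / w.sum()
    return out

rng = np.random.default_rng(0)
mid = pd.Series(rng.normal(size=1_000).cumsum())  # stand-in for df['mid']
weights = [1, 2, 3]

calls = {"n": 0}

def counted_average(window):
    # Runs in the Python interpreter once for every window
    calls["n"] += 1
    return np.average(window, weights=weights)

slow = mid.rolling(len(weights)).apply(counted_average, raw=True)
fast = rolling_weighted_mean_np(mid.to_numpy(), weights)

print("Python-level calls made by apply:", calls["n"])
print("results match:", np.allclose(slow.to_numpy(), fast, equal_nan=True))
```

With 100,000,000 rows, the apply version makes on the order of 100,000,000 Python function calls per window size, while `rolling.mean` and the convolution both stay in compiled code. That is a plausible reason for the difference in speed described in the question.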