Python: why is groupby.diff so slow?

I want to compute the difference of a series within each group, as in the following example:

In [24]: rnd_ser = pd.Series(np.random.randn(5000))
    ...: com_ser = pd.concat([rnd_ser] * 500, keys=np.arange(500), names=['Date', 'ID'])

In [25]: d1 = com_ser.groupby("Date").diff()

In [26]: d2 = com_ser - com_ser.groupby("Date").shift()

In [27]: np.allclose(d1.fillna(0), d2.fillna(0))
Out[27]: True
Both approaches give the same result, but the first performs far worse:

In [30]: %timeit d1 = com_ser.groupby("Date").diff()
616 ms ± 5.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: %timeit d2 = com_ser - com_ser.groupby("Date").shift()
95 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Is this expected, or is it a bug?

My environment details are as follows:

In [23]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.3
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.1
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

FWIW, I see similar numbers on my machine:

%timeit d1 = com_ser.groupby("Date").diff()
523 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit d2 = com_ser - com_ser.groupby("Date").shift()
80.8 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The extra cost seems to be specific to using groupby(). For example, if I make a big Series

big_ser = pd.Series(np.random.randn(int(1e7)))

and then compare shift-and-subtract against Series.diff(), the two implementations take essentially the same time (see the timing sketch below).
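A minimal, self-contained version of that comparison (assuming the big_ser defined above; exact timings will vary by machine):

import numpy as np
import pandas as pd
from timeit import timeit

# A big Series with no grouping involved.
big_ser = pd.Series(np.random.randn(int(1e7)))

# Time Series.diff() against its manual shift-and-subtract equivalent.
n = 10
t_diff = timeit(lambda: big_ser.diff(), number=n) / n
t_shift = timeit(lambda: big_ser - big_ser.shift(), number=n) / n

print(f"Series.diff():      {t_diff:.3f} s per loop")
print(f"shift-and-subtract: {t_shift:.3f} s per loop")
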
Moreover, when you look at the internal source code of Series.diff, the comment says so explicitly:

def diff(arr, n, axis=0):
    """
    difference of n between self,
    analogous to s-s.shift(n)
    """

So I think this must be some overhead in groupby that is specific to diff().
I have seen similar reports before: "for some reason .groupby with .diff uses a lot of memory and is quite inefficient".
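
As a practical workaround, the question's second approach (per-group shift-and-subtract) avoids this overhead while producing the same result. A minimal sketch using the question's own setup:

import numpy as np
import pandas as pd

# Same setup as the question: 500 groups of 5000 values each.
rnd_ser = pd.Series(np.random.randn(5000))
com_ser = pd.concat([rnd_ser] * 500, keys=np.arange(500), names=['Date', 'ID'])

# groupby(...).shift() respects group boundaries, so the first element of
# each group becomes NaN, matching groupby(...).diff() exactly.
fast = com_ser - com_ser.groupby("Date").shift()
slow = com_ser.groupby("Date").diff()
assert np.allclose(fast.fillna(0), slow.fillna(0))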