How can I improve the performance of filling gaps in a time series and data lists with Python?

Tags: python, performance, math, optimization, numeric

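I have a time series data set consisting of 10 Hz data. Over one year, that is about 3.1*10^8 rows (each row has a timestamp and 8 float values). The data has gaps, which I need to identify and fill with NaN. My Python code below does this, but its performance is far too poor for my problem: I cannot get through my data set in anything close to a reasonable time.

Below is a minimal working example. I have series (the time series data) and the lists data_a and data_b, all of the same length: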

series      = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a      = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b      = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]
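I would like series to advance in steps of 1, so the gaps in series are 4.1, 5.1, 6.1, 11.1, 12.1, 13.1, 17.1, 18.1 and 19.1. data_a and data_b should be filled with float('nan') at those positions. For example, data_b should become:

[1.2, 1.2, 1.2, nan, nan, nan, 2.2, 2.2, 2.2, 2.2, nan, nan, nan, 3.2, 3.2, 3.2, nan, nan, nan, 4.2]

I achieve this with: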

d_max = 1.0    # Normal increment in series where no gaps shall be filled
shift = 0

for i in range(len(series)-1):
    diff = series[i+1] - series[i]
    if diff > d_max:
        num_fills = int(round(diff/d_max)) - 1    # Number of fills within one gap
        for it in range(num_fills):
            data_a.insert(i+1+it+shift, float('nan'))
            data_b.insert(i+1+it+shift, float('nan'))
        shift = shift + num_fills          # Shift the index by the number of inserts from the previous gap filling

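I searched for other solutions to this problem, but only came across the use of the find() function, which yields the indices of the gaps. Is find() faster than my solution? And how could I then insert NaN into data_a and data_b in a more efficient way?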

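First, realize that your innermost loop is not necessary:

for it in range(num_fills):
    data_a.insert(i+1+it+shift, float('nan'))

is the same as

data_a[i+1+shift:i+1+shift] = [float('nan')] * int(num_fills)

This might make it slightly faster, because there is less allocation and fewer items are moved around.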
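As a quick check of that equivalence, here is a minimal sketch. The helper names fill_with_insert and fill_with_slice are made up for illustration and do not appear in the question's code.

```python
import math

NAN = float("nan")

def fill_with_insert(data, index, count):
    """Insert `count` NaNs at `index`, one list.insert call at a time."""
    out = list(data)
    for k in range(count):
        out.insert(index + k, NAN)
    return out

def fill_with_slice(data, index, count):
    """Insert `count` NaNs at `index` with a single slice assignment."""
    out = list(data)
    out[index:index] = [NAN] * count
    return out

data = [1.2, 1.2, 1.2, 2.2]
by_insert = fill_with_insert(data, 3, 3)
by_slice = fill_with_slice(data, 3, 3)

# NaN never compares equal to itself, so compare the NaN positions explicitly.
same = all(
    (math.isnan(x) and math.isnan(y)) or x == y
    for x, y in zip(by_insert, by_slice)
)
print(len(by_insert), len(by_slice), same)
```

Both versions still shift the tail of the list, but the slice form does it once per gap instead of once per inserted element.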

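Then, for large numerical problems, always use NumPy. It may take some effort to learn, but performance is likely to improve by orders of magnitude. Start with something like: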

import numpy as np

series = np.array([1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1])
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

d_max = 1.0    # Normal increment in series where no gaps shall be filled
shift = 0

# The following two statements are vectorized: NumPy runs
# the element-wise loops implicitly at the C level
diff = series[1:] - series[:-1]
num_fills = np.round(diff / d_max) - 1
for i in np.where(diff > d_max)[0]:
    nf = int(num_fills[i])    # list repetition needs a plain int, not a NumPy float
    nans = [np.nan] * nf
    data_a[i+1+shift:i+1+shift] = nans
    data_b[i+1+shift:i+1+shift] = nans
    shift = shift + nf
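
The loop above still inserts into Python lists once per gap. If the timestamps really do lie on a regular grid with spacing d_max (an assumption, as is the helper name fill_on_grid), one possible sketch is to compute the grid slot of every sample and scatter the values into a preallocated NaN array, with no Python-level loop at all:

```python
import numpy as np

def fill_on_grid(series, data, d_max=1.0):
    """Return `data` placed on a regular grid, with NaN in the missing slots.

    Assumes `series` is sorted and every timestamp is (close to) a
    multiple of `d_max` away from the first one.
    """
    series = np.asarray(series, dtype=float)
    data = np.asarray(data, dtype=float)
    # Grid slot of each original sample, e.g. 0, 1, 2, 6, 7, ...
    slots = np.rint((series - series[0]) / d_max).astype(np.int64)
    filled = np.full(slots[-1] + 1, np.nan)
    filled[slots] = data    # fancy-index assignment scatters all values at once
    return filled

series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

filled_b = fill_on_grid(series, data_b)
print(filled_b)
```

The output array is allocated once at its final size, so nothing is shifted around as gaps are filled. Integer columns such as data_a become float here, because NaN has no integer representation.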
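IIRC, inserting into Python lists is expensive, and the cost depends on the size of the list.

I would suggest not loading your huge data set into memory at all, but iterating through it with a generator function, something like: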

series      = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a      = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b      = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

def fillGaps(series, data_a, data_b, d_max=1.0):
  prev = None
  # zip is lazy in Python 3 (on Python 2, use itertools.izip and xrange)
  for s, a, b in zip(series, data_a, data_b):
    if prev is not None:
      diff = s - prev
      if diff > d_max:
        # emit one NaN pair for every missing step in the gap
        for x in range(int(round(diff/d_max))-1):
          yield (float('nan'), float('nan'))
    prev = s
    yield (a, b)

newA = []
newB = []
for a, b in fillGaps(series, data_a, data_b):
  newA.append(a)
  newB.append(b)

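For example, feed the data to izip (zip in Python 3) as you read it, and write the results out instead of appending them to lists.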
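
A minimal sketch of that streaming idea follows. The function name fill_gaps_rows, the CSV output and the in-memory io.StringIO target are all assumptions made for illustration; in practice the rows would come from a file reader and go to a real output file.

```python
import csv
import io
import math

def fill_gaps_rows(rows, d_max=1.0):
    """Yield (timestamp, a, b) rows, adding NaN rows where timestamps skip.

    `rows` can be any iterable, so the input never has to be held in
    memory as a whole.
    """
    prev = None
    for t, a, b in rows:
        if prev is not None:
            missing = int(round((t - prev) / d_max)) - 1
            for k in range(1, missing + 1):
                # reconstruct the missing timestamp, fill the data with NaN
                yield (prev + k * d_max, math.nan, math.nan)
        prev = t
        yield (t, a, b)

series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

out = io.StringIO()          # stand-in for an output file
writer = csv.writer(out)
n_rows = 0
for row in fill_gaps_rows(zip(series, data_a, data_b)):
    writer.writerow(row)     # write each row out immediately
    n_rows += 1
```

Only one row is alive at a time, so memory use stays flat no matter how long the series is.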


What is your access pattern? If you do not need the result to be evaluated eagerly, you can generate [(series, a, b)] lazily, which may be faster depending on how you use it. Or, if you need random access, a container such as a class implementing __getitem__ could do something smarter.
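
To make the random-access idea concrete, here is one possible sketch. The class name GapFilledView is hypothetical, and it assumes sorted timestamps on a regular grid with spacing d_max. It looks up the requested grid slot on demand rather than building the filled list.

```python
import math
import numpy as np

class GapFilledView:
    """Read-only view that behaves like the gap-filled data without storing it."""

    def __init__(self, series, data, d_max=1.0):
        series = np.asarray(series, dtype=float)
        self._data = data
        # Grid slot occupied by each original sample (sorted, increasing)
        self._slots = np.rint((series - series[0]) / d_max).astype(np.int64)

    def __len__(self):
        return int(self._slots[-1]) + 1

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError(i)
        # Binary search for the slot; a miss means the slot lies in a gap
        j = int(np.searchsorted(self._slots, i))
        if self._slots[j] == i:
            return self._data[j]
        return math.nan

series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]

view = GapFilledView(series, data_b)
print(len(view), view[2], view[3], view[6])
```

Each lookup costs a binary search over the original samples, so this pays off mainly when only a small part of the filled series is ever read.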