Algorithm: find the best matching sequence of numbers in an array of sequences


Suppose I have two arrays:

float arr[] = {40.4357,40.6135,40.2477,40.2864,39.3449,39.8901,40.103,39.9959,39.7863,39.9102,39.2652,39.2688,39.5147,38.2246,38.5376,38.4512,38.9951,39.0999,39.3057,38.53,38.2761,38.1722,37.8816,37.6521,37.8306,38.0853,37.9644,38.0626,38.0567,38.3518,38.4044,38.3553,38.4978,38.3768,38.2058,38.3175,38.3123,38.262,38.0093,38.3685,38.0111,38.4539,38.8122,39.1413,38.9409,39.2043,39.3538,39.4123,39.3628,39.2825,39.1898,39.0431,39.0634,38.5993,38.252,37.3793,36.6334,36.4009,35.2822,34.4262,34.2119,34.1552,34.3325,33.9626,33.2661,32.3819,35.1959,36.7602,37.9039,37.8103,37.5832,37.9718,38.3111,38.9323,38.6763,39.1163,38.8469,39.805,40.2627,40.3689,40.4064,40.0558,40.815,41.0234,41.0128,41.0296,41.0927,40.7046,40.6775,40.2711,40.1283,39.7518,40.0145,40.0394,39.8461,39.6317,39.5548,39.1996,38.9861,38.8507,38.8603,38.483,38.4711,38.4214,38.4286,38.5766,38.7532,38.7905,38.6029,38.4635,38.1403,36.6844,36.616,36.4053,34.7934,34.0226,33.0505,33.4978,34.6106,35.284,35.7535,35.3541,35.5481,35.4086,35.7096,36.0526,36.1222,35.9408,36.1007,36.7952,36.99,37.1024,37.0993,37.3144,36.6951,37.1213,38.0026,38.1266,39.2538,38.8963,39.0158,38.6235,38.7908,38.6041,38.4489,38.3207,37.7398,38.5304,38.925,38.7249,38.9221,39.1704,39.5113,40.0613,39.3602,39.8689,39.973,40.0524,40.0025,40.7584,40.9714,40.9106,40.9685,40.6554,39.7314,39.0044,38.7183,38.5163,38.6101,38.2004,38.7606,38.7532,37.8903,37.8403,38.5368,39.0462,38.8279,39.0748,39.2907,38.5447,38.423,38.5624,38.476,38.5784,39.0905,39.379,39.4739,39.5774,40.7036,40.3044,39.6162,39.9967,40.0562,39.3426,38.666,38.7561,39.2823,38.8548,37.6214,37.8188,38.1086,38.3619,38.5472,38.1357,38.1422,37.95,37.1837,37.4636,36.8852,37.1617,37.5051,37.7724,38.0879,37.7197,38.0422,37.8551,38.5688,38.8388};
float pattern[] = {38.6434,38.1409,37.3391,37.5457,37.7487,37.7499,37.6121,37.4789,37.5821,37.6541,38.0365,37.7907,37.9932,37.9945,37.7032,37.3556,37.6359,37.5412,37.5296,37.8829,38.3797,38.4452,39.0929,39.1233,39.3014,39.0317,38.903,38.8221,39.045,38.6944,39.0699,39.0978,38.9877,38.8123,38.7491,38.5888,38.7875,38.2086,37.7484,37.3961,36.8663,36.2607,35.8838,35.3297,35.5574,35.7239};
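I uploaded this example chart:

As the chart shows, the pattern almost matches the array at index 17.


What is the best and fastest way to find this index? And is there a way to get a confidence value for the match, given that the values are not exactly equal, as you can see?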

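If the start index is your only degree of freedom, you can simply try every index and compute the sum of squared errors over the data points. In Python, this might look like: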

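data = [40.4357, 40.6135, 40.2477, …]
pattern = [38.6434, 38.1409, 37.3391, 37.5457, 37.7487, …]
best_ind, best_err = 0, float("inf")
# The + 1 includes the last window that still fits inside data.
for i in range(len(data) - len(pattern) + 1):
    sub_data = data[i:i + len(pattern)]
    # Sum of squared differences between this window and the pattern.
    err = sum((d - p) ** 2 for (d, p) in zip(sub_data, pattern))
    if err < best_err:
        best_ind, best_err = i, err
Result:

>>> print(best_ind, round(best_err, 8))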
17 21.27929269
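
The question also asks how good the match is. One hedged way to read the number above, which is not part of the original answer: divide the sum of squared errors by the pattern length and take the square root. That gives a root-mean-square error in the same units as the data. The helper name `rmse_profile` and the toy arrays are assumptions made for illustration.

```python
import math

def rmse_profile(data, pattern):
    """Return the RMSE between pattern and every full window of data."""
    n = len(pattern)
    profile = []
    for i in range(len(data) - n + 1):
        sse = sum((d - p) ** 2 for d, p in zip(data[i:i + n], pattern))
        profile.append(math.sqrt(sse / n))
    return profile

# Toy data (not the arrays from the question); the pattern fits best at index 2.
toy_data = [5.0, 4.0, 1.0, 2.0, 3.0, 9.0]
toy_pattern = [1.1, 2.0, 2.9]
profile = rmse_profile(toy_data, toy_pattern)
best = min(range(len(profile)), key=profile.__getitem__)
print(best, profile[best])
```

For the arrays in the question, a sum of 21.27929269 over 46 pattern points corresponds to an RMSE of roughly 0.68. Comparing the best value with the rest of the profile gives a rough feel for how distinctive the match is. It is not a statistical confidence.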

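A naive algorithm is to choose a similarity measure (how you describe similarity: this could be the mean of the errors, the mean of their squares, or any other function that suits your purpose) and apply these steps:

  1. Let i = 0 be an integer index and M a container of size = length(data) - length(pattern) + 1 for storing the measures.
  2. If i < size, shift the pattern by i; otherwise go to step 5.
  3. Compute the similarity measure and store it in M.
  4. Set i = i + 1, go to step 2 and repeat.
  5. Select the index of the minimum value in M.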
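
The steps above can be sketched as a small function that takes the similarity measure as a parameter. This is a minimal sketch, not the answerer's code: the names `best_shift`, `mean_abs_error` and `mean_squared_error` and the toy arrays are assumptions, and the pattern is assumed to be no longer than the data.

```python
def best_shift(data, pattern, measure):
    """Slide pattern over data; return (best_index, list_of_measures)."""
    size = len(data) - len(pattern) + 1         # number of valid shifts
    m = []                                      # container M from step 1
    for i in range(size):                       # steps 2 and 4
        window = data[i:i + len(pattern)]       # data under the shifted pattern
        m.append(measure(window, pattern))      # step 3
    best = min(range(size), key=m.__getitem__)  # step 5
    return best, m

def mean_abs_error(window, pattern):
    return sum(abs(w - p) for w, p in zip(window, pattern)) / len(pattern)

def mean_squared_error(window, pattern):
    return sum((w - p) ** 2 for w, p in zip(window, pattern)) / len(pattern)

# Toy data, chosen so that both measures agree on index 2.
toy_data = [3.0, 7.0, 1.0, 2.0, 4.0, 8.0]
toy_pattern = [1.0, 2.5, 4.0]
print(best_shift(toy_data, toy_pattern, mean_abs_error)[0])
print(best_shift(toy_data, toy_pattern, mean_squared_error)[0])
```

Passing the measure as an argument keeps the loop unchanged when you switch between absolute and squared error.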

  • It is a one-liner in Python, using the fact that tuples are compared lexicographically:

    In [1]:
    
    import numpy as np
    arr = np.array( [ 40.4357,40.6135,40.2477,40.2864,39.3449,39.8901,40.103,39.9959,39.7863,39.9102,39.2652,39.2688,39.5147,38.2246,38.5376,38.4512,38.9951,39.0999,39.3057,38.53,38.2761,38.1722,37.8816,37.6521,37.8306,38.0853,37.9644,38.0626,38.0567,38.3518,38.4044,38.3553,38.4978,38.3768,38.2058,38.3175,38.3123,38.262,38.0093,38.3685,38.0111,38.4539,38.8122,39.1413,38.9409,39.2043,39.3538,39.4123,39.3628,39.2825,39.1898,39.0431,39.0634,38.5993,38.252,37.3793,36.6334,36.4009,35.2822,34.4262,34.2119,34.1552,34.3325,33.9626,33.2661,32.3819,35.1959,36.7602,37.9039,37.8103,37.5832,37.9718,38.3111,38.9323,38.6763,39.1163,38.8469,39.805,40.2627,40.3689,40.4064,40.0558,40.815,41.0234,41.0128,41.0296,41.0927,40.7046,40.6775,40.2711,40.1283,39.7518,40.0145,40.0394,39.8461,39.6317,39.5548,39.1996,38.9861,38.8507,38.8603,38.483,38.4711,38.4214,38.4286,38.5766,38.7532,38.7905,38.6029,38.4635,38.1403,36.6844,36.616,36.4053,34.7934,34.0226,33.0505,33.4978,34.6106,35.284,35.7535,35.3541,35.5481,35.4086,35.7096,36.0526,36.1222,35.9408,36.1007,36.7952,36.99,37.1024,37.0993,37.3144,36.6951,37.1213,38.0026,38.1266,39.2538,38.8963,39.0158,38.6235,38.7908,38.6041,38.4489,38.3207,37.7398,38.5304,38.925,38.7249,38.9221,39.1704,39.5113,40.0613,39.3602,39.8689,39.973,40.0524,40.0025,40.7584,40.9714,40.9106,40.9685,40.6554,39.7314,39.0044,38.7183,38.5163,38.6101,38.2004,38.7606,38.7532,37.8903,37.8403,38.5368,39.0462,38.8279,39.0748,39.2907,38.5447,38.423,38.5624,38.476,38.5784,39.0905,39.379,39.4739,39.5774,40.7036,40.3044,39.6162,39.9967,40.0562,39.3426,38.666,38.7561,39.2823,38.8548,37.6214,37.8188,38.1086,38.3619,38.5472,38.1357,38.1422,37.95,37.1837,37.4636,36.8852,37.1617,37.5051,37.7724,38.0879,37.7197,38.0422,37.8551,38.5688,38.8388] )
    pattern = np.array( [ 38.6434,38.1409,37.3391,37.5457,37.7487,37.7499,37.6121,37.4789,37.5821,37.6541,38.0365,37.7907,37.9932,37.9945,37.7032,37.3556,37.6359,37.5412,37.5296,37.8829,38.3797,38.4452,39.0929,39.1233,39.3014,39.0317,38.903,38.8221,39.045,38.6944,39.0699,39.0978,38.9877,38.8123,38.7491,38.5888,38.7875,38.2086,37.7484,37.3961,36.8663,36.2607,35.8838,35.3297,35.5574,35.7239 ] )
    
    min( ( ( ( arr[i:i+len(pattern)] - pattern ) ** 2 ).mean(), i ) for i in range(len(arr)-len(pattern)+1) )
    
    Out[5]:
    (0.46259331934782588, 17) 
    

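    where 0.46 is the minimum mean squared error and 17 is the position of the minimum in arr.

What is your measure of "distance" between the patterns? RMS, Mahalanobis, Euclidean distance, or something else?

This might give you an idea, although it is not exactly what you need. You probably don't need the shift approximation, so a single pass is enough.

What are the constraints? Can there be gaps in the pattern, or do you only need to find the best start index?

Nico, can you upload the code from the article? The Microsoft drive says the file is missing.

If there are more degrees of freedom, e.g. there may be "gaps" when matching the pattern to the data, you can use a variant of the linked algorithm, using the squared error instead of the fixed cost for a substitution.

@user3794234 So, does this work for you, or is there something else?

These algorithms only match when the data in the cells lines up 1:1. But what if the data or the pattern is shifted and you can still see that they match? I uploaded an example chart:

@user3794234 Not sure what you mean. If you move the pattern a bit to the right, wouldn't that be a better fit?

Not sure, but an extension of minimum edit distance might help.

Please extend your question with more examples and a detailed description of what you really need and what you don't.

The solutions above compare the pattern and the data as if their sequences were the same size. As you can see in the new chart I added, the data may have an offset, and the sizes will not be the same either. Niko Schertler's solution with offsets and unmatched readings looks good, but the code is missing.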
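
Since the code for the gap-tolerant approach is missing, here is a minimal sketch of the idea mentioned in the comments: an edit-distance-style dynamic program where pairing two values costs their squared difference and skipping a value costs a fixed penalty. This is not Niko Schertler's code. The function name `match_with_gaps`, the parameter `gap_cost` and the toy arrays are assumptions made for illustration.

```python
def match_with_gaps(data, pattern, gap_cost):
    """Return (cost, start, end): pattern aligned against data[start:end]."""
    n, m = len(data), len(pattern)
    # prev[i] = (cost, start) for aligning the first j pattern values so that
    # the alignment ends just before data index i. For j = 0 the cost is zero
    # everywhere, which lets the match start at any data index for free.
    prev = [(0.0, i) for i in range(n + 1)]
    for j in range(1, m + 1):
        cur = [(j * gap_cost, 0)]  # pattern[:j] skipped before any data is used
        for i in range(1, n + 1):
            diff = data[i - 1] - pattern[j - 1]
            candidates = (
                (prev[i - 1][0] + diff * diff, prev[i - 1][1]),  # pair the two values
                (prev[i][0] + gap_cost, prev[i][1]),             # skip a pattern value
                (cur[i - 1][0] + gap_cost, cur[i - 1][1]),       # skip a data value
            )
            cur.append(min(candidates))
        prev = cur
    # The match may end anywhere in data, so take the cheapest end position.
    end = min(range(n + 1), key=lambda i: prev[i][0])
    cost, start = prev[end]
    return cost, start, end

# Toy data: the pattern occurs at index 2 with one extra reading (7.0) inside it.
toy_data = [9.0, 9.0, 1.0, 2.0, 7.0, 3.0, 4.0, 9.0]
toy_pattern = [1.0, 2.0, 3.0, 4.0]
print(match_with_gaps(toy_data, toy_pattern, gap_cost=0.5))
```

The running time is proportional to len(data) * len(pattern). The value of `gap_cost` has to be tuned to the data: a very large value approaches the plain sliding-window search, while a very small one lets the pattern skip almost anything.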