Repeated subsequence containing all items in Python


python python-2.7 machine-learning


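Imagine we have a list like this:

[255,7,0,0,255,7,0,0,255,7,0,0,255,7,0,0]

We want to find the shortest common subsequence (not substring) that contains all the items in the subsequence. In this case the subsequence would be

255,7,0,0

but we do not know the length of the pattern.

The program should also work when there is some gibberish in the middle, as in this sequence:

255,7,0,0,4,3,255,5,6,7,0,0,255,7,0,0,255,7,0,0,255,7,0,1,2,0

It should return the repeated subsequence, which is

255,7,0,0

I tried longest common subsequence, but because that algorithm is greedy it does not work in this case: it returns all of the matches rather than the shortest one. Your help is much appreciated.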
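To make the requirement concrete, here is a minimal sketch that counts how often a candidate pattern repeats as a subsequence when the list is scanned greedily from left to right. The name `count_occurrences` and the variables `clean` and `noisy` are made up for this illustration. It assumes the candidate pattern is already known, so it only checks a guess and does not discover the pattern.

```python
def count_occurrences(seq, pattern):
    """Count greedy, non-overlapping occurrences of `pattern` in `seq`.

    Items of `pattern` must appear in order, but other items
    ("gibberish") may sit in between them.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    j = 0  # index of the next pattern item we are waiting for
    for item in seq:
        if item == pattern[j]:
            j += 1
            if j == len(pattern):  # one full repetition matched
                count += 1
                j = 0
    return count


clean = [255, 7, 0, 0] * 4
noisy = [255, 7, 0, 0, 4, 3, 255, 5, 6, 7, 0, 0, 255, 7, 0, 0,
         255, 7, 0, 0, 255, 7, 0, 1, 2, 0]

print(count_occurrences(clean, [255, 7, 0, 0]))
print(count_occurrences(noisy, [255, 7, 0, 0]))
```

With the two lists from the question, the pattern 255,7,0,0 is matched four times in the clean list and five times in the noisy one, because the extra items are simply skipped.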

import numpy as np
cimport numpy as np
from libc.stdlib cimport *
from clcs cimport *

np.import_array()


def lcs_std(x, y):
    """Standard Longest Common Subsequence (LCS)
    algorithm as described in [Cormen01]_. Davide Albanese
    The elements of sequences must be coded as integers.

    :Parameters:
       x : 1d integer array_like object (N)
          first sequence
       y : 1d integer array_like object (M)
          second sequence
    :Returns:
       length : integer
          length of the LCS of x and y
       path : tuple of two 1d numpy array (path_x, path_y)
          path of the LCS
    """

    cdef np.ndarray[np.int_t, ndim=1] x_arr
    cdef np.ndarray[np.int_t, ndim=1] y_arr
    cdef np.ndarray[np.int_t, ndim=1] px_arr
    cdef np.ndarray[np.int_t, ndim=1] py_arr
    cdef char **b
    cdef int i
    cdef Path p
    cdef int length

    # np.int_ matches the np.int_t buffer type declared above
    # (the plain np.int alias was removed in NumPy 1.24).
    x_arr = np.ascontiguousarray(x, dtype=np.int_)
    y_arr = np.ascontiguousarray(y, dtype=np.int_)

    # Allocate the (N+1) x (M+1) table used to trace the path back.
    b = <char **> malloc((x_arr.shape[0]+1) * sizeof(char *))
    for i in range(x_arr.shape[0]+1):
        b[i] = <char *> malloc((y_arr.shape[0]+1) * sizeof(char))

    # Fill the table and get the LCS length (std comes from clcs).
    length = std(<long *> x_arr.data, <long *> y_arr.data, b,
                 <int> x_arr.shape[0], <int> y_arr.shape[0])

    # Recover the matched index pairs into p (trace comes from clcs).
    trace(b, <int> x_arr.shape[0], <int> y_arr.shape[0], &p)

    # Release the table.
    for i in range(x_arr.shape[0]+1):
        free(b[i])
    free(b)

    # Copy the C path into numpy arrays before freeing it.
    px_arr = np.empty(p.k, dtype=np.int_)
    py_arr = np.empty(p.k, dtype=np.int_)

    for i in range(p.k):
        px_arr[i] = p.px[i]
        py_arr[i] = p.py[i]

    free(p.px)
    free(p.py)

    return length, (px_arr, py_arr)
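For readers without the compiled `clcs` extension, the following is a pure-Python sketch of the textbook dynamic-programming LCS. It is not the code above: the function name `lcs` is made up here, and the real `std` and `trace` routines may differ in detail. It returns the same kind of result, a length and a pair of index paths.

```python
def lcs(x, y):
    """Return (length, (path_x, path_y)) of a longest common subsequence."""
    n, m = len(x), len(y)
    # table[i][j] holds the LCS length of x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to collect matched indices.
    path_x, path_y = [], []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            path_x.append(i - 1)
            path_y.append(j - 1)
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    path_x.reverse()
    path_y.reverse()
    return table[n][m], (path_x, path_y)


half = [255, 7, 0, 0] * 2
length, (path_x, path_y) = lcs(half, half)
print(length)
```

Comparing the two halves of the clean list gives a length of 8, that is, the whole half. This shows the behaviour described in the question: LCS by definition reports the longest match, so it never shrinks the result down to the 4-item pattern.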

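Take a look at

You seem to have reinvented frequent itemsets in sequences, for which I think about a dozen algorithms already exist.

Han, J.; Cheng, H.; Xin, D.; Yan, X. (2007). "Frequent pattern mining: current status and future directions". Data Mining and Knowledge Discovery 15 (1): 55–86.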
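As a rough illustration of the frequent-pattern idea, the sketch below grows candidate patterns one item at a time (a level-wise, Apriori-style search) and keeps those that occur as a subsequence in enough segments. It is not one of the algorithms from the survey. It also rests on a strong assumption that the question does not state: that every repetition starts with a known marker value (255 here), so the list can be cut into segments first. All function names are hypothetical.

```python
def split_at(seq, marker):
    """Cut `seq` into segments, starting a new one at each `marker` (assumed known)."""
    segments = []
    for item in seq:
        if item == marker or not segments:
            segments.append([])
        segments[-1].append(item)
    return segments


def is_subsequence(pattern, seq):
    """True if the items of `pattern` appear in `seq` in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of segments that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in database)


def frequent_patterns(database, min_support):
    """Return every pattern contained in at least `min_support` segments."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    items = sorted({x for s in database for x in s})
    frequent_items = [x for x in items if support([x], database) >= min_support]
    level = [[x] for x in frequent_items]
    result = []
    while level:
        result.extend(level)
        # Extend each surviving pattern by one frequent item and re-check support.
        level = [p + [x] for p in level for x in frequent_items
                 if support(p + [x], database) >= min_support]
    return result


noisy = [255, 7, 0, 0, 4, 3, 255, 5, 6, 7, 0, 0, 255, 7, 0, 0,
         255, 7, 0, 0, 255, 7, 0, 1, 2, 0]
database = split_at(noisy, 255)
patterns = frequent_patterns(database, min_support=len(database))
print(max(patterns, key=len))
```

On the noisy list from the question, the longest pattern present in all five segments is 255,7,0,0, while the one-off items (4, 3, 5, 6, 1, 2) are dropped at the first level. The search is exponential in the worst case, so this is only suitable for small inputs.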

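Please edit your question to include the code you tried. The example with "gibberish" in your second paragraph would not identify 255,7,0,0, because that sequence does not contain all the items in the sequence (the requirement in your first paragraph). Also, if

[255,7,0,0]

is in

[255,5,6,7,0]

why is

[255]

not a "subsequence", and obviously a shorter one?

@dbliss Isn't the longest one the entire sequence? Or at least half of the sequence, if "common" means "repeated"?

@dbliss No, the gibberish does not change that; it is still the entire/half sequence. A substring means consecutive elements; a subsequence only means elements in order, and elements may be skipped. He specifically mentioned subsequence, not substring.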
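The substring/subsequence distinction from the comments can be checked with two small helpers. This is only an illustrative sketch, and the function names are made up here.

```python
def is_substring(pattern, seq):
    """True if `pattern` occurs in `seq` as a run of consecutive elements."""
    n, m = len(seq), len(pattern)
    return any(seq[i:i + m] == pattern for i in range(n - m + 1))


def is_subsequence(pattern, seq):
    """True if the items of `pattern` occur in `seq` in order, gaps allowed."""
    j = 0  # index of the next pattern item to find
    for item in seq:
        if j < len(pattern) and item == pattern[j]:
            j += 1
    return j == len(pattern)


pattern = [255, 7, 0, 0]
print(is_subsequence(pattern, [255, 5, 6, 7, 0, 0]))  # gaps are allowed
print(is_substring(pattern, [255, 5, 6, 7, 0, 0]))    # must be contiguous
print(is_subsequence(pattern, [255, 5, 6, 7, 0]))     # only one 0 available
print(is_subsequence([255], [255, 5, 6, 7, 0]))
```

Against the question's segment 255,5,6,7,0,0 the pattern is a subsequence but not a substring. Against the shorter list quoted in the comment, which has only one 0, the full pattern is not even a subsequence, whereas [255] alone is. That is the point the first comment raises about shorter candidates.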