Python upper limit of predictability [python, entropy]

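I am trying to compute the upper limit of predictability of my occupancy dataset, as described in Song et al.'s paper "Limits of Predictability in Human Mobility". Basically, home (=1) and not at home (=0) play the role of the visited locations (towers) in Song's paper.

I tested my code (which I derived from other sources) on a random binary sequence, which should return an entropy of 1 and a predictability of 0.5. Instead, the returned entropy is 0.87 and the predictability is 0.71.

Here is my code: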

import numpy as np
from scipy.optimize import fsolve
from cmath import log 
import math

def matchfinder(data):
    data_len = len(data)    
    output = np.zeros(len(data))
    output[0] = 1

    # Using L_{n} definition from
    #"Nonparametric Entropy Estimation for Stationary Process and Random Fields, with Applications to English Text"
    # by Kontoyiannis et. al.
    # $L_{n} = 1 + max \{l :0 \leq l \leq n, X^{l-1}_{0} = X^{-j+l-1}_{-j} \text{ for some } l \leq j \leq n \}$

    # for each position, i, in the sub-sequence that occurs before the current position, start_idx
    # check to see the maximum continuously equal string we can make by simultaneously extending from i and start_idx

    for start_idx in range(1,data_len):
        max_subsequence_matched = 0
        for i in range(0,start_idx):
            #    for( int i = 0; i < start_idx; i++ )
            #    {
            j = 0

            #increase the length of the substring starting at j and start_idx
            #while they are the same keeping track of the length
            while( (start_idx+j < data_len) and (i+j < start_idx) and (data[i+j] == data[start_idx+j]) ):
                j = j + 1

            if j > max_subsequence_matched:     
                max_subsequence_matched = j;

        #L_{n} is obtained by adding 1 to the longest match-length
        output[start_idx] = max_subsequence_matched + 1;    

    return output

if __name__ == '__main__':
    #Read dataset            
    data = np.random.randint(2,size=2000)

    #Number of distinct locations
    N = len(np.unique(data))

    #True entropy
    lambdai = matchfinder(data)
    Etrue = math.pow(sum( [ lambdai[i] / math.log(i+1,2) for i in range(1,len(data))] ) * (1.0/len(data)),-1)

    S = Etrue
    #use Fano's inequality to compute the predictability
    func = lambda x: (-(x*log(x,2).real+(1-x)*log(1-x,2).real)+(1-x)*log(N-1,2).real ) - S 
    ub = fsolve(func, 0.9)[0]
    print(ub)

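The matchfinder function finds the entropy by looking for the longest match and adding 1 to it (= the shortest substring not previously seen). The predictability is then computed using Fano's inequality.

What could be the problem?

Thanks!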

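The entropy function seems to be wrong. In the paper you mentioned, Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010), Limits of predictability in human mobility, Science, 327(5968), 1018–1021, the real entropy is estimated by an algorithm based on Lempel-Ziv data compression:

$$S^{est} = \left( \frac{1}{n} \sum_{i} \Lambda_i \right)^{-1} \ln n$$

In code it would look like this:

Etrue = math.pow((np.sum(lambdai)/ n),-1)*log(n,2).real

where n is the length of the time series.

Note that we used a different logarithm base than in the given formula. However, since the logarithm in Fano's inequality has base 2, it seems logical to use the same base for the entropy calculation. Also, I'm not sure why you started the sum from the first index instead of the zeroth.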
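
For reference, the Fano step that `func` encodes in the question's code can be written out as below. This is a transcription of that lambda, not a quotation from the paper; $\Pi^{\max}$ is a symbol chosen here for the predictability upper bound, $S$ is the entropy estimate in bits, and $N$ is the number of distinct locations.

```latex
S = H(\Pi^{\max}) + (1 - \Pi^{\max}) \log_2 (N - 1),
\qquad
H(\Pi^{\max}) = -\Pi^{\max} \log_2 \Pi^{\max} - (1 - \Pi^{\max}) \log_2 (1 - \Pi^{\max})
```

For $N = 2$ the second term vanishes, so $S = 1$ bit gives $\Pi^{\max} = 0.5$, which is the value the random binary test is expected to produce.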

Now wrapping it up into a function, for example:

def solve(locations, size):
    # Random test sequence drawn uniformly from `locations` symbols
    data = np.random.randint(locations, size=size)
    N = len(np.unique(data))
    n = float(len(data))
    print("Distinct locations: %i" % N)
    print("Time series length: %i" % n)

    # True entropy (Lempel-Ziv estimate, log base 2)
    lambdai = matchfinder(data)
    #S = math.pow(sum([lambdai[i] / math.log(i + 1, 2) for i in range(1, len(data))]) * (1.0 / len(data)), -1)
    Etrue = math.pow((np.sum(lambdai) / n), -1) * log(n, 2).real
    S = Etrue
    print("Maximum entropy: %2.5f" % log(locations, 2).real)
    print("Real entropy: %2.5f" % S)

    # Use Fano's inequality to compute the predictability
    func = lambda x: (-(x * log(x, 2).real + (1 - x) * log(1 - x, 2).real) + (1 - x) * log(N - 1, 2).real) - S
    ub = fsolve(func, 0.9)[0]
    print("Upper bound of predictability: %2.5f" % ub)
    return ub

Output for two locations:

Distinct locations: 2
Time series length: 10000
Maximum entropy: 1.00000
Real entropy: 1.01441
Upper bound of predictability: 0.50013

Output for three locations:

Distinct locations: 3
Time series length: 10000
Maximum entropy: 1.58496
Real entropy: 1.56567
Upper bound of predictability: 0.41172

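The Lempel-Ziv estimate converges to the real entropy only as n approaches infinity. For a finite series it can overshoot, which is why in the two-location case it is slightly above the maximum limit.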
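
Because a finite-sample estimate can land above log2(N), the Fano equation may have no exact root in that case, and a root finder started from a guess just settles near the closest point. A minimal sketch of one way to make that step explicit is shown below: it clamps the entropy to [0, log2(N)] and then bisects on [1/N, 1], where the Fano expression decreases from log2(N) to 0. The names `fano_gap` and `max_predictability` and the clamping choice are assumptions made for this illustration, not part of the original code or of the paper.

```python
import math


def fano_gap(p, entropy, n_locations):
    """Fano expression at predictability p minus the target entropy (bits)."""
    binary_entropy = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # treat 0 * log2(0) as 0
            binary_entropy -= q * math.log2(q)
    return binary_entropy + (1.0 - p) * math.log2(n_locations - 1) - entropy


def max_predictability(entropy, n_locations, tol=1e-10):
    """Solve the Fano equation for p by bisection on [1/N, 1].

    Assumption for this sketch: an entropy outside [0, log2(N)] is clamped
    to that range, so an overshooting estimate maps to p = 1/N.
    """
    if n_locations < 2:
        raise ValueError("need at least two distinct locations")
    entropy = min(max(entropy, 0.0), math.log2(n_locations))
    lo, hi = 1.0 / n_locations, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fano_gap(mid, entropy, n_locations) > 0.0:
            lo = mid  # expression still above the target: root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With this sketch, `max_predictability(1.01441, 2)` is clamped to the 1-bit case and returns approximately 0.5, and `max_predictability(1.56567, 3)` returns approximately 0.4117, in line with the three-location output above.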

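I'm also not sure that you interpreted the definition of lambda correctly. It is defined as "the length of the shortest substring starting at position i which doesn't previously appear from position 1 to i-1". So once we reach a point where the remaining substrings are no longer unique, your matching algorithm always reports a length one greater than the length of the matched substring, whereas it should equal 0, since no unique substring exists.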

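To make this clearer, let's take a simple example. If the array of locations looks like this:

[1 0 0 1 0 0]

then we can see that after the first three positions the pattern repeats. That means that from the fourth position onward the shortest unique substring does not exist, so lambda equals 0 there. The output (lambda) should therefore look like this:

[1 1 2 0 0 0]

However, your function in this case would return:

[1 1 2 4 3 2]

I rewrote your match function to handle that problem: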

def matchfinder2(data):
    data_len = len(data)
    output = np.zeros(len(data))
    output[0] = 1
    for start_idx in range(1, data_len):
        max_subsequence_matched = 0
        for i in range(0, start_idx):
            j = 0
            end_distance = data_len - start_idx  # length left to the end of sequence (including current index)
            while (start_idx + j < data_len) and (i + j < start_idx) and (data[i + j] == data[start_idx + j]):
                j = j + 1
            if j == end_distance:  # check if j has reached the end of sequence
                output[start_idx::] = np.zeros(end_distance)  # if yes, fill the rest of output with zeros
                return output  # end function
            elif j > max_subsequence_matched:
                max_subsequence_matched = j
        output[start_idx] = max_subsequence_matched + 1
    return output


The difference is of course small, because the result changes only for a small part of the sequence.