Computing incremental entropy of non-numeric data in Python


I have a set of data with an ID, a timestamp, and identifiers. I have to go through it, compute the entropy, and save some other links for the data. At each step more identifiers are added to the identifiers dictionary, and I have to re-compute the entropy and append it. I have a huge amount of data, and the program gets stuck because of the growing number of identifiers and the entropy computation after every step. I read the solution below, but it is about data consisting of numbers.

I copied two functions from that page, but the incremental entropy computation gives a different value at every step than the classic full entropy computation. Here is my code:

from math import log
# ---------------------------------------------------------------------#
# Functions copied from  https://stackoverflow.com/questions/17104673/incremental-entropy-computation
# maps x to -x*log2(x) for x>0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# entropy of union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#
# Below is the input data (Actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
          ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
          ["7","2008-01-06T02:13:00Z","x,y"]]
total_identifiers = {} # To store the occurrences of identifiers. Values shows the number of occurrences
all_entropies = []  # Classical way of calculating entropy at every step
updated_entropies = []  # Incremental way of calculating entropy at every step
for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # Sum of all identifiers
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # Get previous entropy calculation
    for identifier in temp:
        S_new = len(temp)  # sum of new samples
        temp_dictionary = {a: 1 for a in temp}  # Store current identifiers and their occurrence
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1
    current_entropy = entropy(total_identifiers.values())  # Entropy for current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)

    entropy_value = entropy(total_identifiers.values())  # Classical entropy calculation for comparison. This step becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # Gives 12 while the sum is 14 ???
print("All Classical Entropies:     ", all_entropies)  # print for comparison
print("All Updated Entropies:       ", updated_entropies)
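One source of the discrepancy is that `update()` merges two samples with *disjoint* symbol sets (the grouping property of entropy), while the loop above feeds it overlapping ones. A small check of this, reusing the same `h`/`update`/`entropy` helpers as in the question:

```python
from math import log

# Same helpers as in the question's code.
h = lambda p: -p * log(p, 2) if p > 0 else 0

def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

def entropy(L):
    n = 1.0 * sum(L)
    return sum([h(x / n) for x in L])

# Disjoint symbol sets: update() matches the classic computation.
H1, S1 = entropy([3, 2]), 5       # counts for symbols {a, b}
H2, S2 = entropy([1, 4]), 5       # counts for symbols {c, d}
combined = entropy([3, 2, 1, 4])  # union keeps all four symbols distinct
assert abs(update(H1, S1, H2, S2) - combined) < 1e-12

# Overlapping symbols: update() no longer matches, because it treats
# the two occurrences of the shared symbol as two distinct symbols.
H2b, S2b = entropy([1, 4]), 5     # counts for symbols {a, c}: 'a' repeats
overlap = entropy([3 + 1, 2, 4])  # the true union merges the shared 'a'
print(abs(update(H1, S1, H2b, S2b) - overlap) > 1e-6)  # True
```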

Another problem is that when I print "Sum of Total Identifiers", it gives 12 instead of 14! (Since the amount of data is very big, I read the actual file line by line and write the results directly to disk, storing nothing in memory besides the identifiers dictionary.)

The code above uses Theorem 4; it seems to me you want to use Theorem 5 instead (from the paper in the next paragraph).
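A Theorem-5-style step updates the entropy after a *single* count changes, instead of merging two whole samples. The following is my own sketch of such an O(1) update, derived from the grouping identity (the function name is mine, not from the paper), checked against the classic full recomputation:

```python
from math import log

def h(p):
    """Maps p to -p*log2(p) for p > 0, and to 0 otherwise."""
    return -p * log(p, 2) if p > 0 else 0.0

def entropy_after_increment(H_old, n, c):
    """Entropy after one more occurrence of a symbol whose current count
    is c (possibly 0 for a new symbol) is added to a sample of total
    size n. Runs in O(1) instead of O(number of symbols)."""
    if n == 0:
        return 0.0  # first observation: a single symbol has zero entropy
    n_new = n + 1
    # rescale the unchanged symbols, then swap in the changed symbol's term
    return (n / n_new) * (H_old - h(c / n)) \
        + ((n - c) / n_new) * log(n_new / n, 2) \
        + h((c + 1) / n_new)

# Incremental run over a symbol stream, tracking counts and entropy.
counts, n, H = {}, 0, 0.0
for sym in ["foo", "bar", "bar", "blup", "foo", "bar"]:
    c = counts.get(sym, 0)
    H = entropy_after_increment(H, n, c)
    counts[sym] = c + 1
    n += 1

# Matches the classic full recomputation.
classic = sum(h(v / n) for v in counts.values())
print(abs(H - classic) < 1e-9)  # True
```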

Note, however, that if the number of identifiers really is the problem, then the incremental approach below will not work either -- at some point the dictionary becomes too large.

Below you can find a proof-of-concept Python implementation that follows the description in


Thanks to @blazs for the entropy holder class. That solves the problem. So the idea is to import entropy_holder.py from () and use it to store the previous entropy and update it at every step when new identifiers come.

So the minimal working code looks like this:

import entropy_holder

input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
          ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
          ["7","2008-01-06T02:13:00Z","x,y"]]

entropy = entropy_holder.EntropyHolder() # This class will hold the current entropy and counts of identifiers
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])

print(entropy.entropy())
The entropy computed with Blaz's incremental formula is very close to the one computed in the classic way, and avoids iterating over all the data again and again.
