Python: incremental entropy computation for non-numeric data
I have a set of records, each with an ID, a timestamp, and a list of identifiers. I have to walk through the data, compute the entropy, and save some other links for the data. At every step more identifiers are added to the identifiers dictionary and I have to recompute the entropy and append it. The data set is large, and the program gets stuck because of the growing number of identifiers and the increasingly expensive entropy computation after each step. I read the solution below, but it is about data consisting of numbers. I copied two functions from that page, yet the incremental entropy computation gives a different value at every step than the classical full entropy computation.

Here is my code:
from math import log

# ---------------------------------------------------------------------#
# Functions copied from https://stackoverflow.com/questions/17104673/incremental-entropy-computation

# maps x to -x*log2(x) for x > 0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# entropy of the union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#

# Below is the input data (actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ["7","2008-01-06T02:13:00Z","x,y"]]

total_identifiers = {}   # Occurrences of identifiers; values are the number of occurrences
all_entropies = []       # Classical way of calculating entropy at every step
updated_entropies = []   # Incremental way of calculating entropy at every step

for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # Sum of all identifier counts so far
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # Previous entropy value
    for identifier in temp:
        S_new = len(temp)  # size of the new sample
        temp_dictionaty = {a: 1 for a in temp}  # Current identifiers and their occurrences
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1

    current_entropy = entropy(total_identifiers.values())  # Entropy for the current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)

    entropy_value = entropy(total_identifiers.values())  # Classical entropy for comparison; this step becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # Gives 12 while the sum is 14 ???
print("All Classical Entropies: ", all_entropies)     # print for comparison
print("All Updated Entropies: ", updated_entropies)
Another problem is that when I print "Sum of Total Identifiers", it gives 12 instead of 14! (Since the data set is very large, I read the actual file line by line and write the results straight to disk, without keeping anything in memory except the identifiers dictionary.)

The code above uses Theorem 4; as far as I can tell, you want Theorem 5 instead (from the next paragraph of the paper). Note, however, that if the number of identifiers really is the problem, then the incremental approach below will not help either: at some point the dictionary simply becomes too large.

Below you can find a proof-of-concept Python implementation that follows the description in the paper.
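The mismatch between the two entropy columns is exactly this Theorem 4 vs. Theorem 5 issue: Theorem 4 merges two samples that cover *disjoint* sets of groups, while the loop above keeps re-adding counts for groups that already exist. A small check with the same helpers makes this visible (the two-sample counts here are a hypothetical example, not from the original data):

```python
from math import log

# Helpers from the question, repeated so the snippet is self-contained.
h = lambda p: -p*log(p, 2) if p > 0 else 0

def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

def entropy(L):
    n = 1.0*sum(L)
    return sum(h(x/n) for x in L)

# Sample 1 has counts [2, 1] over groups (a, b); sample 2 has count [3]
# over a single group. Theorem 4 merges them as if the groups were disjoint:
merged = update(entropy([2, 1]), 3, entropy([3]), 3)

# If the new group really is distinct (say, group c), the result matches
# the classical entropy of the combined counts [2, 1, 3]:
print(abs(merged - entropy([2, 1, 3])) < 1e-9)   # True

# But if the second sample repeats group a, the true counts are [5, 1],
# and the Theorem 4 merge no longer agrees:
print(abs(merged - entropy([5, 1])) < 1e-9)      # False
```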
Thanks to @blazs for the entropy holder class; that solved the problem. The idea is to import entropy_holder.py from () and use it to keep the previous entropy and update it at every step when new identifiers arrive. The minimal working code looks like this:
import entropy_holder

input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ["7","2008-01-06T02:13:00Z","x,y"]]

entropy = entropy_holder.EntropyHolder()  # Holds the current entropy and the identifier counts
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])

print(entropy.entropy())
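The entropy_holder.py module itself is not reproduced on this page. As a rough sketch of what such a holder might look like (this is not blazs's actual implementation; only the `EntropyHolder`/`CountChange` interface above is taken from the snippet), one can maintain the identity H = log2(S) - T/S, where S is the total count and T = Σ c·log2(c) over the identifier counts c, so each update touches only the identifiers that changed instead of rescanning the whole dictionary:

```python
from collections import namedtuple
from math import log2

CountChange = namedtuple("CountChange", ["identifier", "delta"])

class EntropyHolder:
    """Incrementally maintains H = log2(S) - T/S, where S is the total
    count and T is the running sum of c*log2(c) over identifier counts c."""

    def __init__(self):
        self._counts = {}
        self._S = 0     # total number of observations
        self._T = 0.0   # running sum of c*log2(c)

    @staticmethod
    def _clog2c(c):
        # c*log2(c) for c > 0, and 0 otherwise
        return c * log2(c) if c > 0 else 0.0

    def update(self, changes):
        for change in changes:
            old = self._counts.get(change.identifier, 0)
            new = old + change.delta
            self._counts[change.identifier] = new
            # Replace the old identifier's contribution to T with the new one
            self._T += self._clog2c(new) - self._clog2c(old)
            self._S += change.delta

    def entropy(self):
        if self._S <= 0:
            return 0.0
        return log2(self._S) - self._T / self._S
```

The identity follows from expanding the classical formula: -Σ (c/S)·log2(c/S) = log2(S) - (1/S)·Σ c·log2(c), so the result agrees exactly with `entropy(total_identifiers.values())` on the final counts while each update costs only O(number of changed identifiers).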
The entropy computed with Blaz's incremental formula is very close to the entropy computed the classical way, and it avoids iterating over all the data again and again.