Python: incremental entropy computation for non-numeric data
I have a set of records, each with an ID, a timestamp, and a list of identifiers. I have to walk through the data, compute the entropy, and save some other links for the data. At every step more identifiers are added to the identifiers dictionary and I have to recompute the entropy and append it. The data set is large, and the program gets stuck because of the growing number of identifiers and the increasingly expensive entropy computation after each step. I read the solution below, but it is about data consisting of numbers. I copied two functions from that page, yet the incremental entropy computation gives a different value at every step than the classical full entropy computation.

Here is my code:
from math import log

# ---------------------------------------------------------------------#
# Functions copied from https://stackoverflow.com/questions/17104673/incremental-entropy-computation

# maps x to -x*log2(x) for x > 0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# entropy of the union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#

# Below is the input data (actually I read it from a csv file)
input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ["7","2008-01-06T02:13:00Z","x,y"]]

total_identifiers = {}   # Occurrences of identifiers; values are the number of occurrences
all_entropies = []       # Classical way of calculating entropy at every step
updated_entropies = []   # Incremental way of calculating entropy at every step

for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # Sum of all identifier counts so far
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # Previous entropy value
    for identifier in temp:
        S_new = len(temp)  # size of the new sample
        temp_dictionaty = {a: 1 for a in temp}  # Current identifiers and their occurrences
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1

    current_entropy = entropy(total_identifiers.values())  # Entropy for the current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)

    entropy_value = entropy(total_identifiers.values())  # Classical entropy for comparison; this step becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # Gives 12 while the sum is 14 ???
print("All Classical Entropies: ", all_entropies)     # print for comparison
print("All Updated Entropies: ", updated_entropies)
Another problem is that when I print "Sum of Total Identifiers", it gives 12 instead of 14! (Since the data set is very large, I read the actual file line by line and write the results straight to disk, without keeping anything in memory except the identifiers dictionary.)

The code above uses Theorem 4; as far as I can tell, you want Theorem 5 instead (from the next paragraph of the paper). Note, however, that if the number of identifiers really is the problem, then the incremental approach below will not help either: at some point the dictionary simply becomes too large.

Below you can find a proof-of-concept Python implementation that follows the description in the paper.
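The mismatch between the two entropy columns is exactly this Theorem 4 vs. Theorem 5 issue: Theorem 4 merges two samples that cover *disjoint* sets of groups, while the loop above keeps re-adding counts for groups that already exist. A small check with the same helpers makes this visible (the two-sample counts here are a hypothetical example, not from the original data):

```python
from math import log

# Helpers from the question, repeated so the snippet is self-contained.
h = lambda p: -p*log(p, 2) if p > 0 else 0

def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

def entropy(L):
    n = 1.0*sum(L)
    return sum(h(x/n) for x in L)

# Sample 1 has counts [2, 1] over groups (a, b); sample 2 has count [3]
# over a single group. Theorem 4 merges them as if the groups were disjoint:
merged = update(entropy([2, 1]), 3, entropy([3]), 3)

# If the new group really is distinct (say, group c), the result matches
# the classical entropy of the combined counts [2, 1, 3]:
print(abs(merged - entropy([2, 1, 3])) < 1e-9)   # True

# But if the second sample repeats group a, the true counts are [5, 1],
# and the Theorem 4 merge no longer agrees:
print(abs(merged - entropy([5, 1])) < 1e-9)      # False
```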
Thanks to @blazs for the entropy holder class; that solved the problem. The idea is to import entropy_holder.py from () and use it to keep the previous entropy and update it at every step when new identifiers arrive. The minimal working code looks like this:
import entropy_holder

input_data = [["1","2008-01-06T02:13:38Z","foo,bar"], ["2","2008-01-06T02:12:13Z","bar,blup"], ["3","2008-01-06T02:13:55Z","foo,bar"],
              ["4","2008-01-06T02:12:28Z","foo,xy"], ["5","2008-01-06T02:12:44Z","foo,bar"], ["6","2008-01-06T02:13:00Z","foo,bar"],
              ["7","2008-01-06T02:13:00Z","x,y"]]

entropy = entropy_holder.EntropyHolder()  # Holds the current entropy and the identifier counts
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])

print(entropy.entropy())
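The entropy_holder.py module itself is not reproduced on this page. As a rough sketch of what such a holder might look like (this is not blazs's actual implementation; only the `EntropyHolder`/`CountChange` interface above is taken from the snippet), one can maintain the identity H = log2(S) - T/S, where S is the total count and T = Σ c·log2(c) over the identifier counts c, so each update touches only the identifiers that changed instead of rescanning the whole dictionary:

```python
from collections import namedtuple
from math import log2

CountChange = namedtuple("CountChange", ["identifier", "delta"])

class EntropyHolder:
    """Incrementally maintains H = log2(S) - T/S, where S is the total
    count and T is the running sum of c*log2(c) over identifier counts c."""

    def __init__(self):
        self._counts = {}
        self._S = 0     # total number of observations
        self._T = 0.0   # running sum of c*log2(c)

    @staticmethod
    def _clog2c(c):
        # c*log2(c) for c > 0, and 0 otherwise
        return c * log2(c) if c > 0 else 0.0

    def update(self, changes):
        for change in changes:
            old = self._counts.get(change.identifier, 0)
            new = old + change.delta
            self._counts[change.identifier] = new
            # Replace the old identifier's contribution to T with the new one
            self._T += self._clog2c(new) - self._clog2c(old)
            self._S += change.delta

    def entropy(self):
        if self._S <= 0:
            return 0.0
        return log2(self._S) - self._T / self._S
```

The identity follows from expanding the classical formula: -Σ (c/S)·log2(c/S) = log2(S) - (1/S)·Σ c·log2(c), so the result agrees exactly with `entropy(total_identifiers.values())` on the final counts while each update costs only O(number of changed identifiers).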
The entropy computed with Blaz's incremental formula is very close to the entropy computed the classical way, and it avoids iterating over all the data again and again.