Python: how to create a binary tree from a frequency dictionary
I'm fairly new to coding, and I'm having a hard time building a Huffman algorithm to encode and decode text files. I understand most of the concepts well, but I can't find much concrete material on how to create and traverse the tree. Here is my code so far:
with open(input('enter a file: ')) as name:
    fh = name.read()
print(fh)

# create the frequency dictionary
freqdict = {}
for ch in fh:
    if ch in freqdict:
        freqdict[ch] += 1
    else:
        freqdict[ch] = 1

# sort into a list of (character, count) pairs, most frequent first
freqdict = sorted(freqdict.items(), key=lambda x: x[1], reverse=True)
print(freqdict)


class Node:
    # data comes first: a parameter without a default cannot follow ones with defaults
    def __init__(self, data, left=None, right=None):
        self.left = left
        self.right = right
        self.data = data

    def children(self):
        return (self.left, self.right)

    def nodes(self):
        return (self.left, self.right)

    def __str__(self):
        # str() takes a single object, so wrap both children in a tuple
        return str((self.left, self.right))
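Revised version:
This is a Huffman encoder/decoder for any message stored in txt.
It encodes the txt message into a short binary variable for storage. You can write compressed_binary to disk, and you can decode it again with decompressHuffmanCode, which recreates the original string from the compressed string in compressed_binary.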
from heapq import heappush, heappop, heapify
from collections import defaultdict
from functools import reduce


def encode(symb2freq):
    # each heap entry is [weight, [symbol, code], [symbol, code], ...]
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
    heapify(heap)
    while len(heap) > 1:
        lo = heappop(heap)
        hi = heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(sorted(heappop(heap)[1:], key=lambda p: (p, len(p[-1]))))


# recreates the original message from your huffman code table
# uncomment print(a) to see how it works
def decompressHuffmanCode(a, bit):
    # print(a)
    return ('', a[1] + s[a[0]+bit[0]]) if (a[0]+bit[0] in s) else (a[0]+bit[0], a[1])


txt = "CompresssionIsCoolWithHuffman"

# Create symbol to frequency table
symb2freq = defaultdict(int)
for ch in txt:
    symb2freq[ch] += 1

# Create Huffman code table from frequency table
enstr = encode(symb2freq)

# Create the reverse (code -> symbol) table that the decoder looks codes up in
s = dict((v, k) for k, v in dict(enstr).items())

# Create compressible binary. We add 1 to the front, and remove it when read from disk
compressed_binary = '1' + ''.join([enstr[item] for item in txt])

# Read compressible binary so we can uncompress it. We strip the first bit.
read_compressed_binary = compressed_binary[1:]

# Recreate the original message from read_compressed_binary
remainder, bytestr = reduce(decompressHuffmanCode, read_compressed_binary, ('', ''))
print(bytestr)
The result is:
CompresssionIsCoolWithHuffman
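The answer mentions storing compressed_binary on disk but does not show how. One possible approach, sketched below, packs the string of '0' and '1' characters into bytes through an integer. The helper names bits_to_bytes and bytes_to_bits are hypothetical, and the explanation for the leading '1' is an assumption: presumably it is there so that leading zero bits of the first code are not lost when the string is treated as a number.

```python
def bits_to_bytes(bit_string):
    """Pack a string of '0'/'1' characters (starting with a '1' sentinel) into bytes."""
    n_bytes = (len(bit_string) + 7) // 8          # round up to whole bytes
    return int(bit_string, 2).to_bytes(n_bytes, "big")


def bytes_to_bits(data):
    """Recover the bit string; bin() drops leading zeros, so the sentinel '1' comes first."""
    return bin(int.from_bytes(data, "big"))[2:]


compressed_binary = "1" + "0010110"               # sentinel + some example code bits
packed = bits_to_bytes(compressed_binary)         # this is what would be written to disk
restored = bytes_to_bits(packed)
read_compressed_binary = restored[1:]             # strip the sentinel again
print(packed, read_compressed_binary)
```

Without the sentinel, a payload such as "0001" would come back as "1", because the integer conversion cannot represent leading zeros.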
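Here is a quick implementation that should help. The buffering is something that could be handled programmatically, but I just wanted to show you a quick implementation that uses the frequency codes. I think Python's dictionary structure is enough to represent the tree and its nodes. You don't really need a separate class. To initialize all the nodes: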
def huffman_tree(freq_dict):
    vals = freq_dict.copy()
    nodes = {}
    for n in vals.keys():
        nodes[n] = []
Here we initialize a dictionary, nodes, to represent the nodes and leaves. Let's fill it with data; in the same function:
    while len(vals) > 1:
        # sort ascending by frequency and merge the two smallest entries
        s_vals = sorted(vals.items(), key=lambda x: x[1])
        a1 = s_vals[0][0]
        a2 = s_vals[1][0]
        vals[a1+a2] = vals.pop(a1) + vals.pop(a2)
        nodes[a1+a2] = [a1, a2]
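As you can see, I first sort the data from the frequency dictionary in ascending order. Doing the sort inside the while loop, later rather than sooner, gives you more freedom in which frequency dictionary you pass to your program. What the loop does is repeatedly take the two lowest-frequency items from the sorted freq_dict (in the code, its copy vals), add their frequencies together, and store the sum back under a combined key.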
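To make the merging step concrete, here is the same loop run on its own against a tiny, made-up frequency table. The merge_order list is an addition for illustration only, so the order of the merges can be inspected afterwards.

```python
vals = {"a": 5, "b": 2, "c": 1}        # made-up frequencies
nodes = {sym: [] for sym in vals}      # leaves start out with no children
merge_order = []                       # illustration only: records each new parent key

while len(vals) > 1:
    s_vals = sorted(vals.items(), key=lambda x: x[1])
    a1, a2 = s_vals[0][0], s_vals[1][0]          # the two least frequent entries
    vals[a1 + a2] = vals.pop(a1) + vals.pop(a2)  # replace them with their sum
    nodes[a1 + a2] = [a1, a2]                    # remember which children were merged
    merge_order.append(a1 + a2)

print(merge_order)   # ['cb', 'cba']
print(nodes["cba"])  # ['cb', 'a']
```

First 'c' (1) and 'b' (2) are merged into 'cb' (3), then 'cb' and 'a' (5) are merged into the root 'cba' (8).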
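Now we need to walk through our freq_dict and construct some kind of symbol dictionary that holds the rules for exchanging the text with symbols. Still in the same function: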
    symbols = {}   # this will keep our encoding rules
    root = a1+a2   # a1 and a2 are the last two entries merged,
                   # so their combined key is the root of the tree
    tree = label_nodes(nodes, root, symbols)

    return symbols, tree
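The line with tree = ... may look a bit magical here, but that is only because we haven't created the function yet. Imagine a function that recursively walks every node from the root down to the leaves, adding a '0' or '1' string prefix that builds up the encoded symbol (this is why we sort in ascending order: the most frequent symbols end up near the top and receive the shortest codes).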
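The body of label_nodes is not shown here. A minimal sketch that is consistent with how huffman_tree calls it, and with how huffman_decode reads the tree (nested dicts keyed by '0' and '1', with plain strings as leaves), might look like the following. The prefix parameter and the small hand-written nodes dictionary are assumptions made for illustration, not necessarily the author's code.

```python
def label_nodes(nodes, label, symbols, prefix=""):
    """Walk from `label` down to the leaves, filling `symbols` with symbol -> code."""
    children = nodes[label]
    if len(children) == 2:
        # internal node: left branch gets a '0', right branch gets a '1'
        return {
            "0": label_nodes(nodes, children[0], symbols, prefix + "0"),
            "1": label_nodes(nodes, children[1], symbols, prefix + "1"),
        }
    # leaf: the accumulated prefix is this symbol's code
    symbols[label] = prefix
    return label


# hand-written example in the same shape that huffman_tree builds
nodes = {"a": [], "b": [], "c": [], "cb": ["c", "b"], "cba": ["cb", "a"]}
symbols = {}
tree = label_nodes(nodes, "cba", symbols)
print(symbols)  # {'c': '00', 'b': '01', 'a': '1'}
print(tree)     # {'0': {'0': 'c', '1': 'b'}, '1': 'a'}
```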
This function does exactly that. Now we can use it:
def huffman_encode(string, symbols):
    return ''.join([symbols[str(e)] for e in string])

text = '''This is a simple text, made to illustrate how
a huff-man encoder works. A huff-man encoder works
best when the text is of reasonable length and has
repeating patterns in its language.'''

fd = freq_dict(text)
symbols, tree = huffman_tree(fd)
huffe = huffman_encode(text, symbols)
print(huffe)
Output: 001001011001111010001101110110110110100111101011001110011010010010000110110001100001101100100110111000101101110110010111101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101100100111000110010010010011100011001001001110001101101101101100101101101101101101101101101100111101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101101100011100111101100101001011111100110001101010000100010001111101110111110100001110001111100110111110011111100011100111010101010101010100100000000111011011111001100100010000110111100100001101100011000011011011101000110111011001010111100000101000111101110001010001000100100000110010000100011110110111100101101010001110100111001101000111001110101010101010101111000000110000010101011111010100011110101100110011010101011101100011100100000110111010100011101010110101010110011011001010101000111111111111011110101111010001111111
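The freq_dict helper called above is not defined in the answer. Assuming it simply counts how often each character occurs in the text, a sketch using collections.Counter from the standard library could be:

```python
from collections import Counter


def freq_dict(text):
    """Return a plain dict mapping each character to its number of occurrences."""
    return dict(Counter(text))


fd = freq_dict("abracadabra")
print(fd)  # {'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
```

Returning a plain dict keeps it compatible with the freq_dict.copy() and vals.pop(...) calls in huffman_tree.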
Decoding is a simple matter of traversing the tree:
def huffman_decode(encoded, tree, string=False):
    decoded = []
    i = 0
    while i < len(encoded):
        sym = encoded[i]
        label = tree[sym]
        # Continue until a leaf is reached
        while not isinstance(label, str):
            i += 1
            sym = encoded[i]
            label = label[sym]
        decoded.append(label)
        i += 1
    if string:
        return ''.join(decoded)
    return decoded

print(huffman_decode(huffe, tree, string=True))
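Out: This is a simple text, made to illustrate how
a huff-man encoder works. A huff-man encoder works
best when the text is of reasonable length and has
repeating patterns in its language.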
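This answer is largely lifted from my own GitHub.
So far there is nothing answerable in what you've posted... What are you trying to achieve, and why doesn't the code you posted do what you want? (What is it doing?)
The end of the post says it: what are the steps to create and traverse this tree? There isn't much out there that describes it clearly without simply handing me the answer.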