Python: how to create a binary tree from a frequency dictionary


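I'm fairly new to coding, and I'm struggling to write a Huffman algorithm that encodes and decodes text files. I understand most of the concepts well, but I can't find much concrete material on how to actually create and traverse the tree.

Here is my code so far: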

with open(input('enter a file: ')) as name:
    fh = name.read()
    print(fh)

# create the frequency dictionary
freqdict = {}
for ch in fh:
    if ch in freqdict:
        freqdict[ch] += 1
    else:
        freqdict[ch] = 1
freqdict = sorted(freqdict.items(), key=lambda x: x[1], reverse=True)
print(freqdict)

class Node:
    # data comes first: a parameter without a default cannot follow ones with defaults
    def __init__(self, data, left=None, right=None):
        self.left = left
        self.right = right
        self.data = data

    def children(self):
        return (self.left, self.right)

    def nodes(self):
        return (self.left, self.right)

    def __str__(self):
        # str() accepts a single object, so format both children explicitly
        return f"({self.left}, {self.right})"
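Revised version:

This is a Huffman encoder/decoder for any message stored in txt.

It encodes the txt message into a short binary string for storage (you can write compressed_binary to disk). You can also decode compressed_binary with decompressHuffmanCode, which recreates the original string from the compressed bit string.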

from heapq import heappush, heappop, heapify
from collections import defaultdict
from functools import reduce

def encode(symb2freq):
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
    heapify(heap)
    while len(heap) > 1:
        lo = heappop(heap)
        hi = heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(sorted(heappop(heap)[1:], key=lambda p: (p, len(p[-1]))))

# recreates the original message from your huffman code table 
# uncomment print(a) to see how it works
def decompressHuffmanCode(a, bit):
    # print(a)
    return ('', a[1] + s[a[0]+bit[0]]) if (a[0]+bit[0] in s) else (a[0]+bit[0], a[1])

txt="CompresssionIsCoolWithHuffman"

# Create symbol to frequency table
symb2freq = defaultdict(int)
for ch in txt:
    symb2freq[ch] += 1
enstr=encode(symb2freq)

# Create Huffman code table from frequency table
s=dict((v,k) for k,v in dict(enstr).items())

# Create compressible binary. We add 1 to the front, and remove it when read from disk
compressed_binary = '1' + ''.join([enstr[item] for item in txt])

# Read compressible binary so we can uncompress it. We strip the first bit.
read_compressed_binary = compressed_binary[1:]

# Recreate the original message from read_compressed_binary
remainder,bytestr = reduce(decompressHuffmanCode, read_compressed_binary, ('', ''))
print(bytestr)
The result is:

CompresssionIsCoolWithHuffman
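
The code comments above mention prepending a '1' to compressed_binary before storing it. One reason such a guard bit is useful: once the bit string is packed into an integer or bytes, any leading zeros would otherwise be lost. The following is only a sketch of how the packing could be done; the helper names bits_to_bytes and bytes_to_bits are hypothetical and are not part of the answer.

```python
def bits_to_bytes(bits):
    """Pack a '0'/'1' string that starts with a '1' guard bit into bytes."""
    n = int(bits, 2)
    return n.to_bytes((n.bit_length() + 7) // 8, "big")


def bytes_to_bits(data):
    """Inverse of bits_to_bytes: the result starts at the guard bit."""
    return bin(int.from_bytes(data, "big"))[2:]


# Example payload with leading zeros after the guard bit
compressed_binary = "1" + "0001011"
packed = bits_to_bytes(compressed_binary)

# The guard bit keeps the leading zeros of the payload intact
assert bytes_to_bits(packed) == compressed_binary
payload = bytes_to_bits(packed)[1:]  # strip the guard bit again
print(payload)
```

The packed value can then be written with an ordinary binary file handle (open(path, "wb")) and read back the same way.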

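This is a quick implementation that should help. The buffer is something that could be handled programmatically, but I just wanted to show you a quick implementation that uses the frequency codes.

I think Python's dictionary structure is enough to represent both the tree and its nodes. You don't really need a separate class.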
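
The functions that follow take a frequency dictionary as input, and the usage example later on calls a freq_dict(text) helper that is never defined in this answer. A minimal sketch, assuming the helper does nothing more than count characters:

```python
from collections import Counter


def freq_dict(text):
    """Assumed helper: map each character of text to its number of occurrences."""
    return dict(Counter(text))


print(freq_dict("abca"))  # {'a': 2, 'b': 1, 'c': 1}
```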

To initialize all the nodes:

def huffman_tree(freq_dict):
    vals = freq_dict.copy()
    nodes = {}
    for n in vals.keys():
        nodes[n] = []
Here we initialize a dictionary, nodes, to represent both internal nodes and leaves. Let's fill it with data; still in the same function:

    while len(vals) > 1:
        s_vals = sorted(vals.items(), key=lambda x:x[1]) 
        a1 = s_vals[0][0]
        a2 = s_vals[1][0]
        vals[a1+a2] = vals.pop(a1) + vals.pop(a2)
        nodes[a1+a2] = [a1, a2]
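As you can see, I sort the data from the frequency dictionary in ascending order first. Doing this late, inside the while loop, rather than up front gives you more freedom in which frequency dictionary you pass to your program. What the loop does is take the two least frequent items from the sorted freq_dict (strictly speaking, from its copy vals), add their frequencies together, and store the sum back under a combined key.

Now we need to walk through the result and build some kind of symbol dictionary: the rule set used to swap characters of the text for code symbols. Still in the same function: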
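
To make the merge step concrete, here is a small standalone sketch of the same loop run on a three-symbol dictionary. The function name merge_nodes and the sample frequencies are assumptions for illustration; the loop body mirrors the one above, and the input is assumed to be non-empty.

```python
def merge_nodes(freq):
    """Repeatedly merge the two least frequent entries of a non-empty dict."""
    vals = dict(freq)
    nodes = {sym: [] for sym in vals}   # leaves have no children
    root = next(iter(vals))             # covers the single-symbol case
    while len(vals) > 1:
        s_vals = sorted(vals.items(), key=lambda x: x[1])
        a1, a2 = s_vals[0][0], s_vals[1][0]
        vals[a1 + a2] = vals.pop(a1) + vals.pop(a2)
        nodes[a1 + a2] = [a1, a2]       # the merged key remembers its children
        root = a1 + a2
    return nodes, root


nodes, root = merge_nodes({'a': 5, 'b': 2, 'c': 1})
print(root)   # cba
print(nodes)  # {'a': [], 'b': [], 'c': [], 'cb': ['c', 'b'], 'cba': ['cb', 'a']}
```

The two rarest symbols, c and b, are merged first and so end up deepest in the tree, while a is attached directly under the root.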

    symbols = {} # this will hold our encoding rules (symbol -> bit string)
    root = a1+a2 # a1 and a2 are the last pair merged,
                 # so their combined key is the root of the tree
    tree = label_nodes(nodes, root, symbols)

    return symbols, tree
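The line with tree = ... may look a bit magical here, but that is only because we haven't written that function yet. Imagine a function that recursively walks every node from the root down to the leaves, adding a "0" or "1" to the prefix string that becomes the encoded symbol (this is why we sort in ascending order: the most frequent symbols end up near the top and receive the shortest codes):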
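
The definition of label_nodes did not survive on this page. The following is a minimal sketch that is consistent with how it is called above and with how huffman_decode reads the tree (nested dictionaries keyed by '0' and '1', with plain strings as leaves). The call signature comes from the code above; the body and the prefix parameter are assumptions.

```python
def label_nodes(nodes, label, symbols, prefix=''):
    """Walk from label down to the leaves, recording each leaf's bit string."""
    children = nodes[label]
    if len(children) == 2:
        # Internal node: the first child gets a '0', the second a '1'
        return {
            '0': label_nodes(nodes, children[0], symbols, prefix + '0'),
            '1': label_nodes(nodes, children[1], symbols, prefix + '1'),
        }
    # Leaf: the accumulated prefix is the code for this symbol
    symbols[label] = prefix
    return label


# Hand-built nodes dictionary for the frequencies {'a': 5, 'b': 2, 'c': 1}
nodes = {'a': [], 'b': [], 'c': [], 'cb': ['c', 'b'], 'cba': ['cb', 'a']}
symbols = {}
tree = label_nodes(nodes, 'cba', symbols)
print(symbols)  # {'c': '00', 'b': '01', 'a': '1'}
print(tree)     # {'0': {'0': 'c', '1': 'b'}, '1': 'a'}
```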

That is exactly what this function does. Now we can use it:

def huffman_encode(string, symbols):
    return ''.join([symbols[str(e)] for e in string])

text =  '''This is a simple text, made to illustrate how
        a huff-man encoder works. A huff-man encoder works
        best when the text is of reasonable length and has
        repeating patterns in its language.'''

fd = freq_dict(text)    
symbols, tree = huffman_tree(fd)    
huffe = huffman_encode(text, symbols)
print(huffe)
Output:

Decoding is a simple matter of walking the tree:

def huffman_decode(encoded, tree, string=False):
    decoded = []
    i = 0
    while i < len(encoded):
        sym = encoded[i]
        label = tree[sym]
        # Continue until a leaf (a plain string) is reached
        while not isinstance(label, str):
            i += 1
            sym = encoded[i]
            label = label[sym]
        decoded.append(label)
        i += 1
    if string:
        return ''.join(decoded)
    return decoded

print(huffman_decode(huffe, tree, string=True))
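Out: This is a simple text, made to illustrate how a huff-man encoder works. A huff-man encoder works best when the text is of reasonable length and has repeating patterns in its language.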
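
The same traversal can also be written with a single cursor that resets to the root after each leaf. This is an alternative formulation, not the answer's own code: the name decode_bits, the tiny hand-written tree, and the error raised on truncated input are all assumptions for illustration.

```python
def decode_bits(encoded, tree):
    """Follow one bit at a time; emit a symbol whenever a leaf is reached."""
    decoded = []
    node = tree
    for bit in encoded:
        node = node[bit]
        if isinstance(node, str):   # leaf reached
            decoded.append(node)
            node = tree             # start again from the root
    if node is not tree:
        raise ValueError("bit string ends in the middle of a code word")
    return ''.join(decoded)


# Hand-written tree and matching code table for three symbols
tree = {'0': {'0': 'c', '1': 'b'}, '1': 'a'}
symbols = {'c': '00', 'b': '01', 'a': '1'}

bits = ''.join(symbols[ch] for ch in 'abca')
print(bits)                     # 101001
print(decode_bits(bits, tree))  # abca
```

Because no code word is a prefix of another, the decoder never needs a separator between symbols.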


This answer is largely lifted from my own GitHub:

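So far there is no answerable question in what you've posted... What are you trying to achieve, and why doesn't the code you posted do what you want? (What does it do instead?)

@ The end of the post asks: what are the steps for creating and traversing this tree? There isn't much out there that describes it clearly without just handing me a finished answer.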