Python: How to create a binary tree from a frequency dictionary (tags: python, huffman-code)


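I'm fairly new to coding, and I'm having a hard time creating a Huffman algorithm to encode and decode text files. I understand most of the concepts well, but I haven't found much concrete material on how to create and traverse the tree.

Here is my code so far: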

with open(input('enter a file: ')) as name:
    fh = name.read()
    print(fh)

# create the frequency dictionary
freqdict = {}
for ch in fh:
    if ch in freqdict:
        freqdict[ch] += 1
    else:
        freqdict[ch] = 1
freqdict = sorted(freqdict.items(), key=lambda x: x[1], reverse=True)
print(freqdict)

class Node:
    # parameters without defaults must come before those with defaults
    def __init__(self, data, left=None, right=None):
        self.left = left
        self.right = right
        self.data = data

    def children(self):
        return (self.left, self.right)

    def nodes(self):
        return (self.left, self.right)

    def __str__(self):
        # str() accepts a single object, so format both children explicitly
        return f"({self.left}, {self.right})"

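Revised version:

This is a Huffman encoder/decoder for any message in `txt`.

It encodes the txt message into a short binary variable for storage (you can store compressed_binary to disk). You can also decode compressed_binary with decompressHuffmanCode, which recreates the original string from the compressed string in compressed_binary.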

from heapq import heappush, heappop, heapify
from collections import defaultdict
from functools import reduce

def encode(symb2freq):
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
    heapify(heap)
    while len(heap) > 1:
        lo = heappop(heap)
        hi = heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(sorted(heappop(heap)[1:], key=lambda p: (p, len(p[-1]))))

# recreates the original message from your huffman code table 
# uncomment print(a) to see how it works
def decompressHuffmanCode(a, bit):
    # print(a)
    return ('', a[1] + s[a[0]+bit[0]]) if (a[0]+bit[0] in s) else (a[0]+bit[0], a[1])

txt="CompresssionIsCoolWithHuffman"

# Create symbol to frequency table
symb2freq = defaultdict(int)
for ch in txt:
    symb2freq[ch] += 1
enstr=encode(symb2freq)

# Create Huffman code table from frequency table
s=dict((v,k) for k,v in dict(enstr).items())

# Create compressible binary. We add 1 to the front, and remove it when read from disk
compressed_binary = '1' + ''.join([enstr[item] for item in txt])

# Read compressible binary so we can uncompress it. We strip the first bit.
read_compressed_binary = compressed_binary[1:]

# Recreate the compressed message from read_compressed_binary
remainder,bytestr = reduce(decompressHuffmanCode, read_compressed_binary, ('', ''))
print(bytestr)

The result is:

CompresssionIsCoolWithHuffman
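
The comment about adding a `1` to the front is worth unpacking: if the bit string is converted to an integer before being written to disk, any leading `0` bits would be dropped, and the sentinel `1` keeps them. Below is a minimal sketch of that round trip. The helper names `bits_to_bytes` and `bytes_to_bits` are hypothetical and not part of the answer above, and here the sentinel is added inside the helper rather than by the caller.

```python
def bits_to_bytes(bits):
    """Pack a string of '0'/'1' characters into bytes.

    A sentinel '1' is prepended so leading zeros survive the int conversion.
    """
    n = int('1' + bits, 2)
    return n.to_bytes((n.bit_length() + 7) // 8, 'big')


def bytes_to_bits(data):
    """Inverse of bits_to_bytes: unpack and strip the sentinel bit."""
    n = int.from_bytes(data, 'big')
    return bin(n)[3:]  # drop the '0b' prefix and the sentinel '1'


packed = bits_to_bytes('000101')
print(packed, bytes_to_bits(packed))
```

Without the sentinel, `'000101'` would come back as `'101'` and the decoder would read the wrong codes.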

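Here is a quick implementation that should help. One thing that could be handled programmatically is the buffer, but I just wanted to show you a quick implementation that uses the frequency codes. I think Python's dictionary structure is enough to represent the tree and its nodes. You don't really need a separate class.

To initialize all the nodes: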

def huffman_tree(freq_dict):
    vals = freq_dict.copy()
    nodes = {}
    for n in vals.keys():
        nodes[n] = []

Here we initialize a dictionary, `nodes`, to represent the nodes and leaves. Let's fill it with data; in the same function:

    while len(vals) > 1:
        s_vals = sorted(vals.items(), key=lambda x:x[1]) 
        a1 = s_vals[0][0]
        a2 = s_vals[1][0]
        vals[a1+a2] = vals.pop(a1) + vals.pop(a2)
        nodes[a1+a2] = [a1, a2]

    symbols = {} # this will keep our encoding-rules
    root = a1+a2 # a1 and a2 is our last visited data,
                 # therefore the two largest values
    tree = label_nodes(nodes, root, symbols)

    return symbols, tree

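As you can see, the data from the frequency dictionary is sorted in ascending order before anything else happens. Doing the sort inside the while loop, later rather than sooner, leaves you freer in which frequency dictionary you pass to your program. Beyond that, what happens here is that we take the two lowest-frequency items from the sorted freq_dict (in the code, its copy `vals`), add them together, and store the sum back in the dictionary.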
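
To make the merge step concrete, here is a small trace of the same loop on a three-symbol dictionary. `merge_steps` is a hypothetical helper written only for this illustration; it repeats the loop from `huffman_tree` and records each merge.

```python
def merge_steps(freq_dict):
    """Run the merge loop on a copy of freq_dict and record each merge."""
    vals = freq_dict.copy()
    nodes = {n: [] for n in vals}  # every symbol starts as a leaf
    steps = []
    while len(vals) > 1:
        # Sort ascending by frequency and take the two smallest entries
        s_vals = sorted(vals.items(), key=lambda x: x[1])
        a1, a2 = s_vals[0][0], s_vals[1][0]
        # Replace them with one combined entry whose weight is their sum
        vals[a1 + a2] = vals.pop(a1) + vals.pop(a2)
        nodes[a1 + a2] = [a1, a2]
        steps.append((a1, a2, vals[a1 + a2]))
    return nodes, steps


nodes, steps = merge_steps({'a': 5, 'b': 2, 'c': 1})
for a1, a2, weight in steps:
    print(f"merge {a1!r} + {a2!r} -> {a1 + a2!r} (weight {weight})")
```

The two rarest symbols, `c` and `b`, are merged first into `cb`, and that node is then merged with `a` to form the root `cba`.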
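Next we need to walk our nodes and build a symbol dictionary: the rule set for swapping characters in the text for code symbols. This is still done in the same function, in the lines from `symbols = {}` to `return symbols, tree` above.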

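The line with `tree = ...` may look a bit magical here, but that is only because we haven't created the function yet. Imagine a function that recursively walks every node from the root down to the leaves, adding a '0' or a '1' to a string prefix that becomes the encoding symbol (this is why we sorted in ascending order: the most frequent characters end up near the top and receive the shortest encoding symbols):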
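
The definition of `label_nodes` is not reproduced in this post. A minimal sketch that is consistent with how `huffman_tree` calls it and how `huffman_decode` reads its result might look like the following. The `prefix` parameter and the nested-dict shape of the tree (keys `'0'` and `'1'`, leaves stored as plain strings) are assumptions inferred from the surrounding code, not the author's confirmed implementation.

```python
def label_nodes(nodes, label, symbols, prefix=''):
    """Recursively build a nested-dict tree and fill `symbols` with codes.

    nodes   -- maps each node name to [] (leaf) or [left, right] (internal)
    label   -- name of the node currently being visited
    symbols -- dict filled in place with {leaf: bit string}
    prefix  -- bits collected on the path from the root (assumed parameter)
    """
    children = nodes[label]
    if len(children) == 2:
        # Internal node: the left branch gets '0', the right branch gets '1'
        return {
            '0': label_nodes(nodes, children[0], symbols, prefix + '0'),
            '1': label_nodes(nodes, children[1], symbols, prefix + '1'),
        }
    # Leaf: record its code and return the symbol itself
    symbols[label] = prefix
    return label


# Hand-built node table for the frequencies {'a': 5, 'b': 2, 'c': 1}
nodes = {'a': [], 'b': [], 'c': [], 'cb': ['c', 'b'], 'cba': ['cb', 'a']}
symbols = {}
tree = label_nodes(nodes, 'cba', symbols)
print(symbols)
print(tree)
```

In this small example the most frequent symbol, `a`, sits directly under the root and gets the one-bit code `1`, while `c` and `b` get the two-bit codes `00` and `01`.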
That is exactly what the function does. Now we can use it:

def huffman_encode(string, symbols):
    return ''.join([symbols[str(e)] for e in string])

text = '''This is a simple text, made to illustrate how a huff-man encoder works. A huff-man encoder works best when the text is of reasonable length and has repeating patterns in its language.'''
# freq_dict() is not defined in this post; it is assumed to return a {character: count} dict
fd = freq_dict(text)
symbols, tree = huffman_tree(fd)
huffe = huffman_encode(text, symbols)
print(huffe)
Output (a long string of 0s and 1s, truncated here):

0010010110011110100011011101101101101001111010110011100110100100...
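
The snippet above calls `freq_dict(text)`, which is not shown in this post. A minimal sketch, assuming it simply maps each character to the number of times it occurs, could be written with `collections.Counter`:

```python
from collections import Counter


def freq_dict(text):
    """Map each character in text to the number of times it occurs."""
    return dict(Counter(text))


print(freq_dict('hello'))
```

Any function that returns a plain `{character: count}` dictionary would work just as well, since `huffman_tree` only copies it and reads its keys and values.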
Decoding is a simple matter of traversing the tree:

def huffman_decode(encoded, tree, string=False):
    decoded = []
    i = 0
    while i < len(encoded):
        sym = encoded[i]
        label = tree[sym]
        # Continue until a leaf is reached
        while not isinstance(label, str):
            i += 1
            sym = encoded[i]
            label = label[sym]
        decoded.append(label)
        i += 1
    if string:
        return ''.join(decoded)
    return decoded

print(huffman_decode(huffe, tree, string=True))

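Out: This is a simple text, made to illustrate how a huff-man encoder works. A huff-man encoder works best when the text is of reasonable length and has repeating patterns in its language.

This answer is largely borrowed from my own GitHub.

So far there is no answerable question in what you've posted... What are you trying to achieve, and why doesn't the code you posted do what you want? (What is it doing instead?)

The end of the post asks: what are the steps to create and traverse this tree? Short of someone handing me a simple answer, there isn't much out there that describes it clearly.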