Regex processing of input data, followed by visualization as a histogram in Python


I currently have tens of thousands of records in the following form:

0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000000   82557
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000001   128805
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000002   94990
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000003   121020
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000004   58111390
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000005   167079
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000006   130795
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000007   236926
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000008   24754217
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000009   75407
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000010   136461
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000011   136748
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000012   146258
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000013   381091
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000014   129815
Visualizing a handful of these records in a simple spreadsheet program is straightforward.

I have been trying to adapt the following code to visualize the data, so far without success:

# Call like this:
# 
# python opcode-farmer.py 'tst21' '6005600401'
# 
import re
import numpy as np
import matplotlib.pyplot as plt
import csv
import sys
import pprint
import itertools 
import subprocess
import collections

def my_test_func(filename, data):
    with open(filename, 'w') as fd:
        fd.write(data)
        fd.write('\n')
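    # universal_newlines=True returns text instead of bytes, so split('\n') below also works on Python 3
    return subprocess.check_output(['evm', 'disasm', filename], universal_newlines=True)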

if '__main__' == __name__:

    file_name = sys.argv[1] 
    byte_code = sys.argv[2]
    status = my_test_func(file_name, byte_code)

    opcodes_list = list()

    for element in status.split('\n'):
        result = re.search(r"\b[A-Z].+", element)
        if result:
            # eliminate individual 0x05 specification 
            simple_opcode = re.sub(r'\s(.*)', '', result.group(0))
            opcodes_list.append(simple_opcode)

    # Count up the values
    cnt = collections.Counter()
    for word in opcodes_list:
         cnt[word] += 1
    print(cnt)

    # THRESHOLD
    threshold = 30
    cnt = collections.Counter(record for record in cnt.elements() if cnt[record] >= threshold)


    # VISUALIZATION

    # Transpose the data to get the x and y values
    labels, values = zip(*cnt.items())


    # Build the bar positions [0 1 2 3 ...], one per label
    indexes = np.arange(len(labels))
    width = 1

    plt.xlabel("most common opcodes in tx")
    plt.ylabel("number of occurrences")

    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
How can I iterate over the input records shown above, strip the prefix 0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_, and then render them as a histogram in Python?

You can try the following:

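import re
# Split each line on whitespace and skip blank lines
with open('filename.txt') as f:
    data = [b for b in (re.split(r"\s+", i.strip()) for i in f) if len(b) > 1]
# Drop everything up to and including the underscore, then convert both fields to int
final_data = [[int(re.sub(r"\w+_", '', a)), int(b)] for a, b in data]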
I ran this code on the data you provided and got the following output:

[[0, 82557], [1, 128805], [2, 94990], [3, 121020], [4, 58111390], [5, 167079], [6, 130795], [7, 236926], [8, 24754217], [9, 75407], [10, 136461], [11, 136748], [12, 146258], [13, 381091], [14, 129815]]
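As an alternative to splitting and then substituting, the prefix can be dropped and both numbers captured in a single pass with one compiled pattern. This is only a sketch: the names `RECORD_RE` and `parse_records` are hypothetical, and the pattern assumes every record is a 0x-prefixed hex address, an underscore, a zero-padded index, whitespace, and an integer value, as in the sample above.

```python
import re

# Assumed record layout: 0x<hex address>_<zero-padded index><whitespace><integer value>
RECORD_RE = re.compile(r"^0x[0-9A-Fa-f]+_(\d+)\s+(\d+)\s*$")

def parse_records(lines):
    """Return a list of (index, value) tuples, skipping lines that do not match."""
    records = []
    for line in lines:
        match = RECORD_RE.match(line)
        if match:
            records.append((int(match.group(1)), int(match.group(2))))
    return records

sample = [
    "0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000000   82557",
    "0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000001   128805",
    "",
    "not a record",
]
print(parse_records(sample))  # [(0, 82557), (1, 128805)]
```

Because non-matching lines are skipped rather than unpacked, blank lines or stray text in the file do not raise an error. To parse a file, pass the open file object, for example `parse_records(open('filename.txt'))`.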
Putting it all together:

import re
import numpy as np
import matplotlib.pyplot as plt
import csv
import sys
import pprint
import itertools 
import subprocess
import collections


# Raw strings keep \s and \w from being treated as invalid string escapes
data = [b for b in [re.split(r"\s+", i.strip()) for i in open('40000_output.txt')] if len(b) > 1]
final_data = [[int(re.sub(r"\w+_", '', a)), int(b)] for a, b in data]


# VISUALIZATION

# Transpose the data to get the x and y values
labels, values = zip(*final_data)


# Build the bar positions [0 1 2 3 ...], one per label
indexes = np.arange(len(labels))
width = 1

plt.xlabel("most common opcodes in tx")
plt.ylabel("number of occurrences")

plt.bar(indexes, values, width)
# Bars are centered on their x position in matplotlib 2.0 and later, so no half-width offset is needed
plt.xticks(indexes, labels)
plt.show()
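With tens of thousands of records, one bar per record becomes hard to read, and the sample values range from about 7.5e4 to 5.8e7, so a few large values dwarf the rest. One option, sketched below, is to bin the values by order of magnitude and plot one bar per bin. The function names `magnitude_bins` and `plot_magnitude_bins` are hypothetical, the values are assumed to be integers, and power-of-ten bins are just one possible choice of binning.

```python
from collections import Counter

def magnitude_bins(values):
    """Count how many positive integers fall into each power-of-ten bin.

    Key k covers the range [10**k, 10**(k + 1)). For a positive integer,
    k is the number of decimal digits minus one. Non-positive values are skipped.
    """
    bins = Counter()
    for value in values:
        if value > 0:
            bins[len(str(int(value))) - 1] += 1
    return bins

def plot_magnitude_bins(bins):
    """Draw one bar per power-of-ten bin."""
    # Imported here so the counting code above has no plotting dependency
    import matplotlib.pyplot as plt
    exponents = sorted(bins)
    positions = list(range(len(exponents)))
    labels = ["1e%d to 1e%d" % (k, k + 1) for k in exponents]
    plt.bar(positions, [bins[k] for k in exponents])
    plt.xticks(positions, labels)
    plt.xlabel("value range")
    plt.ylabel("number of records")
    plt.show()

# The fifteen sample values from the question
values = [82557, 128805, 94990, 121020, 58111390, 167079, 130795, 236926,
          24754217, 75407, 136461, 136748, 146258, 381091, 129815]
bins = magnitude_bins(values)
print(sorted(bins.items()))  # [(4, 3), (5, 10), (7, 2)]
# plot_magnitude_bins(bins)  # uncomment to display the chart
```

Here the second column of `final_data` would be passed as `values`. If finer bins are needed, `plt.hist` with a logarithmic x axis is another route.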

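It isn't structured as such; it's just in the form I posted in the OP. Are you saying I should encode it as JSON?

@s.matthew.english In that case, is the data one complete string or a list of strings?

It's the output of a map reduce job, just cat'd to a .txt file.

@s.matthew.english Thanks for the clarification. Please see my recent edit.

But how do I render it as a histogram?

"...trying to adapt this code..." - are you saying you found this code rather than wrote it?

Do you want to use all of the data or only part of it?

All of the data. Yes, I wrote it.

Is the sample data one record or fifteen records? Is the _000003 part of each line unique? Is that the x data? Do you want a bar chart with tens of thousands of bars, or a histogram with a limited number of bins that aggregate the 82557 part of each line?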