Regex processing of input data, followed by visualization as a histogram in Python
I currently have tens of thousands of records of the following form:
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000000 82557
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000001 128805
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000002 94990
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000003 121020
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000004 58111390
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000005 167079
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000006 130795
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000007 236926
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000008 24754217
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000009 75407
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000010 136461
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000011 136748
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000012 146258
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000013 381091
0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_000014 129815
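Visualizing a few of these records in a simple spreadsheet program is easy.
I have been trying to adapt the following code to produce the same kind of visualization, so far without success: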
# Call like this:
#
# python opcode-farmer.py 'tst21' '6005600401'
#
import re
import numpy as np
import matplotlib.pyplot as plt
import csv
import sys
import pprint
import itertools
import subprocess
import collections

def my_test_func(filename, data):
    with open(filename, 'w') as fd:
        fd.write(data)
        fd.write('\n')
    return subprocess.check_output(['evm', 'disasm', filename])

if '__main__' == __name__:
    file_name = sys.argv[1]
    byte_code = sys.argv[2]
    status = my_test_func(file_name, byte_code)
    opcodes_list = list()
    for element in status.split('\n'):
        result = re.search(r"\b[A-Z].+", element)
        if result:
            # eliminate individual 0x05 specification
            simple_opcode = re.sub(r'\s(.*)', '', result.group(0))
            opcodes_list.append(simple_opcode)
    # Count up the values
    cnt = collections.Counter()
    for word in opcodes_list:
        cnt[word] += 1
    print(cnt)
    # THRESHOLD
    threshold = 30
    cnt = collections.Counter(record for record in cnt.elements() if cnt[record] >= threshold)
    # VISUALIZATION
    # Transpose the data to get the x and y values
    labels, values = zip(*cnt.items())
    # generates this representation: [0 1 2 3 4 5 6 7],
    # from the number of the length
    indexes = np.arange(len(labels))
    width = 1
    plt.xlabel("most common opcodes in tx")
    plt.ylabel("number of occurrences")
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
How can I iterate over the input records shown above, strip the 0x4f0DAA112142FFC4BA1B9f3B76bcd238A094D65B_ prefix, and then render them as a histogram in Python?

You can try the following:
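import re
# Split each line on whitespace and skip blank or malformed lines
with open('filename.txt') as fh:
    data = [b for b in (re.split(r"\s+", line.strip()) for line in fh) if len(b) > 1]
# Strip the "0x..._" prefix and convert both fields to integers
final_data = [[int(re.sub(r"\w+_", '', a)), int(b)] for a, b in data]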
I ran this code against the data you provided and got the following output:
[[0, 82557], [1, 128805], [2, 94990], [3, 121020], [4, 58111390], [5, 167079], [6, 130795], [7, 236926], [8, 24754217], [9, 75407], [10, 136461], [11, 136748], [12, 146258], [13, 381091], [14, 129815]]
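The same parsing step can also be written with a single compiled pattern that validates each line before converting it. This is only a sketch: the names `RECORD` and `parse_records` are made up for illustration, and the pattern assumes every record looks exactly like the sample lines (a hex address, an underscore, a zero-padded index, whitespace, then an integer).

```python
import re

# Assumed record layout, based on the sample lines in the question:
# 0x<hex address>_<zero-padded index><whitespace><integer value>
RECORD = re.compile(r"^0x[0-9A-Fa-f]+_(\d+)\s+(\d+)$")


def parse_records(lines):
    """Return a list of (index, value) tuples, skipping lines that do not match."""
    records = []
    for line in lines:
        match = RECORD.match(line.strip())
        if match:
            records.append((int(match.group(1)), int(match.group(2))))
    return records
```

Because `parse_records` accepts any iterable of strings, it can be fed an open file object directly, for example `with open('filename.txt') as fh: final_data = parse_records(fh)`.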
Putting it all together:
import re
import numpy as np
import matplotlib.pyplot as plt

# Split each line on whitespace and skip blank or malformed lines
with open('40000_output.txt') as fh:
    data = [b for b in (re.split(r"\s+", line.strip()) for line in fh) if len(b) > 1]
# Strip the "0x..._" prefix and convert both fields to integers
final_data = [[int(re.sub(r"\w+_", '', a)), int(b)] for a, b in data]

# VISUALIZATION
# Transpose the data to get the x and y values
labels, values = zip(*final_data)
# generates this representation: [0 1 2 3 4 5 6 7],
# from the number of the length
indexes = np.arange(len(labels))
width = 1
plt.xlabel("most common opcodes in tx")
plt.ylabel("number of occurrences")
plt.bar(indexes, values, width)
# Bars are centered on their x positions by default (matplotlib >= 2.0),
# so the ticks go at the indexes themselves
plt.xticks(indexes, labels)
plt.show()
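With tens of thousands of records, labelling every bar makes the x axis unreadable. One possible workaround, sketched here with a hypothetical helper named `sparse_ticks`, is to label only a handful of evenly spaced positions:

```python
def sparse_ticks(n_items, max_ticks=10):
    """Return at most max_ticks evenly spaced indices from range(n_items)."""
    if n_items <= 0:
        return []
    # Ceiling division, so the number of ticks never exceeds max_ticks
    step = max(1, -(-n_items // max_ticks))
    return list(range(0, n_items, step))
```

Assuming the `labels` tuple from the script above, it could be used as `positions = sparse_ticks(len(labels))` followed by `plt.xticks(positions, [labels[i] for i in positions])`.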
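It isn't structured as such, it's just in the form I posted in the OP. Are you saying I should encode it as JSON?
@s.matthew.english In that case, is the data one complete string, or a list of strings?
It's the output of a map-reduce job, just cat'd to a .txt file.
@s.matthew.english Thanks for the clarification. Please see my recent edit.
But how do I render it as a histogram?
"...trying to adapt this code..." - are you saying you found this code but didn't write it? Do you want to use all of the data or only part of it?
All of the data. Yes, I wrote it.
Is the sample data one record or fifteen records? Is the _000003 part of each line unique? Is that the x data? Do you want a bar chart with tens of thousands of bars, or a histogram with a limited number of bins that aggregate the 82557 part of each line?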
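If what is wanted is a histogram with a limited number of bins, as the last comment suggests, one option is to group the values by order of magnitude, since the sample ranges from about 75 thousand to 58 million. The sketch below is an assumption about the desired output, not something the thread settled on; `decade_histogram` and `plot_decade_histogram` are hypothetical names, and the values are assumed to be positive integers.

```python
from collections import Counter


def decade_histogram(values):
    """Count positive integers per power-of-ten bin [10**k, 10**(k + 1)).

    The number of decimal digits minus one gives k exactly for integers,
    which avoids floating-point rounding in a logarithm.
    """
    counts = Counter(len(str(v)) - 1 for v in values if v > 0)
    return dict(sorted(counts.items()))


def plot_decade_histogram(counts):
    """Draw one bar per power-of-ten bin."""
    # Imported here so the counting step works without matplotlib installed
    import matplotlib.pyplot as plt

    exponents = list(counts)
    positions = range(len(exponents))
    plt.bar(positions, [counts[e] for e in exponents])
    plt.xticks(positions, [f"1e{e} to 1e{e + 1}" for e in exponents])
    plt.xlabel("value range")
    plt.ylabel("number of records")
    plt.show()
```

Given a list of `[index, value]` pairs such as `final_data`, calling `plot_decade_histogram(decade_histogram(v for _, v in final_data))` would draw the chart.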