How to parse an ASM file and get the opcodes in Python

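I have an asm file like the one below. How can I parse its contents with Python 3 and get the opcodes, e.g. ["push", "mov", …, "call"]? Is there a third-party parser, or can anyone help me write a regular expression?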

.text:00401000                             ; Segment type: Pure code
.text:00401000                             ; Segment permissions:     Read/Execute
.text:00401000                             _text           segment para public 'CODE' use32
.text:00401000                                     assume cs:_text
.text:00401000                                     ;org 401000h
.text:00401000                                     assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing
.text:00401000 56                                  push    esi
.text:00401001 8D 44 24 08                             lea     eax, [esp+8]
.text:00401005 50                                  push    eax
.text:00401006 8B F1                                   mov     esi, ecx
.text:00401008 E8 1C 1B 00 00                              call    ??0exception@std@@QAE@ABQBD@Z ; std::exception::exception(char const * const &)
.text:0040100D C7 06 08 BB 42 00                           mov     dword ptr [esi], offset off_42BB08
.text:00401013 8B C6                                   mov     eax, esi
.text:00401015 5E                                  pop     esi
.text:00401016 C2 04 00                                retn    4
.text:00401016                             ; ---------------------------------------------------------------------------
.text:00401019 CC CC CC CC CC CC CC                        align 10h
.text:00401020 C7 01 08 BB 42 00                           mov     dword ptr [ecx], offset off_42BB08
.text:00401026 E9 26 1C 00 00                              jmp     sub_402C51
.text:00401026                             ; ---------------------------------------------------------------------------
.text:0040102B CC CC CC CC CC                              align 10h
.text:00401030 56                                  push    esi
.text:00401031 8B F1                                   mov     esi, ecx
.text:00401033 C7 06 08 BB 42 00                           mov     dword ptr [esi], offset off_42BB08
.text:00401039 E8 13 1C 00 00                              call    sub_402C51
.text:0040103E F6 44 24 08 01                              test    byte ptr [esp+8], 1
.text:00401043 74 09                                   jz      short loc_40104E
.text:00401045 56                                  push    esi
.text:00401046 E8 6C 1E 00 00                              call    ??3@YAXPAX@Z    ; operator delete(void *)
.text:0040104B 83 C4 04                                add     esp, 4

You can try pyparsing:

from pyparsing import Word, hexnums, WordEnd, Optional, alphas, alphanums

address = Word(hexnums)
# Exactly two uppercase hex digits, then a word end, so that mnemonics spelled
# only with hex letters (such as "add") are not consumed as bytes.
hex_byte = Word("0123456789ABCDEF", exact=2) + WordEnd()
line = ".text:" + address + Optional((hex_byte*(1,))("instructions") + Word(alphas, alphanums)("opcode"))

# source is any iterable of listing lines; the file name is a placeholder
source = open("listing.asm")
for source_line in source:
    result = line.parseString(source_line)
    if "opcode" in result:
        print(result.opcode, result.instructions.asList())

For the listing above, up to the jmp instruction, this prints:

push ['56']
lea ['8D', '44', '24', '08']
push ['50']
mov ['8B', 'F1']
call ['E8', '1C', '1B', '00', '00']
mov ['C7', '06', '08', 'BB', '42', '00']
mov ['8B', 'C6']
pop ['5E']
retn ['C2', '04', '00']
align ['CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC']
mov ['C7', '01', '08', 'BB', '42', '00']
jmp ['E9', '26', '1C', '00', '00']

You didn't say you also needed the instruction bytes, but they were easy to include.
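
Since the question also asks about a regular expression, the same idea can be sketched with the standard `re` module. This is only a sketch under assumptions the question does not state: every instruction line starts with `.text:` and a hex address, the raw bytes are printed as uppercase two-digit pairs, and the mnemonic is the first lowercase word after them. The names `LINE_RE` and `extract_opcodes` are made up for illustration.

```python
import re

# Assumed layout: ".text:<address> <BYTE> <BYTE> ... <mnemonic> <operands>"
LINE_RE = re.compile(
    r"^\.text:[0-9A-Fa-f]+\s+"   # segment name and address
    r"((?:[0-9A-F]{2}\s+)+)"     # one or more uppercase hex byte pairs
    r"([a-z][a-z0-9]*)"          # mnemonic: first lowercase word after the bytes
)


def extract_opcodes(lines):
    """Return the mnemonics of all lines that carry instruction bytes."""
    opcodes = []
    for text in lines:
        match = LINE_RE.match(text)
        if match:
            opcodes.append(match.group(2))
    return opcodes


sample = [
    ".text:00401000                             ; Segment type: Pure code",
    ".text:00401000                                     assume cs:_text",
    ".text:00401000 56                                  push    esi",
    ".text:00401001 8D 44 24 08                             lea     eax, [esp+8]",
    ".text:0040104B 83 C4 04                                add     esp, 4",
]
print(extract_opcodes(sample))  # ['push', 'lea', 'add']
```

Directives that follow padding bytes, such as `align`, match this pattern as well, so filter them out afterwards if only real instructions are wanted.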

Do you want to parse the hex or the text on the right?
The file contains both the hex and the text on the right. I only want to get the opcodes (mov, add, push) from the file.
How about writing a script that reads the file line by line and simply cuts off the first characters (how many, maybe 40)?
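
The line-by-line idea from the last comment can work, but cutting at a fixed column looks fragile here, because the mnemonic does not start at the same column on every line of the listing as shown. A minimal sketch that splits on whitespace instead, assuming the bytes are uppercase two-digit hex pairs and the mnemonic is the first token after them (the helper names `is_hex_byte` and `opcode_of` are hypothetical):

```python
import string

HEX_UPPER = set(string.digits + "ABCDEF")


def is_hex_byte(token):
    """True for a two-character uppercase hex pair such as '8D'."""
    return len(token) == 2 and set(token) <= HEX_UPPER


def opcode_of(line):
    """Return the mnemonic on one listing line, or None if there is none."""
    tokens = line.split()
    if not tokens or not tokens[0].startswith(".text:"):
        return None
    # tokens[0] is the ".text:<address>" prefix; the raw bytes follow it.
    i = 1
    while i < len(tokens) and is_hex_byte(tokens[i]):
        i += 1
    # Only report a mnemonic if at least one byte preceded it.
    if 1 < i < len(tokens):
        return tokens[i]
    return None


lines = [
    ".text:00401016 C2 04 00                                retn    4",
    ".text:00401016                             ; ------------------",
    ".text:00401043 74 09                                   jz      short loc_40104E",
]
print([op for op in map(opcode_of, lines) if op])  # ['retn', 'jz']
```

Lines without bytes (comments, `assume`, segment declarations) yield `None` and are dropped by the final list comprehension.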