Lifting ISA manual reference to the processor simulator
Introduction
Often, after spending considerable time reverse-engineering complex systems with tools like IDA or BinaryNinja, one begins to think of ways to streamline the research process. These might involve using the Z3 theorem prover, deeper function analysis, automated structure generation, and more. Eventually, there comes a point when you decide to create your own analysis environment that incorporates these ideas and works exactly as you envision.
This project grew out of that drive toward simplification. It stems from the work of creating a disassembler/assembler and lifter for various processors using an automated approach. Here, we discuss the approaches and methods used to achieve this result.
Approaches, how to make your own processor
To understand the behavior of a processor, one must find ways to describe it accurately. After thorough research, I discovered several methods that could aid in this task, which can be categorized into two sections:
- Approaches based on existing projects targeting the same objectives (LLVM/Ghidra/QEMU).
- Approaches leveraging documentation provided by the processor’s vendor.
Approach №1 - Utilizing LLVM’s/GCC’s Processor Definitions
While their tables can be utilized, they lack comprehensive information regarding instruction behavior. These tables were developed for selecting the correct instruction based on the DAG. Consequently, additional coding is required to define the operands of each processor’s instructions and to lift instructions to IR. Thus, this approach proves to be insufficient.
Approach №2 - Utilizing Ghidra’s Pcode Definitions
Employing Ghidra’s well-established pcode is a sound idea, already implemented in mature projects like remill. However, these pcode definitions are closely tied to Ghidra’s framework, which makes them hard to use in a standalone project without Ghidra. Although pcode leaves many instructions as intrinsics that still need to be implemented, it covers a significant portion of instructions, making it a viable but not flawless approach.
Approach №3 - Utilizing QEMU’s TCG Generator Definitions
QEMU’s TCG IL serves as a foundation for lifting to other ILs. Projects like Revng and Relyze utilize QEMU’s TCG. Because QEMU has to emulate the processor, TCG implements instructions more completely than pcode does, but like pcode it is tied to a third-party project. It is efficient for lifting architectures that are already defined. For a new architecture, however, significant time is required to implement it and to develop components like a disassembler/assembler that provide complete information about operands.
Approach №4 - Referring to ISA References
Referring to a processor’s ISA references proves beneficial, as they provide comprehensive descriptions of processor behavior. For instance, I found ARM’s reference, available as XML definitions for instructions, simplifying reference handling and aiding in building custom definitions based on it.
Surprisingly, other vendors only provide references in PDF format. However, these PDFs contain sufficient information about opcode format definitions, along with pseudocode for almost all instructions, which makes it possible to create tables for use in a generator for a disassembler and lifter.
The pseudocode section is the key part of a reference. It relies on the internal names used in the opcode definitions. It can be converted to an AST and used to generate a lifter automatically. At first glance this looked simple, flexible, and achievable.
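To make the pseudocode-to-AST idea concrete, here is a minimal sketch in Python. It parses a single assignment in the style found in MIPS references. The node classes (`Name`, `Index`, `BinOp`, `Assign`) and the function `parse_statement` are hypothetical names invented for this illustration, and real manuals need a far larger grammar than this one.

```python
import re
from dataclasses import dataclass

# Hypothetical AST node types, invented for this sketch.
@dataclass
class Name:
    ident: str

@dataclass
class Index:
    base: str
    index: object

@dataclass
class BinOp:
    op: str
    left: object
    right: object

@dataclass
class Assign:
    target: object
    value: object

# One token: the assignment arrow, an identifier, a number, or a symbol.
TOKEN = re.compile(r"\s*(←|[A-Za-z_]\w*|\d+|[\[\]+\-&|^])")

def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at {text[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_statement(text):
    """Parse 'target ← expr' into an Assign node (left-associative operators)."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def atom():
        tok = take()
        if tok.isdigit():
            return int(tok)
        if not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError(f"unexpected token {tok!r}")
        if peek() == "[":  # indexed access such as GPR[rd]
            take("[")
            inner = expr()
            take("]")
            return Index(tok, inner)
        return Name(tok)

    def expr():
        node = atom()
        while peek() in ("+", "-", "&", "|", "^"):
            op = take()
            node = BinOp(op, node, atom())
        return node

    target = atom()
    take("←")
    value = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing token {peek()!r}")
    return Assign(target, value)

tree = parse_statement("GPR[rd] ← GPR[rs] + GPR[rt]")
```

A lifter generator could then walk such a tree and emit IL for each node, with the operand names (`rd`, `rs`, `rt`) resolved against the fields of the opcode definition.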
What useful information can be found in an ISA reference
Let’s see what we can get from the reference manuals. All references are different, but they share some universal information, whether it is an Intel, MIPS, or ARM reference or even an old m68k one.
Section | What we’ll get |
---|---|
Opcode/Format encoding | Describing instruction format and variables |
Operation | Providing comprehensive descriptions of behavior, often with intuitive language |
Description | Supplementary information useful for generating architecture documentation |
Other/Custom | Detailing operation behavior, such as flag effects or potential exceptions |
The figure below (MIPS 6.6) illustrates how operation descriptions contain the operand names used in the provided pseudocode.
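As an illustration of how the encoding section and the operand names fit together, the sketch below writes the well-known MIPS32 `ADD` encoding as a field table and derives a mask/match pair and an operand decoder from it. The table layout and the helper names `build_mask_match` and `decode` are assumptions made for this example, not the format used by the project.

```python
# Field layout of MIPS32 ADD, most significant field first:
# (name, width in bits, fixed value or None when the field is an operand)
ADD_FORMAT = [
    ("opcode", 6, 0b000000),  # SPECIAL
    ("rs", 5, None),
    ("rt", 5, None),
    ("rd", 5, None),
    ("zero", 5, 0b00000),
    ("funct", 6, 0b100000),   # ADD
]

def build_mask_match(fmt):
    """Fold the fixed fields into a (mask, match) pair for fast matching."""
    mask = match = 0
    for _name, width, fixed in fmt:
        mask <<= width
        match <<= width
        if fixed is not None:
            mask |= (1 << width) - 1
            match |= fixed
    return mask, match

def decode(word, fmt):
    """Return the operand fields of `word`, or None if the fixed bits differ."""
    mask, match = build_mask_match(fmt)
    if (word & mask) != match:
        return None
    operands = {}
    shift = sum(width for _name, width, _fixed in fmt)
    for name, width, fixed in fmt:
        shift -= width
        if fixed is None:
            operands[name] = (word >> shift) & ((1 << width) - 1)
    return operands

# add $t0, $t1, $t2  ->  rd = 8, rs = 9, rt = 10
fields = decode(0x012A4020, ADD_FORMAT)
```

The decoded names `rs`, `rt`, and `rd` are exactly the ones the operation pseudocode refers to, which is what makes it possible to connect the two sections automatically.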
It’s also a good idea to apply this approach to X86, given its numerous AVX and SSE operations, each with various forms and behaviors. Attempting to manually write lifters for all of these operations would consume a significant portion of one’s lifetime.
How to handle PDF files
Initially, I encountered Intel references by felixcloutier, along with scripts for handling PDFs on GitHub, such as x86doc. These scripts utilize pdfminer to extract information about PDF structure, enabling the creation of sentences and tables from extracted data. While this approach simplifies generating descriptions, manual verification and correction are necessary, particularly due to inconsistencies and misprints in PDFs.
With this knowledge, you can compose objects and create sentences and tables independently. Additionally, most pages within a single PDF adhere to a consistent style in terms of structure, font usage, and other aspects, which simplifies the process of generating descriptions.
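Because pages share a consistent style, font information alone can go a long way when splitting a page into sections. The sketch below assumes the text runs have already been pulled out of the PDF as `(text, font name, font size)` tuples. The font names, sizes, and the `split_sections` helper are hypothetical values chosen for the example, and each real manual needs its own style table.

```python
# Hypothetical style table: which (font, size) pair marks which role.
STYLE_ROLES = {
    ("Helvetica-Bold", 12.0): "heading",
    ("Courier", 9.0): "pseudocode",
}

def split_sections(spans, style_roles=STYLE_ROLES):
    """Group text runs into sections; a heading run opens a new section."""
    sections = []
    current = None
    for text, font, size in spans:
        role = style_roles.get((font, size), "body")
        if role == "heading":
            current = {"title": text.strip(), "body": [], "pseudocode": []}
            sections.append(current)
        elif current is not None:  # runs before the first heading are skipped
            current[role].append(text.strip())
    return sections

# Sample input standing in for runs extracted from one page.
sample_spans = [
    ("ADD\n", "Helvetica-Bold", 12.0),
    ("Adds two registers.\n", "Times-Roman", 10.0),
    ("GPR[rd] ← GPR[rs] + GPR[rt]\n", "Courier", 9.0),
    ("SUB\n", "Helvetica-Bold", 12.0),
]
sections = split_sections(sample_spans)
```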
However, not all aspects are handled perfectly, even when dealing with references like the x86 reference. After processing and generating a result, it’s essential to thoroughly review and rectify any errors. Therefore, I save the described instructions in a structured JSON format to facilitate this review and correction process.
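A minimal sketch of what such a JSON record might look like follows. The field names (`mnemonic`, `format`, `operation`, `reviewed`) and both helper functions are assumptions made for illustration, and the operation lines are simplified rather than copied from a manual.

```python
import json

def dump_records(records):
    """Serialize records in a stable, diff-friendly form for manual review."""
    return json.dumps(records, indent=2, ensure_ascii=False, sort_keys=True)

def needs_review(records):
    """List mnemonics that nobody has checked against the PDF yet."""
    return [r["mnemonic"] for r in records if not r.get("reviewed", False)]

records = [
    {
        "mnemonic": "ADD",
        "format": "ADD rd, rs, rt",
        "operation": ["GPR[rd] ← GPR[rs] + GPR[rt]"],  # simplified
        "reviewed": True,
    },
    {
        "mnemonic": "SUB",
        "format": "SUB rd, rs, rt",
        "operation": ["GPR[rd] ← GPR[rs] - GPR[rt]"],  # simplified
        "reviewed": False,
    },
]

text = dump_records(records)
pending = needs_review(records)
```

Sorted keys and fixed indentation keep the files stable under version control, so a manual correction shows up as a small, readable diff.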
Main problem of PDF references - It’s a TRAP!
Cleaning up the sections recovered from the PDF becomes a routine chore, because PDF files exhibit minor differences that the parser cannot handle. For instance, in descriptions of MIPS instructions, processor features may appear either above or below the instruction’s format. There may also be typographical errors or misprints in the text.
As illustrated, an extra comma has been inadvertently added here. It’s common for PDFs to contain several misprints or errors, and these need to be addressed before utilizing the generated descriptions.
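Some of these misprints can at least be flagged automatically before the manual pass. The following sketch checks an assembler format line for empty operand slots, which is what a stray comma produces. The function `clean_format` is a hypothetical helper, and it assumes the simple `MNEMONIC op, op, op` shape.

```python
def clean_format(line):
    """Return (cleaned_line, warnings) for an assembler format line."""
    warnings = []
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [part.strip() for part in rest.split(",")]
    kept = []
    for position, operand in enumerate(operands):
        if operand:
            kept.append(operand)
        elif rest.strip():  # an empty slot between or after commas
            warnings.append(f"empty operand slot at position {position}")
    cleaned = mnemonic if not kept else f"{mnemonic} {', '.join(kept)}"
    return cleaned, warnings

cleaned, warnings = clean_format("ADD rd, rs,, rt")
```

A warning does not fix anything by itself. It only points the reviewer at the lines that most likely need a look.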
Challenges Encountered in Various ISA References
MIPS References
Incorrect Pseudocode: Occasionally, irrelevant content appears in provided pseudocode, challenging correct lexing.
Unconventional Title Naming: Inconsistent title formatting complicates content handling.
Inaccurate Opcode Descriptions: Accurately parsing opcode boxes proves challenging, requiring custom handling.
Table Anomalies: Deviations in table structure confuse parsers, resulting in incorrect outputs.
AVR References
Laziness - Incomplete field naming conventions necessitate custom handling.
Intel References
Inconsistent Pseudocode - Intel documents occasionally contain irrelevant content in pseudocode sections.
Here, a stray “dot” is counted as part of the pseudocode section.
Multiple Syntax Formats: Intel’s pseudocode uses several syntax formats, which adds complexity:
- The first format is similar to Python
- The second one is similar to Pascal
- And the third is like a weird C
Incomplete Pseudocode - If you thought Intel’s pseudocode contains no big mistakes, you are wrong.
Here, you can observe the description of CWD/CDQ/CQO, which are instructions designed to sign-extend an integer into a register pair of twice the width. Following the conversion, the sign of the integer is held in the DX register, while the main part stays in AX. However, the provided pseudocode only describes the change to the DX register and lacks any information about AX.
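For comparison, the DX:AX behavior described above (in x86 this pairing belongs to the 16-bit `CWD` form) can be written out in full. This is a minimal sketch of what a complete description has to state, with `cwd` being a made-up helper name rather than anything from the manual.

```python
def cwd(ax):
    """Return (dx, ax) after sign-extending the 16-bit AX into DX:AX."""
    ax &= 0xFFFF
    # DX receives sixteen copies of the sign bit; AX keeps its value.
    dx = 0xFFFF if ax & 0x8000 else 0x0000
    return dx, ax

negative = cwd(0x8000)
positive = cwd(0x1234)
```

A generated lifter needs both halves of that result spelled out, so a gap like this in the pseudocode has to be patched by hand in the stored description.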
Achievements
Despite the challenges, descriptions for each instruction outlined in handled PDFs were obtained, covering architectures such as x86, mips16, micromips, mips, nanomips, avr8, and avr32.
Arch | Parsed instances |
---|---|
x86 | 789 |
mips16 includes ase 16/16e2 | 115 |
micromips(32/64) with ase dsp/mcu/mt/vz | 414 |
mips(32/64) with ase 3d/dsp/mcu/msa/mt/smart/vz | 671 |
nanomips with ase dsp/mt | 299 |
avr8 | 126 |
avr32 | 196 |
m68k | currently in progress |
As a bonus, AArch(32/64) was added as well, since it is already defined in XML form.
These descriptions can be easily converted to HTML for use as a quick reference manual; see arch_index.
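A minimal sketch of such a conversion is shown below, assuming each description is a flat dictionary with a `mnemonic` key. The record layout and the `render_instruction` helper are hypothetical and are not the generator behind arch_index.

```python
from html import escape

def render_instruction(record):
    """Render one instruction record as an HTML heading plus a two-column table."""
    rows = "".join(
        f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
        for key, value in record.items()
        if key != "mnemonic"
    )
    return f"<h2>{escape(record['mnemonic'])}</h2><table>{rows}</table>"

page = render_instruction({
    "mnemonic": "AND",
    "format": "AND rd, rs, rt",
    "operation": "GPR[rd] ← GPR[rs] & GPR[rt]",  # simplified
})
```

Escaping matters here because pseudocode is full of characters such as `&` and `<` that would otherwise break the markup.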
Conclusion
For the generation process, PDF files containing comprehensive descriptions of processor behavior and information about instruction encoding formats were utilized. It’s evident that all references contain errors and do not guarantee the accuracy of all information. However, ultimately, PDFs provide significantly more information compared to other sources.