Lifting ISA manual reference to the processor simulator

Introduction

When you spend considerable time reverse-engineering complex systems with tools like IDA or BinaryNinja, you start to think of ways to streamline the research process. These might involve using the Z3 theorem prover, deeper function analysis, automated structure generation, and more. Eventually, there comes a point when you decide to create your own analysis environment that incorporates these ideas and works exactly as you envision.

This project is about that kind of simplification. It grew out of the work of creating a disassembler/assembler and lifter for various processors using an automated approach. Here, we discuss the approaches and methods used to achieve this result.

Approaches: how to make your own processor

To understand the behavior of a processor, one must find ways to describe it accurately. After thorough research, I discovered several methods that could aid in this task, which can be categorized into two sections:

  1. Approaches based on existing projects targeting the same objectives (LLVM/Ghidra/QEMU).
  2. Approaches leveraging documentation provided by the processor’s vendor.

Approach №1 - Utilizing LLVM’s/GCC’s Processor Definitions

The LLVM and GCC tables can be used, but they lack comprehensive information about instruction behavior. They were developed for instruction selection, that is, for picking the correct instruction based on the DAG. Consequently, additional code is required to define each instruction’s operands and to lift instructions to IR, so this approach proves insufficient.

llvm td definitions

Approach №2 - Utilizing Ghidra’s Pcode Definitions

Using Ghidra’s well-established pcode is a sound idea, and it is already used in mature projects like remill. However, the pcode definitions are closely tied to Ghidra’s framework, which makes them hard to use in a standalone project without Ghidra. They cover a significant portion of instructions, but many others are represented only by intrinsics that still have to be implemented. This makes it a viable but not flawless approach.

ghidra pcode

Approach №3 - Utilizing QEMU’s TCG Generator Definitions

QEMU’s TCG IL can serve as a foundation for lifting to other ILs. Projects like Revng and Relyze use QEMU’s TCG. Because it is built for processor emulation, it implements instructions more completely than pcode but, like pcode, it relies on third-party components. It is efficient for lifting architectures that are already defined, but adding a new architecture takes significant time, as does developing components like a disassembler/assembler to obtain complete information about operands.

Approach №4 - Referring to ISA References

Referring to a processor’s ISA reference proves beneficial, as it provides a comprehensive description of processor behavior. For instance, ARM publishes its reference as XML definitions of instructions, which simplifies handling the reference and helps in building custom definitions from it.

arm documentation download

Surprisingly, other vendors only provide references in PDF format. However, these PDFs contain sufficient information about opcode format definitions and pseudocode for most instructions, facilitating the creation of tables for use in a generator for disassembler and lifter.

The pseudocode section is the key part of a reference. It relies on the internal names used in the opcode definitions, so it can be converted to an AST and used to generate a lifter automatically. This looked simple, flexible, and achievable to me.
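To make the pseudocode-to-AST idea concrete, here is a minimal sketch in Python. It is not the project's parser. It assumes a tiny, hypothetical pseudocode subset with an ASCII `<-` assignment arrow, `BANK[field]` register references, and `+`/`-` operators. The node names `Reg`, `BinOp`, and `Assign` are made up for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical AST node types; names are illustrative only.
@dataclass
class Reg:
    bank: str   # register file name, e.g. "GPR"
    field: str  # opcode field that selects the register, e.g. "rs"

@dataclass
class BinOp:
    op: str
    left: object
    right: object

@dataclass
class Assign:
    target: Reg
    value: object

# "<-" must be tried before "-" so the arrow is not split in two.
TOKEN = re.compile(r"\s*(<-|[A-Za-z_]\w*|[\[\]+\-])")

def tokenize(text):
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input: {text[pos:]!r}")
        out.append(m.group(1))
        pos = m.end()
    return out

def parse_reg(toks, i):
    # Expects four tokens: NAME '[' NAME ']'
    if len(toks) < i + 4 or toks[i + 1] != "[" or toks[i + 3] != "]":
        raise SyntaxError("expected BANK[field]")
    return Reg(toks[i], toks[i + 2]), i + 4

def parse_statement(text):
    toks = tokenize(text)
    target, i = parse_reg(toks, 0)
    if i >= len(toks) or toks[i] != "<-":
        raise SyntaxError("expected '<-'")
    value, i = parse_reg(toks, i + 1)
    # Left-associative chain of + and - operators.
    while i < len(toks) and toks[i] in ("+", "-"):
        op = toks[i]
        right, i = parse_reg(toks, i + 1)
        value = BinOp(op, value, right)
    if i != len(toks):
        raise SyntaxError("trailing tokens")
    return Assign(target, value)

print(parse_statement("GPR[rd] <- GPR[rs] + GPR[rt]"))
```

Because the register nodes keep the opcode field names (`rd`, `rs`, `rt`), a generator can later bind them to the values decoded from the instruction encoding.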

What useful information can be found in an ISA reference

Let’s see what we can get from the reference manuals. All references are different, but they share some universal information, whether it’s an Intel, MIPS, ARM, or even an old m68k reference.

| Section | What we’ll get |
| --- | --- |
| Opcode/Format encoding | Describing instruction format and variables |
| Operation | Providing comprehensive descriptions of behavior, often with intuitive language |
| Description | Supplementary information useful for generating architecture documentation |
| Other/Custom | Detailing operation behavior, such as flag effects or potential exceptions |

As the figure below (MIPS 6.6) illustrates, operation descriptions contain the operand names used in the provided pseudocode. mips doc
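As a rough illustration of how a format/encoding section can drive a disassembler, the sketch below describes the MIPS32 ADD encoding as a table of fields and decodes a word against it. The table layout and the `decode` helper are my own assumptions for this example, not the format the generator actually uses.

```python
# Field layout of MIPS32 ADD, most significant field first:
# (name, width in bits, fixed value or None for an operand field).
ADD_FORMAT = [
    ("opcode", 6, 0b000000),  # SPECIAL
    ("rs", 5, None),
    ("rt", 5, None),
    ("rd", 5, None),
    ("zero", 5, 0b00000),
    ("funct", 6, 0b100000),   # ADD
]

def decode(word, fmt, width=32):
    """Return the operand fields if `word` matches `fmt`, else None."""
    fields = {}
    shift = width
    for name, bits, fixed in fmt:
        shift -= bits
        value = (word >> shift) & ((1 << bits) - 1)
        if fixed is not None:
            if value != fixed:
                return None  # a constant field differs: not this instruction
        else:
            fields[name] = value
    return fields

# 0x012A4020 encodes: add $t0, $t1, $t2
print(decode(0x012A4020, ADD_FORMAT))  # {'rs': 9, 'rt': 10, 'rd': 8}
```

The decoded field names are the same names the pseudocode refers to, which is what allows the disassembler and the lifter to be generated from one description.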

It’s also a good idea to apply this approach to x86, given its numerous AVX and SSE operations, each with various forms and behaviors. Writing lifters for all of these operations by hand would consume a significant portion of one’s lifetime.

intel doc

How to handle PDF files

I first came across the Intel reference published by felixcloutier, along with GitHub scripts for handling PDFs, such as x86doc. These scripts use pdfminer to extract information about the PDF structure, which makes it possible to reconstruct sentences and tables from the extracted data. This simplifies generating descriptions, but manual verification and correction are still necessary because of inconsistencies and misprints in the PDFs.

With this information, you can compose the extracted objects into sentences and tables yourself. Additionally, most pages within a single PDF adhere to a consistent style in terms of structure, font usage, and other aspects, which simplifies the process of generating descriptions.
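Composing objects into sentences can be sketched roughly as follows. The `Fragment` type is a hypothetical stand-in for the positioned text pieces a PDF layout library reports; it is not pdfminer's API. The grouping rule (same baseline within a tolerance, then left to right) is an assumption that suits single-column pages.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    x: float    # left edge of the text piece
    y: float    # baseline; PDF coordinates grow upward
    font: str
    text: str

def group_into_lines(fragments, y_tolerance=2.0):
    """Group fragments into (font, text) lines, ordered top to bottom."""
    lines = []
    # Visit fragments from the top of the page down.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)   # same baseline: same line
        else:
            lines.append([frag])     # baseline moved: start a new line
    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)  # left to right within the line
        # The font of the first fragment is a cheap hint for classifying
        # the line later, e.g. bold for a section title.
        result.append((line[0].font, " ".join(f.text for f in line)))
    return result

page = [
    Fragment(120, 700.5, "Bold", "Word"),
    Fragment(72, 700.0, "Bold", "Add"),
    Fragment(72, 680.0, "Regular", "GPR[rd]"),
]
for font, text in group_into_lines(page):
    print(font, "|", text)
```

Since pages in one PDF share fonts and layout, the font hint is usually enough to tell section titles from body text.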

However, not all aspects are handled perfectly, even when dealing with references like the x86 reference. After processing and generating a result, it’s essential to thoroughly review and rectify any errors. Therefore, I save the described instructions in a structured JSON format to facilitate this review and correction process.
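The exact JSON schema is not shown here, so the following is only a guess at what such a record might look like. The field names (`mnemonic`, `encoding`, `operation`, `reviewed`, `notes`) are hypothetical. The point is that a flat, human-readable record is easy to diff and fix by hand.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class InstructionDoc:
    mnemonic: str
    encoding: list            # [[field name, bit width], ...]
    operation: str            # pseudocode as extracted from the PDF
    description: str = ""
    reviewed: bool = False    # set once a human has checked the entry
    notes: list = field(default_factory=list)

doc = InstructionDoc(
    mnemonic="ADD",
    encoding=[["opcode", 6], ["rs", 5], ["rt", 5],
              ["rd", 5], ["zero", 5], ["funct", 6]],
    operation="GPR[rd] <- GPR[rs] + GPR[rt]",
    description="Add Word",
)

text = json.dumps(asdict(doc), indent=2)
print(text)

# Loading the corrected file back gives the same record.
restored = InstructionDoc(**json.loads(text))
assert restored == doc
```

A `reviewed` flag like this makes it possible to track which entries have already been corrected after the automatic pass.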

pdf_handling1

The main problem with PDF references - it’s a TRAP!

Cleaning up the sections recovered from a PDF is tedious, routine work, because some PDF files have minor differences that the parser cannot handle. For instance, in the descriptions of MIPS instructions, processor features are sometimes listed above the instruction’s format and sometimes below it. There are also typographical errors and misprints in the text.

mistake_example

As illustrated, an extra comma has been inadvertently added here. It’s common for PDFs to contain several misprints or errors, and these need to be addressed before utilizing the generated descriptions.
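Some misprints of this kind can be flagged automatically before manual review. The check below is a small sketch under my own assumptions: it treats a syntax line as a mnemonic followed by a comma-separated operand list and reports operands that come out empty. The function name and the example lines are invented.

```python
def find_empty_operands(syntax):
    """Return the positions of empty operands in a line like 'ADD rd, rs,, rt'."""
    _mnemonic, _, operands = syntax.strip().partition(" ")
    if not operands.strip():
        return []  # no operand list at all, e.g. 'NOP'
    parts = [p.strip() for p in operands.split(",")]
    return [i for i, p in enumerate(parts) if not p]

for line in ["ADD rd, rs, rt", "ADD rd, rs,, rt", "NOP"]:
    print(f"{line!r}: suspicious operand positions {find_empty_operands(line)}")
```

A check like this only points at candidates. Whether the comma is really a misprint still has to be decided by looking at the PDF.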

Challenges Encountered in Various ISA References

MIPS References

Incorrect Pseudocode: Occasionally, irrelevant content appears in the provided pseudocode, which makes correct lexing difficult. mips mistake 1

mips mistake 4

Unconventional Title Naming: Inconsistent title formatting complicates content handling. mips mistake 2

Inaccurate Opcode Descriptions: Accurately parsing opcode boxes proves challenging, requiring custom handling. mips mistake 3

Table Anomalies: Deviations in table structure confuse parsers, resulting in incorrect outputs. mips mistake 5

AVR References

Laziness - Incomplete field naming conventions necessitate custom handling.
avr mistake 1

Intel References

Inconsistent Pseudocode - Intel documents occasionally contain irrelevant content in pseudocode sections. intel mistake 1

Here a stray “dot” is counted as part of the pseudocode section. intel mistake 2

Multiple Syntax Formats: Intel’s pseudocode uses several syntax formats, which adds complexity:

  • The first format is similar to Python intel mistake 3.1
  • The second is similar to Pascal intel mistake 3.2
  • The third is like a weird C intel mistake 3

Incomplete Pseudocode - If you thought Intel has no big mistakes in its pseudocode, you are wrong.

Here you can see the description of CWD/CDQ/CQO, instructions that sign-extend a value into a register pair of twice the width. In the 16-bit form, the sign bits end up in the DX register, while the original value stays in AX. However, the provided pseudocode only describes the assignment to DX and says nothing about AX. intel mistake 4
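The DX and AX behavior described above matches the 16-bit CWD form, where the sign-extended value of AX occupies the DX:AX pair. A lifter needs the complete semantics, so here is a minimal sketch written from that description, not from the manual's pseudocode. The helper names `sign_extend` and `cwd` are my own.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend the low `from_bits` of `value` to `to_bits` bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):      # sign bit set: value is negative
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)

def cwd(ax):
    """Model the 16-bit form: DX:AX := sign-extend(AX). Returns (dx, ax)."""
    wide = sign_extend(ax, 16, 32)
    return (wide >> 16) & 0xFFFF, wide & 0xFFFF

print([hex(v) for v in cwd(0x8000)])  # ['0xffff', '0x8000']
print([hex(v) for v in cwd(0x1234)])  # ['0x0', '0x1234']
```

Writing the result as an explicit pair makes it visible that AX keeps its value and only DX changes, which is the part a generated lifter cannot infer from an assignment to DX alone.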

Achievements

Despite the challenges, descriptions were obtained for every instruction in the processed PDFs, covering x86, mips16, micromips, mips, nanomips, avr8, and avr32.

| Arch | Parsed instances |
| --- | --- |
| x86 | 789 |
| mips16 includes ase 16/16e2 | 115 |
| micromips(32/64) with ase dsp/mcu/mt/vz | 414 |
| mips(32/64) with ase 3d/dsp/mcu/msa/mt/smart/vz | 671 |
| nanomips with ase dsp/mt | 299 |
| avr8 | 126 |
| avr32 | 196 |
| m68k | currently in progress |

As a bonus, AArch(32/64) was included, since it is already defined in XML form.

These descriptions can be easily converted to HTML and used as a quick-reference manual, see arch_index
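The conversion itself can be very small. The sketch below assumes each description is a flat dictionary with a `mnemonic` key, which is a hypothetical layout, and renders it with the standard library only.

```python
import html

def render_instruction(doc):
    """Render one instruction description (a flat dict) as an HTML fragment."""
    rows = "".join(
        f"<tr><th>{html.escape(key)}</th>"
        f"<td><pre>{html.escape(str(value))}</pre></td></tr>"
        for key, value in doc.items()
        if key != "mnemonic"
    )
    return f"<h2>{html.escape(doc['mnemonic'])}</h2><table>{rows}</table>"

page = render_instruction({
    "mnemonic": "ADD",
    "operation": "GPR[rd] <- GPR[rs] + GPR[rt]",
    "description": "Add Word",
})
print(page)
```

Escaping matters here because the pseudocode is full of characters such as `<` that would otherwise be read as markup.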

Conclusion

For the generation process, PDF files containing comprehensive descriptions of processor behavior and information about instruction encoding formats were utilized. It’s evident that all references contain errors and do not guarantee the accuracy of all information. However, ultimately, PDFs provide significantly more information compared to other sources.