Categories
advpi

advpi – Week 6 – Code Execution

This week was hyper-focused on getting normal, proper code execution.

Code layout

The first thing to get right was the code layout: how the instructions end up in guest memory.

I found this wonderful project, kvm-arm32-on-arm64, which showed a clean way of doing this.

It turns out the ARM converter prints the code in the byte order it has in memory (little-endian). To write it correctly, I took the big-endian output instead and stored each instruction in the code as a uint32_t. This way, the compiler lays each word out in little-endian order as required, and I don’t need to think about endianness for a while.
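A small sketch of why this works, assuming a little-endian host like the Pi running arm64 Linux (the helper names are mine, purely for illustration): writing the NOP word 0xE320F000 as a uint32_t literal leaves the bytes 00 F0 20 E3 in memory, which is the memory-order form the converter prints.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the bytes of `word` exactly as this host stores them in memory.
std::array<std::uint8_t, 4> bytesInMemory(std::uint32_t word) {
    std::array<std::uint8_t, 4> bytes{};
    std::memcpy(bytes.data(), &word, sizeof(word));
    return bytes;
}

// Rebuilds an instruction word from bytes listed in memory order,
// least significant byte first (little-endian).
std::uint32_t wordFromLittleEndian(const std::array<std::uint8_t, 4>& b) {
    return static_cast<std::uint32_t>(b[0]) |
           static_cast<std::uint32_t>(b[1]) << 8 |
           static_cast<std::uint32_t>(b[2]) << 16 |
           static_cast<std::uint32_t>(b[3]) << 24;
}
```

So `wordFromLittleEndian({0x00, 0xF0, 0x20, 0xE3})` gives back 0xE320F000, and on a little-endian host `bytesInMemory(0xE320F000)` gives the bytes in that same order.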

So if I take this bit of ARM code, which loops:

nop
nop
nop
mov r0,0x0
mov r1,0x1000
mov r2,0x1
add r3,r1,r2
str r3,[r1]
mov r15,0x2000000

We convert it to the following array.

uint32_t CODE[] = {
    0xE320F000,
    0xE320F000,
    0xE320F000,
    0xE3A00000,
    0xE3A01A01,
    0xE3A02001,
    0xE0813002,
    0xE5813000,
    0xE3A0F402,
};
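To double-check the hand-converted words, this sketch decodes the A32 data-processing immediate: an 8-bit value rotated right by twice a 4-bit rotate field. That is how an instruction encodes constants like 0x1000 that don't fit in 8 bits. The helper names are hypothetical, and the sketch only covers the immediate form of MOV/ADD-style instructions.

```cpp
#include <cassert>
#include <cstdint>

// Decodes bits 11..0 of an A32 data-processing instruction with an
// immediate operand: imm8 rotated right by (2 * rotate).
std::uint32_t decodeModifiedImmediate(std::uint32_t instruction) {
    std::uint32_t imm8 = instruction & 0xFFu;
    std::uint32_t rotate = ((instruction >> 8) & 0xFu) * 2;
    if (rotate == 0) return imm8;  // avoid an undefined shift by 32
    return (imm8 >> rotate) | (imm8 << (32 - rotate));
}

// Extracts the destination register field (bits 15..12).
unsigned destinationRegister(std::uint32_t instruction) {
    return (instruction >> 12) & 0xFu;
}
```

For 0xE3A01A01 the rotate field is 0xA (rotate right by 20) and imm8 is 1, giving r1 = 0x1000; for 0xE3A0F402 the rotate field is 4 (rotate right by 8) and imm8 is 2, giving r15 = 0x02000000.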

I would expect KVM to return with an MMIO exit at 0x1000 (as the program attempts to write to a read-only page), ending the run there. If it loops or does anything else, something is wrong.

Thankfully, we get this wonderful(ly confusing) output:

Hello, from advpi!
Opened bios
Attempted mmio
Attempted write=yes of value=4097 and at address=4096
Register(0)=0
Register(1)=0
Register(2)=0
Register(3)=0
Register(4)=0
Register(5)=0
Register(6)=0
Register(7)=0
Register(8)=0
Register(9)=0
Register(10)=0
Register(11)=0
Register(12)=0
Register(13)=0
Register(14)=0
Register(15)=0
Closing the Virtual Machine

So, the MMIO output was correct, but we can’t read the registers back: every one of them reads as 0. I’m thinking it is related to the exception levels and register banks that are available, but I’ll dig into it when I need to.

So although technically we can’t see the registers, we can get some rudimentary output through MMIO, or shared memory.
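A sketch of pulling the written value out of an MMIO exit, using the `mmio` fields of `struct kvm_run` from `<linux/kvm.h>`. It assumes a little-endian guest (the GBA is little-endian) and only handles writes; the helper names are hypothetical, not from the advpi code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <linux/kvm.h>

// Combines the bytes of an MMIO store into one value, least significant
// byte first (assumes a little-endian guest).
std::uint64_t mmioWriteValue(const kvm_run& run) {
    assert(run.exit_reason == KVM_EXIT_MMIO && run.mmio.is_write && run.mmio.len <= 8);
    std::uint64_t value = 0;
    for (std::uint32_t i = 0; i < run.mmio.len; i++)
        value |= static_cast<std::uint64_t>(run.mmio.data[i]) << (8 * i);
    return value;
}

// Prints a line in the same shape as the "Attempted write" output above.
void logMmioWrite(const kvm_run& run) {
    std::printf("Attempted write=yes of value=%llu and at address=%llu\n",
                static_cast<unsigned long long>(mmioWriteValue(run)),
                static_cast<unsigned long long>(run.mmio.phys_addr));
}
```

For the store in the program above, the guest writes 4097 (0x00001001) to 0x1000, so the data bytes arrive as 01 10 00 00 and `mmioWriteValue` reassembles 4097.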

If I remove the MMIO-causing statement, we get an endless loop, showing that the MMIO exit is what stops the program, as intended.

What’s next

This week was short as well. Next, I’ll be setting up a GBA cart and the video device. I’m thinking of using SDL2 to show the video output and handle device I/O.


advpi – Week 3,4,5 – Mapping the BIOS and shifting to C++

These weeks were an unamusing move from C to C++, plus adding the GBA BIOS.

BIOS Map

The Nintendo BIOS contains the basic code needed to initialize and use the GBA. After that, the BIOS hands over control to the game that’s plugged in. My initial thought was to read the file into a buffer myself, then I remembered the golden word that haunted me last week: mmap. It was what I used to create the memory that is mapped into the guest VM.

So, I mmap-ed the BIOS file into memory, then added it to the VM.

void* biosRom =
        mmap(0, BIOS_SIZE, PROT_READ | PROT_EXEC, MAP_SHARED, biosFd, 0);

Now it’s up to the kernel to manage it, and not my headache (for now).
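A slightly more defensive version of that mapping, as a sketch: `mapBios` and the size check are my additions, while the protection and sharing flags are the same as above. The GBA BIOS is 16KB, so checking the file size catches a wrong or truncated dump before the guest ever sees it.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps a BIOS image after checking its size; returns nullptr on failure.
// Closing the fd afterwards is fine: the mapping keeps its own reference
// to the file.
void* mapBios(const char* path, std::size_t expectedSize) {
    int biosFd = open(path, O_RDONLY | O_CLOEXEC);
    if (biosFd < 0) return nullptr;
    struct stat info {};
    if (fstat(biosFd, &info) != 0 ||
        static_cast<std::size_t>(info.st_size) != expectedSize) {
        close(biosFd);
        return nullptr;
    }
    void* biosRom = mmap(nullptr, expectedSize, PROT_READ | PROT_EXEC,
                         MAP_SHARED, biosFd, 0);
    close(biosFd);
    return biosRom == MAP_FAILED ? nullptr : biosRom;
}
```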

Then I tried mapping the BIOS into the memory of the VM, and it segfaulted instantly. I changed some of the parameters to allow writes; then it didn’t segfault, but the mapping into the VM didn’t go through.

After looking into the documentation, here’s the mistake I was making:

struct kvm_userspace_memory_region memory_region = {
    .slot = 0,
    .guest_phys_addr = 0x02000000,
    .memory_size = ONBOARD_MEM_SIZE,
    .userspace_addr = (unsigned long long)gbaMemory->onboardMemory};

I was setting the slot to 0 for both the onboard memory (where I was putting my code) and the BIOS page. They are actually different regions of memory, and need to be registered as separate slots.
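A sketch of registering the two regions in separate slots. The constant and function names are hypothetical; the addresses and sizes are the GBA's (BIOS at 0x0, 16KB; on-board WRAM at 0x02000000, 256KB). Marking the BIOS slot `KVM_MEM_READONLY` (available when KVM reports `KVM_CAP_READONLY_MEM`) makes guest writes to it exit to userspace as MMIO instead of modifying the ROM, which is presumably the kind of read-only page write the week 6 test triggers.

```cpp
#include <cassert>
#include <cstdint>
#include <linux/kvm.h>
#include <sys/ioctl.h>

// GBA address-space constants (hypothetical names).
constexpr std::uint64_t kBiosBase = 0x00000000, kBiosSize = 0x4000;         // 16KB BIOS ROM
constexpr std::uint64_t kOnboardBase = 0x02000000, kOnboardSize = 0x40000;  // 256KB WRAM

// The BIOS gets its own slot and is read-only: guest writes exit as MMIO.
kvm_userspace_memory_region biosRegion(const void* biosRom) {
    kvm_userspace_memory_region region{};
    region.slot = 0;
    region.flags = KVM_MEM_READONLY;
    region.guest_phys_addr = kBiosBase;
    region.memory_size = kBiosSize;
    region.userspace_addr = reinterpret_cast<std::uint64_t>(biosRom);
    return region;
}

// The on-board WRAM uses a different slot and stays writable.
kvm_userspace_memory_region onboardRegion(void* onboardMemory) {
    kvm_userspace_memory_region region{};
    region.slot = 1;
    region.guest_phys_addr = kOnboardBase;
    region.memory_size = kOnboardSize;
    region.userspace_addr = reinterpret_cast<std::uint64_t>(onboardMemory);
    return region;
}

// Registers both regions with the VM; returns false if either ioctl fails.
bool registerRegions(int vmFd, const void* biosRom, void* onboardMemory) {
    kvm_userspace_memory_region bios = biosRegion(biosRom);
    kvm_userspace_memory_region onboard = onboardRegion(onboardMemory);
    return ioctl(vmFd, KVM_SET_USER_MEMORY_REGION, &bios) == 0 &&
           ioctl(vmFd, KVM_SET_USER_MEMORY_REGION, &onboard) == 0;
}
```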

C++

I’m not used to C (or KVM, or any of this) and it shows, so it’s probably a good idea to shift to C++ while I still can.

I wanted exceptions to handle unexpected failures, but I also didn’t want to manage releasing resources by hand, a job better left to the compiler. C++ seemed like the better fit. I removed all the GOTOs, moved over to C++, and finally added an exception type that helped me keep my sanity.

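#include <exception>
#include <string>
#include <utility>
class InitializationError : public std::exception {
    private:
    std::string message;
    public:
    explicit InitializationError(std::string message) : message(std::move(message)) {}
    // Report the stored message instead of std::exception's generic text.
    const char* what() const noexcept override { return message.c_str(); }
};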

Being able to use exceptions along with constructors is useful. I’m aware that exceptions carry a performance cost (mostly when one is actually thrown), but it should be fine as long as I spend minimal time processing in my code (the kernel runs the guest VM, not my code).
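A minimal sketch of the resource-releasing side of this (RAII), under my own naming: `InitError` is a hypothetical stand-in for the `InitializationError` above, built on `std::runtime_error` so it carries its message, and `FileDescriptor` is a hypothetical wrapper that owns one file descriptor and closes it automatically.

```cpp
#include <cassert>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Hypothetical stand-in for the post's InitializationError.
class InitError : public std::runtime_error {
  public:
    using std::runtime_error::runtime_error;
};

// Owns one file descriptor and closes it in the destructor, so an exception
// thrown later in some other constructor can't leak it.
class FileDescriptor {
  public:
    FileDescriptor(const char* path, int flags) : fd(open(path, flags | O_CLOEXEC)) {
        if (fd < 0) throw InitError(std::string("could not open ") + path);
    }
    ~FileDescriptor() {
        if (fd >= 0) close(fd);
    }
    FileDescriptor(const FileDescriptor&) = delete;  // exactly one owner
    FileDescriptor& operator=(const FileDescriptor&) = delete;
    int get() const { return fd; }

  private:
    int fd;
};
```

In a class holding `FileDescriptor` members for `/dev/kvm`, the VM and the vCPU, if creating the vCPU throws, the members that were already constructed are destroyed in reverse order, so there is no GOTO-style cleanup path to write.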

What’s next

The registers are still seemingly useful and garbage at the same time, but we shall see. Next week will mostly be travel and continued refactoring while I try to learn more, so next week will be week 5 essentially.


advpi – Week 2 – Initialization and Code

This week covers three things:

  1. Checking out KVM and working with it on x64
  2. Setting up the Raspberry PI workspace
  3. Hello world on the RPI.

KVM – x64

While searching the interwebs, I followed this (amazing?) tutorial that explains how the KVM API for Linux works.

I created a C project and followed along, and this is the basic breakdown:

  1. Open the KVM API.
  2. Create a VM
  3. Query and assert that your device supports all the APIs you will call.
  4. Create a vCPU (The thing that will run your code)
  5. Create memory mappings
  6. Put some code in, set the vCPU registers
  7. Run the vCPU
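Steps 1 to 4 can be sketched with the ioctls from `<linux/kvm.h>` as below. `KvmHandles`, `openKvm` and `closeHandles` are hypothetical names, and steps 5 to 7 (memory, registers, running) are left out. `KVM_GET_VCPU_MMAP_SIZE` gives the size of the `struct kvm_run` area that is later mmap-ed from the vCPU fd to read exit reasons.

```cpp
#include <cassert>
#include <fcntl.h>
#include <initializer_list>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

// File descriptors for the pieces created in steps 1-4.
struct KvmHandles {
    int kvm = -1, vm = -1, vcpu = -1;
    int runSize = -1;  // bytes to mmap from the vCPU fd to reach struct kvm_run
};

void closeHandles(KvmHandles& h) {
    for (int* fd : {&h.vcpu, &h.vm, &h.kvm})
        if (*fd >= 0) { close(*fd); *fd = -1; }
}

// Open the API, create a VM, check a capability, create vCPU 0.
// Returns false, with everything closed again, if any step fails.
bool openKvm(KvmHandles& h) {
    h.kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    bool ok = h.kvm >= 0 &&
              ioctl(h.kvm, KVM_GET_API_VERSION, 0) == KVM_API_VERSION &&
              (h.vm = ioctl(h.kvm, KVM_CREATE_VM, 0)) >= 0 &&
              ioctl(h.kvm, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY) > 0 &&
              (h.vcpu = ioctl(h.vm, KVM_CREATE_VCPU, 0)) >= 0 &&
              (h.runSize = ioctl(h.kvm, KVM_GET_VCPU_MMAP_SIZE, 0)) > 0;
    if (!ok) closeHandles(h);
    return ok;
}
```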

What did I make it run? 2+2=4 of course. Here’s the program.

const uint8_t code[] = {
    0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
    0x00, 0xd8,       /* add %bl, %al */
    0x04, '0',        /* add $'0', %al */
    0xee,             /* out %al, (%dx) */
    0xb0, '\n',       /* mov $'\n', %al */
    0xee,             /* out %al, (%dx) */
    0xf4,             /* hlt */
};

Here, the sum of registers al and bl (2 and 2) is turned into an ASCII digit and written to the serial port at 0x3f8, followed by a newline.
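To make the byte listing less opaque, here is a toy interpreter for exactly the six instruction forms in this program. It is an illustration, not an x86 emulator; `runToyProgram` is a hypothetical name, and it expects a complete program ending in hlt. With al = 2 and bl = 2, the add leaves 4 in al, adding '0' (0x30) turns it into the character '4', and the two out instructions send "4\n" to port 0x3f8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Runs the 2+2 program and returns what was written to port 0x3f8.
std::string runToyProgram(const std::uint8_t* code, std::size_t length,
                          std::uint8_t al, std::uint8_t bl) {
    std::uint16_t dx = 0;
    std::string serial;
    for (std::size_t ip = 0; ip < length;) {
        switch (code[ip]) {
            case 0xba:  // mov $imm16, %dx (immediate is little-endian)
                dx = static_cast<std::uint16_t>(code[ip + 1] | code[ip + 2] << 8);
                ip += 3;
                break;
            case 0x00:  // add %bl, %al (ModRM byte 0xd8 selects al <- al + bl)
                assert(code[ip + 1] == 0xd8);
                al = static_cast<std::uint8_t>(al + bl);
                ip += 2;
                break;
            case 0x04:  // add $imm8, %al
                al = static_cast<std::uint8_t>(al + code[ip + 1]);
                ip += 2;
                break;
            case 0xb0:  // mov $imm8, %al
                al = code[ip + 1];
                ip += 2;
                break;
            case 0xee:  // out %al, (%dx)
                if (dx == 0x3f8) serial.push_back(static_cast<char>(al));
                ip += 1;
                break;
            case 0xf4:  // hlt
            default:    // anything else: stop the toy machine
                return serial;
        }
    }
    return serial;
}
```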

The fun part is that in KVM, port output causes a special exit (KVM_EXIT_IO), and KVM hands control back to us.

switch (vcpuKvmRun->exit_reason) {
    case KVM_EXIT_IO:
        if (vcpuKvmRun->io.direction == KVM_EXIT_IO_OUT &&
            vcpuKvmRun->io.size == 1 && vcpuKvmRun->io.port == 0x3f8 &&
            vcpuKvmRun->io.count == 1)
            putchar(*(((char*)vcpuKvmRun) + vcpuKvmRun->io.data_offset));
...

In short, we “emulate” a serial output device by printing whatever the VM wants printed onto the console.
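The same check as a self-contained function, as a sketch: `handleSerialExit` is a hypothetical name. It returns the character written (or -1 when the exit isn't a one-byte write to 0x3f8) so the caller can decide what to do with other exits. `io.data_offset` is relative to the start of the mmap-ed `kvm_run` area, which is why the pointer arithmetic starts from `run`.

```cpp
#include <cassert>
#include <cstdio>
#include <linux/kvm.h>

// Handles one KVM_EXIT_IO exit for the fake serial port at 0x3f8.
int handleSerialExit(const kvm_run* run) {
    if (run->exit_reason != KVM_EXIT_IO || run->io.direction != KVM_EXIT_IO_OUT ||
        run->io.size != 1 || run->io.port != 0x3f8 || run->io.count != 1)
        return -1;
    // The data byte lives inside the shared kvm_run area itself.
    const char* data = reinterpret_cast<const char*>(run) + run->io.data_offset;
    std::putchar(*data);
    return static_cast<unsigned char>(*data);
}
```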

Cool, it works. We’re ready to start porting this onto the RPI.

Wait, how do I work with a Raspberry Pi Zero 2 W?

Device setup

I had just bought this device. So let’s see what we need.

  1. Some way to work with the project.
  2. Running and testing on the RPI.

To run the project, I thought the easiest way would be cross-compiling. So let’s get cross-compiling set up in Fedora.

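And checking on Google, there’s a gcc arm64 cross-compiler package. But its description says it only supports building kernels, not userspace programs.

Well, that’s a no then. Fedora’s cross-compiler can’t build actual userspace executables for ARM64 on x64 machines (at least not without setting up a separate ARM64 sysroot).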

Then, let’s run it on the PI itself.

So, I decided that some kind of remote development setup would be good. I tried VSCode, and the poor Pi Zero 2 W collapsed under the weight of the remote server.

Well, then Zed comes to the rescue. With a bit of tweaking, I connect to the Pi over Tailscale and open the project in Zed. Perfect.

I’m missing out on debugging, which will make it a pain, but that’s okay. Been through worse. Maybe it’ll finally push me to learn gdb.

Getting to run the Thing

Once that’s in and all things are done, we get to the first bit – compile errors all over the place. Some things are wrong.

Since the KVM API is different on ARM64, the register names and structures are different too, so it won’t compile.

KVM_GET_REGS and KVM_SET_REGS (the get/set-all-registers calls) aren’t supported on ARM64. Instead, you use KVM_GET_ONE_REG and KVM_SET_ONE_REG, one register at a time.
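A sketch of how the register IDs for `KVM_GET_ONE_REG` are built on arm64, based on my reading of the register table in the KVM API documentation: core registers are encoded as 0x6030 0000 0010 plus the register's byte offset in `struct kvm_regs` divided by 4. The helper names are hypothetical. If I read the docs right, for a 32-bit guest r0 to r12 show up in `regs.regs[0..12]`, while the AArch32 PC is reported through the separate `pc` field rather than `regs[15]`, which matters when dumping registers in a loop.

```cpp
#include <cassert>
#include <cstdint>
#include <linux/kvm.h>
#include <sys/ioctl.h>

// 64-bit core register base: KVM_REG_ARM64 | KVM_REG_SIZE_U64 | KVM_REG_ARM_CORE.
constexpr std::uint64_t kArm64CoreU64 = 0x6030000000100000ULL;

// regs.regs[n] for n < 31: each x register is 8 bytes, i.e. 2 units of 4 bytes.
constexpr std::uint64_t coreRegisterId(unsigned n) { return kArm64CoreU64 | (n * 2); }

// regs.pc comes right after regs[30] and sp, at byte offset 256 (256 / 4 = 0x40).
constexpr std::uint64_t kPcId = kArm64CoreU64 | 0x40;

// Reads one 64-bit register with KVM_GET_ONE_REG; returns false on error.
bool getOneReg(int vcpuFd, std::uint64_t id, std::uint64_t& value) {
    kvm_one_reg reg{};
    reg.id = id;
    reg.addr = reinterpret_cast<std::uint64_t>(&value);
    return ioctl(vcpuFd, KVM_GET_ONE_REG, &reg) == 0;
}
```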

Okay, all fixed.

There are still a few things to fix: namely, starting the CPU in 32-bit mode, and setting the initial PC and code.

Starting in 32 bit

The Cortex-A53 starts in 64-bit mode as it is… a 64-bit processor. How do I instruct it to start in 32-bit mode? Rummaging through the KVM docs (Ctrl+F for “32-bit”), it turns out that before starting the CPU, I also need to initialize it with a set of feature flags.

Specifically, we need to do this bit:

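struct kvm_vcpu_init cpuInit = {};
// Ask KVM which CPU target it prefers on this host; this fills cpuInit.target.
// Like the other ioctls, it returns 0 on success and -1 on error.
int preferredTarget = ioctl(vmFd, KVM_ARM_PREFERRED_TARGET, &cpuInit);
// Request that the vCPU start in 32-bit (AArch32) EL1.
cpuInit.features[0] |= 1 << KVM_ARM_VCPU_EL1_32BIT;
int vcpuInitResult = ioctl(this->fd, KVM_ARM_VCPU_INIT, &cpuInit);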

Why is it features[0]? After spending way too much time trying to find out, I’m not sure I want to know. The initialization also offers a bunch of optional features that could be useful later, but I don’t need them as of now.
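For what it's worth, `features` is declared as an array of seven 32-bit words that together form one feature bitmap, so feature bit n lives in `features[n / 32]` at bit `n % 32`. `KVM_ARM_VCPU_EL1_32BIT` is a small bit number, so it lands in the first word. A sketch of that indexing, using a `std::array` stand-in for `kvm_vcpu_init::features` (`setFeature` is a hypothetical helper):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Sets feature bit `bit` in a multi-word bitmap laid out like
// kvm_vcpu_init::features (seven 32-bit words).
void setFeature(std::array<std::uint32_t, 7>& features, unsigned bit) {
    assert(bit < features.size() * 32);
    features[bit / 32] |= 1u << (bit % 32);
}
```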

Now that the CPU is ready, something needs to be there to execute code.

Executable code

The days of writing code in assembly are gone (for the regular software engineer), but now we need it again. Regular assemblers are too complex (for me, as of now) for generating only the tiny snippets I need. This awesome online tool, “Online ARM to HEX Converter”, does it all in the browser. I put in a small piece of code; let’s see what it does:

Wonderful – something. So I put this in an array and hope it works.

After all this, it still didn’t work (hint: endianness).

I got an MMIO exit, but was confused about what it meant, so I added this hideous snippet:

for (int r = 0; r < 16; r++){
  printf("Got Registers: %d(%ld)\n", r,
    getRegisterValue(gameboyKvmVM.vcpuFd, r));
}

We get this sad output, the program hangs, and I have no idea what’s going on:

Hello, from advpi!
Attempted mmio
Got Registers: 0(0)
Got Registers: 1(5000)
Got Registers: 2(6000)
Got Registers: 3(0)
Got Registers: 4(25)
Got Registers: 5(0)
Got Registers: 6(4108)
Got Registers: 7(0)
Got Registers: 8(0)
Got Registers: 9(0)
Got Registers: 10(0)
Got Registers: 11(0)
Got Registers: 12(0)
Got Registers: 13(0)
Got Registers: 14(0)
Got Registers: 15(0)

So the registers were loaded correctly, and the code here was executed! But then why is R15 0? And the next LDR also didn’t generate an MMIO exit. Strange.

What’s next

As the week comes to a close, I’m getting a bit annoyed at managing resources by hand, and would like (at least a bit of) help from the language. I’ve been coding in a “write the code, then pray it works” style, and have been ignoring cleanup, refactoring and rereading the code. Let’s take the time to do that and fix any issues. Hopefully, by then, I’ll have fixed the faults in the program just by cleaning up, or at least learned enough to fix them.


advpi – Week 1 – Intro

Why?

I recently bought a Nintendo DS and a Raspberry Pi. Turns out the Nintendo DS can run Game Boy Advance (GBA) games, because it includes the same ARM7TDMI CPU that the GBA uses.

NDS and a RPI Zero 2W

Well, the Raspberry Pi isn’t too far off either. Does it mean I can just run Game Boy Advance games on the Pi too?

Well, that’s what this series will go to show. This week is about understanding feasibility and what I will be using (and why). The later weeks should explore things in a bit more technical detail.

Can it GameBoy Advance?

Let’s see what we need to run it on the Raspberry Pi Zero 2 W. I’ve chosen it because it’s very small, but runs essentially the same quad-core Cortex-A53 chip as the Raspberry Pi 3.

This beautiful article “Game Boy Advance Architecture” by Rodrigo Copetti explains what’s required for a GBA, but, let’s break it down and compare it to the RPI Zero 2 W.

Also, let’s think about the weird stuff like Display, Audio and Input later. Being able to run stuff is more important.

GBA Architecture – Rodrigo Copetti

So, let’s go by this order:

  1. CPU
  2. Memory
  3. I/O

CPU

The Zero 2 W has a quad-core Cortex-A53 CPU, whereas the GBA used an ARM7TDMI. The A53 might be ancient, but the ARM7TDMI is comparatively prehistoric.

But delving into the A53 Technical Reference Manual, it still supports the 32-bit AArch32 execution state, and AArch32’s ARM (A32) and Thumb (T32) instruction sets are direct descendants of the ones the older GBA CPU used.

Awesome, so the Pi’s CPU should be able to run (nearly) all of these instructions natively.

Memory
Memory – Rodrigo Copetti

Oof, it gets complicated here. There are a few things:

  1. The AGB RAMs – basically two different RAM chips: a smaller (32KB), faster work RAM, and a larger (96KB) one used as video RAM (VRAM).
  2. The even slower (but massive, 256KB) EWRAM
  3. GamePAK memory – the Game Data

All of these items are laid out in the memory address space for the GBA CPU to access.

The GBA CPU accesses data through 32-bit addresses, and there’s a full address-layout map for fetching and storing data into these addresses.

Here’s the memory map from the no$gba documentation. (Ignore the Display Memory and I/O registers for now)

Intuitively, what this means is when the CPU tries to load data from 0x2000000, it loads the first byte of data from On-board WRAM. If it tries to use 0x3000000, it takes the first byte of data from the On-Chip Work RAM.
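A sketch of that lookup, with the main regions written down as a table. The ranges are from the GBATEK (no$gba) memory map as I remember it; mirrors, unused gaps, the extra wait-state copies of the Game Pak ROM and the Game Pak SRAM are left out, and `Region`/`regionFor` are hypothetical names. This is the mapping a VM has to reproduce with memory slots or MMIO handling.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

struct Region {
    std::string_view name;
    std::uint32_t start;
    std::uint32_t size;
};

constexpr Region kGbaMap[] = {
    {"BIOS", 0x00000000, 0x4000},             // 16KB system ROM
    {"On-board WRAM", 0x02000000, 0x40000},   // 256KB
    {"On-chip WRAM", 0x03000000, 0x8000},     // 32KB
    {"I/O registers", 0x04000000, 0x400},
    {"Palette RAM", 0x05000000, 0x400},       // 1KB
    {"VRAM", 0x06000000, 0x18000},            // 96KB
    {"OAM", 0x07000000, 0x400},               // 1KB
    {"Game Pak ROM", 0x08000000, 0x2000000},  // up to 32MB
};

// Returns the region containing `address`, or nullptr if it is unmapped here.
const Region* regionFor(std::uint32_t address) {
    for (const Region& region : kGbaMap)
        if (address - region.start < region.size)  // unsigned wrap rejects address < start
            return &region;
    return nullptr;
}
```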

The Pi, unfortunately, has none of these mappings. It does, however, have 512MB of RAM, which runs circles around what the GBA has for memory. Perhaps there’s a way to create or simulate these mappings for the CPU to use?

I/O

We ignored the display and I/O memory mapping earlier, so let’s discuss it now. The display and the I/O are all memory-mapped: the hardware is controlled and read by loading from and storing to specific addresses.

The Pi, again, has none of this I/O. But it does have a lot of GPIO pins and display capabilities, so perhaps they can be “massaged” into the right shape and plugged into the GBA memory layout?

The initial hypothesis – Virtualization

Well, we had three findings:

  • The GBA and RPI can both execute ARM code
  • They differ in IO.
  • They differ in memory mapping.

Well, in these cases, people usually use virtualization. Hmm… That seems awfully inefficient. I might as well use an existing emulator then, which also replicates the behaviour of the CPU. Although…

Hardware-Assisted Virtualization

Hardware-assisted virtualization is a feature set of newer CPUs, where the CPU can run guest code directly, skipping the need to emulate the CPU in software. Intel provides VT-x, AMD provides AMD-V, and ARM provides… well, something similar (its Virtualization Extensions, with the hypervisor running at EL2). Let’s see how to use it in practi-

Terror

Looking into how to use it, I found Trapping and Emulation of Instructions, and ARM Virtualization Overview.

How I felt reading those documents

No, I can’t understand any of that. Understandable.

Looking online at how it works, I remembered (or found) this neat thing: KVM. Linux takes all these features and abstracts the different architectures behind one API, the KVM API. This allows Linux hosts to run guest machines with hardware-accelerated execution.

KVM allows setting up memory regions and device specifications, and runs code on a virtual CPU backed by the CPU’s virtualization extensions. Any device that needs virtualizing? KVM gives control back to you to implement it yourself.

Even better, the Pi Zero 2 W supports KVM.

$ ls -l /dev/kvm

crw-rw---- 1 root kvm 10, 232 Dec 7 19:29 /dev/kvm

The Final Problem Statement

So, let’s distill all of this into two questions:

  1. Can the GBA be run as a virtual machine on the Raspberry PI Zero 2 W?
  2. Is there any benefit to virtualization, instead of complete emulation?

Thus, the project advpi is planned to run GBA code on the PI Zero 2 W, by using a virtual machine.

The IO will need to be adapted and emulated by the software as well.

The display can be shown in a window on-screen, and the sound can be played on the Pi, as it has sound capabilities as well (given a speaker is attached to it).

Anything else will be figured out as it comes. Let’s see how that goes.

What’s next week?

  • Setting up a workspace on a RPI.
  • Testing the KVM API to run some x86 code, as it is well-documented.