Table of contents
- The naive approach
- Lifting the
- Finding our stack
- Link it up
- Debugging, but for real this time
- What’s next
- Extra credit
Welcome to the second post in the RISC-V from scratch series! As a quick recap, throughout RISC-V from scratch we will explore various low-level concepts (compilation and linking, primitive runtimes, assembly, and more), typically through the lens of RISC-V and its ecosystem. In the first post of this series, we introduced RISC-V, explained why it’s important, set up the full GNU RISC-V toolchain, and built and ran a simple program on an emulated version of a RISC-V processor with the help of SiFive’s freedom-e-sdk.
freedom-e-sdk made it trivial for us to compile, debug, and run any C program on an emulated or physical RISC-V processor. We didn’t have to worry about setting up any linker scripts or writing a runtime that sets up our stack, calls into
main, and more. This is great if you’re looking to quickly become productive, but these details are exactly the sort of thing we want to learn about!
In this post, we’ll break free from the
freedom-e-sdk. We’ll write and attempt to debug a simple C program of our own, unveil the magic hidden behind
main, and examine the hardware layout of a
qemu virtual machine. We’ll then examine and modify a linker script, write our own C runtime to get our program set up and running, and finally invoke GDB and step through our program.
If you missed the previous post in this series and don’t have
riscv-qemu and the RISC-V toolchain installed and were hoping to follow along, jump to the “QEMU and RISC-V toolchain setup” section (or in RISC-V assembly,
jal x0, qemu_and_toolchain_setup) and complete that before moving on.
The naive approach
Let’s start our journey with a simple C program that infinitely adds two numbers together.
We want to run this program, and the first step on that path is compiling it into something executable by our RISC-V CPU.
This produces a file called
a.out, which is the default name
gcc gives executables when we don’t tell it what we want them to be called. We can now run this brand new executable inside
We’ve chosen the
virt RISC-V machine, which is one
riscv-qemu comes with out of the box.
Now that our program is running inside QEMU with a GDB server on host
localhost and port
1234, let’s connect to it with our RISC-V GDB client from a separate terminal:
And we should now find ourselves inside GDB!
While we could now try to tell GDB to
a.out executable it’s currently pointed at, this won’t work, and for good reason. We used
riscv64-unknown-elf-gcc to compile our program, so unless our host machine is running a
riscv64 CPU it won’t know what to do with a program compiled for that target.
All is lost, right? Of course not! This situation is one major reason for GDB’s client-server model. We can take our
riscv64-unknown-elf-gdb executable which knows how to debug
riscv64 targets and point it at some remote target (a GDB server) instead of running the program on our host machine. As you may recall, we just started
riscv-qemu and told it to start a GDB server on host
localhost and port
1234. Connecting to this server is as easy this:
And now we should be able to set some breakpoints:
And finally tell GDB to
continue with shorthand
c until we hit a breakpoint:
You’ll quickly notice that we just keep continuing, and continuing, and continuing…shouldn’t we hit our
b 5 breakpoint instantly, since we’re simply looping indefinitely on that line? What’s going wrong?
Let’s take a look at our interface, which for me looked like:
There are several red flags here:
- GDB’s TUI is unable to find a source. This should be displaying our code and any nearby breakpoints.
- GDB doesn’t know the current line of execution (
L??), and lists a program counter of 0x0 (
- Some clobbered text in our prompt area that looks like this in its full form:
0x0000000000000000 in ?? ()
These indicators, in combination with the fact that we never hit a breakpoint, signals we have done something wrong. But what is it?
To figure out what’s going on here, we need to take a detour and talk about how our simple C program actually works underneath the surface. We have a function called
main that does our simple addition, but what is
main, really? Why must it be called
main and not
entry? Conventionally we know that all executables start running at
main, but what magic occurs to make this happen?
To answer these questions, let’s re-run our GCC command with the
-v flag to get a more verbose output of what it is actually doing.
There’s quite a lot we get back so we won’t look through it all. The first important thing of note is that even though GCC stands for “GNU C Compiler”,
gcc also by default links our code in addition to compiling and assembling it (
-c tells GCC only to compile and assemble). Why is this relevant? Well, take a look at this snippet pulled from our verbose
I realize that even in a shortened form this is still a lot to look at, so let me explain what’s going on here. On our first line,
gcc is executing a program called
collect2, passing along arguments such as
crtend.o, and flags like
--start-group. We can read about what collect2 is here - in short,
collect2 arranges various initialization functions at start time by making one or more linking passes.
Knowing this, we see that GCC is linking multiple different
crt object files with the code we wrote. As you might guess,
crt stands for “C runtime”. You can read in detail what each crt is for here, but in our case we care the most about
crt0, which has one very important job:
This object [crt0] is expected to contain the
_startsymbol, which takes care of bootstrapping the initial execution of the program.
What exactly this bootstrapping of initial execution is depends on the platform in question, but generally it includes important tasks such as setting up the stack frame, passing along command line arguments, and calling into
main. Yes, we have finally answered the question posed at the beginning of this section - it is
_start who calls into our
Finding our stack
We’ve solved one mystery, but you might be wondering how this gets us any closer to our original goal of being able to step through our simple C program with
gdb. There are a few problems we have left to address, but the first we have has to do with the way
crt0 is setting up our stack.
As we saw above,
gcc links a default
crt0 unless told to do otherwise. This default
crt0 is selected based on several factors, namely:
- Target triplet, which follows the structure of
machine-vendor-operatingsystem. For us, this is
- Target ISA,
- Target ABI,
This may work fine in a general case, but is undoubtedly not going to work for every RISC-V processor. As mentioned previously, one of
crt0s jobs is to set up the stack, but how can it do that if it doesn’t know where the stack should be for the CPU (
-machine) we’re running against? Answer: it can’t, at least not without us giving it a bit of assistance.
Circling back to the
qemu command we ran at the beginning of this post (
qemu-system-riscv64 -machine virt -m 128M -gdb tcp::1234 -kernel a.out), recall we were using the
virt machine. Fortunately for us,
qemu exposes a simple way to dump information about a machine in
dtb (device tree blob) format.
dtb format is difficult to read considering it’s mostly binary, but there is a command-line tool called
dtc (device tree compiler) that can convert it into something more human-readable.
This gives us a file called
riscv64-virt.dts, which has lots of interesting information about
virt such as the number of available CPU cores, the memory location of various peripherals such as the UART, and the memory location of the onboard memory (RAM). We want our stack to live inside this memory, so let’s
grep for it:
We see that the
device_type for this node is “memory”, which means we’ve probably found what we’re looking for. Using the values inside
reg = <...>;, we can also determine where this memory bank starts and how long it is.
Referencing the devicetree specification, we see that the syntax for
reg is an arbitrary number of
(base_address, length) pairs. However, there are four values inside
reg - shouldn’t there only be two necessary to define our singular memory bank?
Again referencing the devicetree specification (search for “Property name: reg”), we learn that the number of
<u32> cells required to specify the address and length is determined by the
#size-cells properties in the parent of node (or in the node itself). These values aren’t specified in our
memory node, and the parent of the
memory node is simply the root portion of the file, so let’s look there for these values:
And there we have it - it takes two 32-bit values (cells) to specify an address, and two 32-bit values to specify length. This means, given
reg = <0x00 0x80000000 0x00 0x8000000>;, our memory begins at
0x00 + 0x80000000 (
0x80000000) and extends
0x8000000) bytes, meaning it ends at
0x88000000. In more human-friendly terms, we can use a hexadecimal calculator to determine that our length of
0x8000000 bytes is 128 megabytes.
Link it up
dtc, we’ve successfully discovered where the RAM lives and how long it extends in our
virt virtual machine. We also know that
gcc is linking a default
crt0 that isn’t setting up our stack the way we need it to. But what exactly do we do with this information, and how does it get us any closer to getting a running, debuggable program?
Well, since the default
crt0 isn’t doing what we need it to, we have one obvious choice: writing our own, and then linking it with the object file created from compiling our simple addition program. Our
crt0 will need to know where the top of the stack starts in order to properly initialize it. We could hardcode this value to
0x80000000 directly in our
crt0, but that isn’t a very maintainable solution. What happens when we want to use a different
qemu-lated CPU, such as the
sifive_e, that has different memory properties?
Fortunately for us, we are far from the first to ask this question, and a good solution exists. GNU’s linking program,
ld, provides a way for us to define a symbol which would be accessible from our
crt0. We can use this, among other functions provided by
ld, to create a
__stack_top symbol definition that is reasonably flexible across multiple different CPUs.
Rather than writing our own linker file from scratch, it is going to make more sense to take the default linker script that
ld uses and modify it slightly to expose any additional symbols we want. What is a linker script, you might be wondering? This snippet summarizes it well:
The main purpose of the linker script is to describe how the sections in the input files should be mapped into the output file, and to control the memory layout of the output file.
Knowing this, let’s copy the default linker script
riscv64-unknown-elf-ld uses into a new file:
There’s a lot of interesting information in this file, much more than we can review in this post. The
--verbose output of the command we just ran includes information about the version number of
ld, supported architectures, and more. This is all good to know, but not valid linker script syntax, so let’s remove it. Using the text editor of your choice, edit
riscv64-virt.ld and remove everything above and including the line of equals signs.
With that out of the way, the first thing we’ll want to do is make use of the MEMORY command to tell the linker we want to take manual control of the memory layout. This paves the way for us to be able to correctly define where
__stack_top should be. Search for the line that starts with
OUTPUT_ARCH(riscv), which should be towards the top of the file, and below it add our
With this, we’ve created a block of memory called
RAM that is readable (
r), writable (
w), and capable of containing executable code (
Great, so we’ve defined a memory layout to match the specifications of our
virt RISC-V machine. But this RAM memory block isn’t much use unless we actually do something with it. We want our stack to live in RAM, so let’s use this RAM definition to create
Defining this symbol is easy. Open the custom linker script (
riscv64-virt.ld) back up in the editor of your choice and make the following addition:
As you can see, we use the PROVIDE command to define a symbol called
__stack_top will be accessible from any program linked with this script (assuming the program itself does not also define something named
__stack_top). We set the value of
__stack_top to be
ORIGIN(RAM), which we know is
LENGTH(RAM), which we know is 128 megabytes (
0x8000000 bytes). This means our
__stack_top is set to
For brevity’s sake I won’t include the entire linker file here, but if you want the final product you can view/download it here: https://twilco.github.io/assets/ld/riscv64-virt.ld
We finally have all we need to create a custom C runtime that works for us, so let’s get started. What we need is actually very simple - here is
crt0.s in its entirety:
The first thing you may notice is that there are a lot of lines that begin with a
.. This is an assembly file, meaning the program that will be looking at this is the assembler, which in the GNU world is the
as executable. The lines that begin with
.s are called assembler directives, which provide information to the assembler rather than acting as executable code like our RISC-V assembly instructions, such as
With this knowledge in mind, let’s run through this file line-by-line. We’ll be working with various RISC-V standard registers, so check out this table for a good overview of each of them and their purpose.
Referencing the GNU ‘as’ manual, this line tells the assembler that we want the following code to go into a section named
.init that is
allocatable and e
.init section is another commonly followed convention for running your code within the confines of an operating system. We’re running on bare metal with no OS, so this may not be totally necessary in our case, but it’s good practice regardless.
.global is important because it makes the symbol following it available to
ld. Without this, linking would fail because we told it via the
ENTRY(_start) command in our linker script that it should look for the
_start symbol when establishing the entry point to our executable.
_start: tells the assembler we are beginning the definition of a symbol called
.cfi directives inform tools, such as the assembler or exception unwinder, about the structure of the frame and how to unwind it.
.cfi_endproc signal the start and end of a function, and
.cfi_undefined ra tells the assembler that register
ra should not be restored to whatever value it contained before
.option directives modify assembler behavior inline with assembly code, which is useful when particular instruction sequences must be assembled with a specific set of options. This link describes in good detail why this is important for the code segment above, so I’ll quote it directly:
…since we relax addressing sequences to shorter GP-relative sequences when possible, the initial load of GP must not be relaxed and should be emitted as something like:
in order to produce, after linker relaxation, the expected:
instead of just:
And now for the last portion of our
Here we finally make use the
__stack_top symbol we worked so tirelessly to create.
la, or the “load address” pseudoinstruction (which will be further explained below), loads the value of
__stack_top into the
sp (stack pointer) register, setting it up to be used for the remainder of our program.
add s0, sp, zero adds together the value of the
sp register with the value of the
zero register (which is actually the
x0 register, hardwired to 0), and finally places it into the
s0 is a special register in a few ways. First, it is what is known as a “saved register” meaning it is preserved across function calls. Second,
s0 sometimes acts as the frame pointer, which enables each function invocation to maintain it’s own little space on the stack for storing parameters passed into that function. How function calls work with the stack and frame pointers is a very interesting subject and could easily be a full-length post on it’s own, but for now just know that initializing our frame pointer
s0 is an important task for our runtime.
The next instruction we see is
jal zero, main.
jal stands for “
link”, and expects operands in the form of
jal rd (destination register), offset_address. Functionally,
jal writes the value of the next instruction (the
pc register plus four) to the
rd, and then sets the
pc register to the current value of
pc plus the sign-extended offset address, effectively “calling” that address.
As mentioned above,
x0 is hardwired to the literal value of
0, and writes to it have no effect. With this in mind, you may think it’s odd that we’re using the
zero register, which RISC-V assemblers interpret as the
x0 register, as our destination register, since this effectively creates an unconditional and side-effect free jump to
offset_address. Why do it this way, you may wonder…don’t other ISAs have an explicit unconditional jump instruction?
jal zero, offset_address pattern is actually a clever optimization enabled by the dedication of one whole register to a hard-wired zero. Each supported instruction means a larger, and therefore more expensive, processor, so the simpler the ISA the better. Rather than polluting the instruction space with both
unconditional jump instructions, the RISC-V ISA only calls for
jal, but through
jal zero, main supports unconditional jumps.
There are many, many similar optimizations in RISC-V, most taking the form of what are known as pseudoinstructions. Pseudoinstructions are instructions that assemblers know how to translate to other actual hardware-implemented instructions. For example, there is an unconditional jump pseudoinstruction
j offset_address, which RISC-V assemblers translate to
jal zero, offset_address. For a full list of officially supported pseudoinstructions, search for pseudoinstruction here in v2.2 of the RISC-V spec.
Our very last line is an assembler directive,
.end, which simply marks the end of the assembly file.
Debugging, but for real this time
To recap, we’ve worked through many problems in our quest of debugging a simple C program on a RISC-V processor. We first used
dtc to find where our memory was located in the
virt virtual RISC-V machine. We then used this information to take manual control of the memory layout in our customized version of the default
riscv64-unknown-elf-ld linker script, which then enabled us to accurately define a
__stack_top symbol. We finished by using this symbol in our own custom
crt0.s that set up our stack and global pointers and finally called the
main function. Let’s make use of all this work to complete our original goal of debugging our simple C program in GDB.
As a reminder, here was our program:
And now to compile and link:
You’ll notice we have specified a lot more flags than we did last time, so let’s walk through all the ones we didn’t cover in the first section.
-ffreestanding tells the compiler that the standard library may not exist, and therefore not to make assumptions that it will be there. This option isn’t necessary when running your application in a hosted environment (within an OS), but we aren’t doing that, so it’s important to tell the compiler that information.
-Wl is a comma-separated list of flags to pass on to the linker (
--gc-sections stands for “garbage collect sections”, and tells
ld to remove unused sections post-link.
-nodefaultlibs respectively tell the linker not to link in any standard system startup files (such as the default
crt0), any standard system stdlib implementation, or any standard system default linkable libraries. We are providing our own
crt0 and linker script, so it’s important to pass these flags to inform the compiler we don’t want any of these defaults to avoid conflict with our custom setup.
-T allows you to specify the path to your linker script, which in our case is simply
riscv64-virt.ld. Finally, we specify the files we wish to compile, assemble, and link:
add.c. As with before, this all results in a fully-fledged and ready-to-run executable called
We’ll now start our shiny new executable in
And then start
gdb, making sure to load the debug symbols for
a.out by specifying it as our last argument:
Next, we’ll connect our
gdb client to the
gdb server we started as part of our
Set a breakpoint in
And start execution of the program:
You’ll notice from the above output that we have successfully hit a breakpoint on line 2! Our text interface also shows this via the
B+> gutter annotation, and we finally have a proper line
PC: value -
PC: 0x8000001e. If you’ve been following along, yours might look something like this:
From here we can use
gdb as normal -
s to step to the next instruction,
info all-registers to inspect the values inside our registers as our program executes, so on and so forth. Experiment to your hearts content…we certainly worked hard enough to get here!
We accomplished, and hopefully learned, a lot today! I’ve never had a formal plan for this series, instead simply following whatever is most interesting to me at each moment, so I’m not sure exactly sure what will come next. I particularly enjoyed the deep dive we took into the
jal instruction, so perhaps in our next post we’ll build upon the foundation we created here but instead replace
add.c with some pure RISC-V assembly program. If you have something in particular you’d like to see or have any questions about what we covered here, let me know by opening an issue at https://github.com/twilco/twilco.github.io/issues. After the next post in the series is complete I’ll link to it below.
Thanks for reading, and hope to see you in the next post!
If you enjoyed this post and want event more, Matt Godbolt gave a presentation titled “The Bits Between the Bits: How We Get to main()” at CppCon2018 that approaches this subject from a few different angles than we took here in this post. If you’ve worked through the entirety of this post you will definitely recognize some of the things he covers. It’s a good talk, so check it out!